SQLite Forum

Parse sortable date from "Sat Apr 22 1:01:51.887"
Login

Parse sortable date from "Sat Apr 22 1:01:51.887"

(1) By BB1234 on 2023-04-26 07:15:01 [link] [source]

I have a SQLite 3.38.2 table that has XML tags in a column:

with cte(xml_tag) as (values 
  ('<Event time="Sat Apr 22 1:01:51.887" type="Debug" thread="2164: Main CIM worker thread" elapsed="1" function="Geodatabase.Cursor" code="EndCursor">'), 
  ('<Event time="Mon Feb 22 1:01:51.883" type="Debug" thread="2164: Main CIM worker thread" elapsed="23" function="Geodatabase.Cursor" code="EndCursor">'), 
  ('<Event time="Sun Jan 22 1:01:51.874" type="Debug" thread="2164: Main CIM worker thread" elapsed="456" function="Geodatabase.Cursor" code="EndCursor">'), 
  ('<Event time="Sat Apr 22 1:01:51.846" type="Debug" thread="2164: Main CIM worker thread" elapsed="7890" function="Geodatabase.Cursor" code="EndCursor">'))
select * from cte

db<>fiddle

I want to extract the values from the EVENT TIME tag as sortable dates. Something like this:

EVENT TIME
----------------------
2023-04-22 1:01:51.887
2023-02-22 1:01:51.883
2023-01-22 1:01:51.874
2023-04-22 1:01:51.846

I'm flexible on the date format, as long as the dates can be sorted ascending. It's assumed that the year is the current year since the year isn't specified in the source data.

How can I extract the values from the EVENT TIME tag using a SQL query?

(2) By Phil G (phil_g) on 2023-04-26 09:16:23 in reply to 1 [link] [source]

Something like this, perhaps (but test it with more actual data before assuming it will always be right just because it handles this small sample)…

WITH cte(xml_tag) AS (VALUES 
  ('<Event time="Sat Apr 22 1:01:51.887" type="Debug" thread="2164: Main CIM worker thread" elapsed="1" function="Geodatabase.Cursor" code="EndCursor">'), 
  ('<Event time="Mon Feb 22 1:01:51.883" type="Debug" thread="2164: Main CIM worker thread" elapsed="23" function="Geodatabase.Cursor" code="EndCursor">'), 
  ('<Event time="Sun Jan 22 1:01:51.874" type="Debug" thread="2164: Main CIM worker thread" elapsed="456" function="Geodatabase.Cursor" code="EndCursor">'), 
  ('<Event time="Sat Apr 22 12:01:51.846" type="Debug" thread="2164: Main CIM worker thread" elapsed="7890" function="Geodatabase.Cursor" code="EndCursor">'))
SELECT STRFTIME('%Y') || '-' || CASE SUBSTR(xml_tag, 18, 3)
    WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03' WHEN 'Apr' THEN '04'
    WHEN 'May' THEN '05' WHEN 'Jun' THEN '06' WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08'
    WHEN 'Sep' THEN '09' WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11' WHEN 'Dec' THEN '12' ELSE '??' END
  || '-' || SUBSTR(xml_tag, 22, 3)
  || CASE SUBSTR(xml_tag, 36, 1) WHEN '"' THEN '0' || SUBSTR(xml_tag, 25, 11) ELSE SUBSTR(xml_tag, 25, 12) END
  AS event_time, xml_tag FROM cte;

A few points to note here:

I've taken your assumption "that the year is the current year" literally, and asked SQLite to provide the current year with STRFTIME('%Y') - replace that with a literal string or something else if you'd rather determine "current year" differently (perhaps based on the date you obtained the data?)

I'm assuming the month name will always be in English, and always 3 characters long. If that's not true, this will break.

I've taken the liberty of modifying the time in the last entry of your example, to ensure that this will also work with double-digit hours (since your input format doesn't seem to include a leading zero on single-digit hours to ensure consistent length of the time string). It's not particularly robust, just a simple check to see if a double-quote appears in a specific position and using that to assume the length of the string.

If the day format also omits leading zeroes for single-digit days (it's impossible to tell from your example, since all seem to be on the 22nd), then a similar check and adaption would be needed there (which then has a knock-on effect of changing the SUBSTR offsets for the time portion, and further complicates things)… for relative simplicity (and mostly through laziness) I've ignored that possibility here.

All of this relies on substring extraction and will break unless the input string is always in exactly the same format as you've shown. In particular, XML would allow you to change the order of attributes (perhaps 'type' could come before 'time', etc) without changing the meaning, but doing so would break this SQL parsing of it, since it relies on the date/time string being in exactly the expected position in the overall string. If the input XML may have its attributes in a different order, then you'll need a different way to extract just the date/time string before you format it using something like this.


Of course, the better solution would be to adapt whatever is generating this XML (if that's possible), so it will provide the date/time in a more useful format in the first place!

(3.1) By Phil G (phil_g) on 2023-04-26 10:07:56 edited from 3.0 in reply to 2 [link] [source]

If the day format also omits leading zeroes… mostly through laziness, I've ignored that possibility here.

Or maybe I'm not so lazy after all, I couldn't resist tweaking it a little more. This version should handle having single or double-digit for both the day and the hour:

WITH cte(xml_tag) AS (VALUES 
  ('<Event time="Sat Apr 2 1:01:51.887" type="Debug" thread="2164: Main CIM worker thread" elapsed="1" function="Geodatabase.Cursor" code="EndCursor">'), 
  ('<Event time="Mon Feb 22 1:01:51.883" type="Debug" thread="2164: Main CIM worker thread" elapsed="23" function="Geodatabase.Cursor" code="EndCursor">'), 
  ('<Event time="Sun Jan 22 1:01:51.874" type="Debug" thread="2164: Main CIM worker thread" elapsed="456" function="Geodatabase.Cursor" code="EndCursor">'), 
  ('<Event time="Sat Apr 22 12:01:51.846" type="Debug" thread="2164: Main CIM worker thread" elapsed="7890" function="Geodatabase.Cursor" code="EndCursor">'))
SELECT STRFTIME('%Y') || '-' || CASE SUBSTR(xml_tag, 18, 3)
    WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03' WHEN 'Apr' THEN '04'
    WHEN 'May' THEN '05' WHEN 'Jun' THEN '06' WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08'
    WHEN 'Sep' THEN '09' WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11' WHEN 'Dec' THEN '12' ELSE '??' END
  || '-' || CASE SUBSTR(xml_tag, 23, 1) WHEN ' ' THEN '0' || SUBSTR(xml_tag, 22, 2) ELSE SUBSTR(xml_tag, 22, 3) END
  || CASE LENGTH(TRIM(SUBSTR(xml_tag, 24, 13), ' "')) WHEN 11 THEN '0' ELSE '' END || TRIM(SUBSTR(xml_tag, 24, 13), ' "')
  AS event_time, xml_tag FROM cte;

It's still fragile overall though, and will break if the input format deviates even a little from the examples you've shown.

[Edit: minor adjustments to more closely match the version in my previous post, since I initially based this on an earlier version I had been playing with]

(5) By BB1234 on 2023-04-26 17:42:48 in reply to 3.1 [link] [source]

Thanks Phil and Adrian! You make good points and your scrips are going to be very helpful. Investigating.

(4) By Adrian Ho (lexfiend) on 2023-04-26 10:23:13 in reply to 1 [source]

@phil_g has made a valiant effort to do a pure-SQL implementation, but as he says in his reply, it's inherently fragile.

I'd go even further and say that dumping what look like raw XML tuples into a DB and trying to dynamically regenerate its component values on demand is just make-work. Whoever did this should've just pre-processed the XML and INSERTed the data in individual columns like a sane programmer, something like this:

#!/usr/bin/env python3
from datetime import datetime
import sqlite3
from xml.dom.minidom import parseString

con = sqlite3.connect("test.db")
cur = con.cursor()
cur.execute("DROP TABLE IF EXISTS real_data")
cur.execute("CREATE TABLE real_data (time TEXT, type TEXT, thread TEXT, elapsed INTEGER, function TEXT, code TEXT)")
new_data = []
for row in cur.execute("SELECT xml_tag FROM cte"):
    # Event tag is missing its closing counterpart
    doc = parseString(row[0] + "</Event>")
    new_row = {}
    for (name, val) in doc.getElementsByTagName("Event")[0].attributes.items():
        if name == "time":
            new_row[name] = datetime.strptime(val, "%a %b %d %H:%M:%S.%f").replace(year=2023)
        else:
            new_row[name] = val
    new_data.append(new_row)
cur.executemany("INSERT INTO real_data VALUES (:time, :type, :thread, :elapsed, :function, :code)", new_data)
con.commit()
con.close()

which does this:

$ sqlite3 -table test.db 'select * from cte'
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                        xml_tag                                                                         |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+
| <Event time="Sat Apr 22 1:01:51.887" type="Debug" thread="2164: Main CIM worker thread" elapsed="1" function="Geodatabase.Cursor" code="EndCursor">    |
| <Event time="Mon Feb 22 1:01:51.883" type="Debug" thread="2164: Main CIM worker thread" elapsed="23" function="Geodatabase.Cursor" code="EndCursor">   |
| <Event time="Sun Jan 22 1:01:51.874" type="Debug" thread="2164: Main CIM worker thread" elapsed="456" function="Geodatabase.Cursor" code="EndCursor">  |
| <Event time="Sat Apr 22 1:01:51.846" type="Debug" thread="2164: Main CIM worker thread" elapsed="7890" function="Geodatabase.Cursor" code="EndCursor"> |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+

$ ./test.py

$ sqlite3 -table test.db 'select * from real_data'
+----------------------------+-------+------------------------------+---------+--------------------+-----------+
|            time            | type  |            thread            | elapsed |      function      |   code    |
+----------------------------+-------+------------------------------+---------+--------------------+-----------+
| 2023-04-22 01:01:51.887000 | Debug | 2164: Main CIM worker thread | 1       | Geodatabase.Cursor | EndCursor |
| 2023-02-22 01:01:51.883000 | Debug | 2164: Main CIM worker thread | 23      | Geodatabase.Cursor | EndCursor |
| 2023-01-22 01:01:51.874000 | Debug | 2164: Main CIM worker thread | 456     | Geodatabase.Cursor | EndCursor |
| 2023-04-22 01:01:51.846000 | Debug | 2164: Main CIM worker thread | 7890    | Geodatabase.Cursor | EndCursor |
+----------------------------+-------+------------------------------+---------+--------------------+-----------+

(6) By Holger J (holgerj) on 2023-04-26 17:46:38 in reply to 1 [link] [source]

Actually, the XML standard REQUIRES date and datetime values to be written according to ISO8601, which can easily be read by SQLite.

So I would think it best to refuse the XML document and request a correction to adhere to standards.

Maybe you feel like not being in a position to do so. Still, it's important to promote standards and confront ignorant people.

Regards, Holger

(7) By BB1234 on 2023-04-26 19:29:11 in reply to 6 [link] [source]

Good thinking. I actually submitted a request for enhancement about that a few hours ago:

Diagnostic Monitor Logs — Sortable date format