SQLite User Forum

soliciting ideas on how to store and query XMLs

(1) By punkish on 2022-10-17 15:45:03

This is a meta- question about how to solve my project problem with SQLite.

I extract tags, attributes and text from XMLs (currently ~700K) and store them in a SQLite db which is then queryable via an API I've built. Another script periodically (every night) downloads new XMLs and does the extract-transform-load (ETL) bit, inserting new data into the db. My current SQLite db is about 19GB, with all the indexes, including FTS5 and R*Tree.

Recently I discovered that I had overlooked one critical attribute in my ETL process. So now, I have to redo the whole d%$# thing. Whether I alter the relevant table and update the new column with the attrib, or redo the entire db all over, the pain is going to be similar -- opening all 700K XMLs, extracting the (forgotten) attrib, inserting it, etc. It is not terrible, but it is definitely not enjoyable. Which makes me think -- what if, at a later point, I discover I need to add some other attrib? Which leads me to this post -- is there a better way to do this?
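
For concreteness, the alter-and-backfill option would look something like the sketch below (table and column names are made up; the two parameters get bound once per re-parsed XML):

ALTER TABLE pt ADD COLUMN forgotten_attr TEXT;
UPDATE pt SET forgotten_attr = ? WHERE pt_id = ?;  -- bound per re-parsed XML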

Ideally, I would simply load the XMLs (without ETL-ing the data), and query whichever tag or attrib I want using some kind of XQuery. But I can't do that with SQLite. I could use an XML db (such as BaseX), but in terms of speed and simplicity, those dbs are nowhere in SQLite's league.

An alternative would be to ETL all the tags and their attribs, even if many of them may not be used right now. That way I future-proof my ETL process/db. But I would need a schemaless db because not only do the tags and attribs vary across the XML docs, they may very likely change in the future (new tags may be used).

Or maybe, there is no way. If I want to use SQLite (which I do), I just have to do what I am doing now, and if I discover a new tag or attrib that I have not stored thus far, I just have to redo my db.

Thoughts?

(2) By jose isaias cabrera (jicman) on 2022-10-17 17:20:19 in reply to 1

what if, at a later point, I discover I need to add some other attrib?

I can't really tell you what to do here, but from experience, this need to add a new field will happen again. The question is how much data (tags) there is for you to keep or not keep. If you are talking about 20 or so more tags, add them all. If you have thousands, then you may want to think about the 'possible scenarios' you think you may need later.

There have been many times when I have asked my boss, "Is that the only field you want? I can add more now." They respond, "No, that is the only field I want. We are not going to need anything else." Three months later: "Hey, I need another field added for that report..."

That's the only comment I can add...

(3) By Karl (kbilleter) on 2022-10-18 01:53:13 in reply to 1

I've only set this aside as something to look at one day — it hasn't come up for me yet :-). There was a post to this list last year

https://sqlite.org/forum/info/eb86b8d3faebfe09

and another xpath extension that I've come across in searches but never tried out

https://github.com/softace/sqliteodbc/blob/master/xpath.c

K

(4) By anonymous on 2022-10-18 04:43:31 in reply to 1

I don't know the specifics of your use case, but have you considered converting the XML to JSON and storing it that way? https://www.sqlite.org/json1.html
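
For example (a sketch only; the exact JSON paths depend entirely on how your XML-to-JSON converter encodes attributes and text -- here a hypothetical docs table holds the converted document in a body column):

SELECT json_extract(body, '$."pt"."@id"')    AS pt_id,
       json_extract(body, '$."pt"."#text"')  AS fulltext
FROM docs
WHERE json_extract(body, '$."pt"."@name"') = 'foo';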

(5.1) By ddevienne on 2022-10-19 07:44:17 edited from 5.0 in reply to 1

This is a meta- question about how to solve my project problem with SQLite.

Many people on this list like these open-ended, non-pure-SQLite questions :)

I extract tags, attributes and text from XMLs (currently ~700K) [...]

It's not 100% clear whether you store both the XML docs and the extracted subset, as of now.
At 19GB, I'd figure yes, but who knows, maybe your XMLs are big too.

I had overlooked one critical attribute in my ETL process [...]

I agree with another poster, these things tend to come up, whether we want them to or not.

Ideally, I would simply load the XMLs (without ETL-ing the data),
and query which tag or attrib I want using some kind of xquery

Yes, I've done that with JSON docs and JSON1. It's very convenient,
and fast enough for smaller sizes. But depending on the nesting of the
extracted data in the documents, and on the document sizes and numbers, I'm afraid
doing this dynamically does not scale well enough for your use case.

What you describe is mostly Embarrassingly Parallel, while SQLite is purely linear.
So you'd be better off doing the ETL in parallel elsewhere (i.e. in a compiled language), and
storing the results in SQLite (linearly). You FAN OUT for the ETL, and FAN IN for the INSERTs.
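
A minimal sketch of that shape, in Python for brevity (lxml assumed; the extraction function and table are made up):

# Workers parse XMLs in parallel; the single SQLite connection inserts serially.
import glob, sqlite3
from multiprocessing import Pool
from lxml import etree

def extract_row(path):
    # FAN OUT: CPU-bound XML parsing runs in a worker process.
    root = etree.parse(path).getroot()
    return (root.get('id'), root.get('name'), ' '.join(root.itertext()))

if __name__ == '__main__':
    con = sqlite3.connect('etl.db')
    con.execute("CREATE TABLE IF NOT EXISTS pt (pt_id TEXT, name TEXT, fulltext TEXT)")
    with Pool() as pool:
        # FAN IN: the lone writer drains results as workers finish.
        con.executemany("INSERT INTO pt VALUES (?, ?, ?)",
                        pool.imap_unordered(extract_row, glob.glob('xml/*.xml')))
    con.commit()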

OTOH, maybe speed is not essential, if it needs to happen only once in a while.
Still, when I used JSON1 for ETL, I did materialize the result in real tables,
as doing it dynamically on-the-fly was too slow. One can capture the ETL rules as
views, for example, and materialize those via CTAS statements. If a new attribute
comes along, re-create the view, then drop and recreate the materialized table.
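
Sketched with made-up names (a docs table holding JSON in a body column), the view + CTAS pattern would be:

CREATE VIEW pt_etl AS
SELECT json_extract(body, '$."pt"."@id"')   AS pt_id,
       json_extract(body, '$."pt"."@name"') AS name
FROM docs;

DROP TABLE IF EXISTS pt;
CREATE TABLE pt AS SELECT * FROM pt_etl;  -- CTAS materialization

-- new attribute discovered later: DROP VIEW, re-CREATE it with the
-- extra json_extract(), then re-run the two statements above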

The only problem with this approach is that no one has written the XML equivalent of JSON1. (Correction from Karl here)
With JSON1 as an example to follow, it's not super difficult, yet still quite a bit of work...

An alternative would be to ETL all the tags and their attribs [...]

That's actually something I've thought of too: extracting the infoset of XML documents,
and storing it in a highly normalized and as-compact-as-possible way in SQLite, for querying.
Since SQLite has hierarchical queries with CTEs, it's not quite XPath or XQuery, but it can
probably do the kind of ETL you're interested in. All namespaces, elements, attributes, etc.
would be normalized for compactness, and the hierarchy recorded via Foreign Keys, so that's
still schema-less, yet schema-aware.
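
A minimal sketch of such a layout (all names hypothetical): element/attribute names are interned once, the hierarchy is kept via a parent FK, and a recursive CTE stands in for XPath descent:

CREATE TABLE name (id INTEGER PRIMARY KEY, ns TEXT, local TEXT, UNIQUE(ns, local));
CREATE TABLE node (
    id INTEGER PRIMARY KEY,
    parent INTEGER REFERENCES node(id),
    name_id INTEGER REFERENCES name(id),
    text TEXT
);
CREATE TABLE attr (
    node_id INTEGER REFERENCES node(id),
    name_id INTEGER REFERENCES name(id),
    value TEXT
);

-- XPath-ish descent via a recursive CTE: all descendants of node 1
WITH RECURSIVE sub(id) AS (
    SELECT 1
    UNION ALL
    SELECT node.id FROM node JOIN sub ON node.parent = sub.id
)
SELECT nm.local, node.text
FROM node JOIN sub USING (id) JOIN name nm ON nm.id = node.name_id;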

Also, since SQLite is flexibly typed, you have the choice of always storing the XML InfoSet as
text, or of using XSD-schema-informed typing post-validation (i.e. the PSVI) to decide
which SQLite type to use to store the values. If you have lots of numerical data, that could
be interesting. In my own ETL, I extract elements of varying names, but of given XSD types.

Thoughts?

See above :). Hopefully that's not too rambling, and useful to you. Cheers, --DD

(6) By Víctor (wzrlpy) on 2022-10-19 06:35:33 in reply to 1

Besides the well-known ODBC driver, there is an xpath virtual table extension at http://www.ch-werner.de/sqliteodbc/ that may ease the pain of parsing your XML.

(7.1) By punkish on 2022-10-19 14:28:59 edited from 7.0 in reply to 1

thanks everyone for very helpful suggestions… I think I will go with ETLing all the tags and their attribs, even the ones that I don't use right now. Then, if my searching needs change later, I only have to create new indexes as needed.

I have only recently discovered the content=table option of FTS5 (still learning about it), so that will save me some space. While space is not a constraint, the time taken to back up the data would go down with a smaller db file.
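
For illustration, an external-content FTS5 table over a hypothetical pt table looks like this; it indexes the column without storing a second copy of the text, but the index must then be kept in sync manually (or with triggers):

CREATE VIRTUAL TABLE pt_fts USING fts5(fulltext, content='pt', content_rowid='id');

INSERT INTO pt_fts(rowid, fulltext) SELECT id, fulltext FROM pt;
SELECT rowid FROM pt_fts WHERE pt_fts MATCH 'maybe';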

(8) By midijohnny on 2022-10-20 18:24:01 in reply to 7.1

Just trying to understand if you are using SQLite to sort-of serialize the XML tree structure into a generic set of tables, or whether you have app-specific tables onto which you map your XML?

That is: do you have tables (say) called "elements" and "attributes" with join-tables etc to hold the structure of XML - or are you transforming your XML into (say) "Person", "Phone", "Address" etc?

I'm just curious - because I'm also working with XML->DB.

FWIW - I'm using XSLT to extract stuff and transform to a generic XML format - which maps to my (app-specific) tables.

I call the XSLT from a host language (Python or C) and then walk through the results to trigger the relevant INSERT/REPLACE SQL needed to load the data.

Hence my request for an ENH here: https://sqlite.org/forum/info/8fd9d9b7d314bb1b (though I'm not sure whether it would suit what you need to do).

(9) By punkish on 2022-10-21 09:58:49 in reply to 8

I am extracting certain tags and their attributes. The tags map to tables, and the attribs map to columns in those tables. One specific tag is the "principal" tag, and hence it acts as the central table in the schema. The other tables (mapped from other tags) are related to the principal table via FKs. All the extraction and table-building is done in Node.js.

Imagine an XML like so

<pt id="6TYFR" name="foo">1st pt has
    <st1 id="BH765" name="foo_st1">1st st</st1> and 
    <st2 id="9KI6O" name="foo_st2">2nd st</st2>
    and maybe more
</pt>

The above results in a schema like so

CREATE TABLE pt (
    id INTEGER PRIMARY KEY,
    pt_id TEXT,
    name TEXT,
    fulltext TEXT
);

CREATE TABLE st1 (
    id INTEGER PRIMARY KEY,
    st1_id TEXT,
    pt_id TEXT,    -- FK to pt.pt_id
    name TEXT,
    fulltext TEXT
);

CREATE TABLE st2 (
    id INTEGER PRIMARY KEY,
    st2_id TEXT,
    pt_id TEXT,    -- FK to pt.pt_id
    name TEXT,
    fulltext TEXT
);

and data like so

table: pt

| id | pt_id | name | fulltext                                    |
|----|-------|------|---------------------------------------------|
| 1  | 6TYFR | foo  | 1st pt has 1st st and 2nd st and maybe more |

table: st1

| id | st1_id | pt_id | name    | fulltext |
|----|--------|-------|---------|----------|
| 1  | BH765  | 6TYFR | foo_st1 | 1st st   |

table: st2

| id | st2_id | pt_id | name    | fulltext |
|----|--------|-------|---------|----------|
| 1  | 9KI6O  | 6TYFR | foo_st2 | 2nd st   |

My challenge is to future-proof my ETL process so that if I discover I have to extract some new tag or attrib, I have to do minimal work to accomplish it. Short of a full-fledged implementation of XPath within SQLite, which would make it into an XML db, something it is not intended to be, one way would be to extract all the tags and their attribs, and then index for querying only those that are needed. Another way would be to store the XMLs also as a column (in the table pt, or in a separate db that can be ATTACHed when needed), and then add a new column when needed with that info. Right now I have more than 700000 XMLs, so they become very unwieldy on disk.
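
The separate-db variant would be something like this (file and table names made up):

ATTACH DATABASE 'raw_xml.db' AS raw;
CREATE TABLE IF NOT EXISTS raw.docs (pt_id TEXT PRIMARY KEY, xml BLOB);

-- a later re-extraction only has to touch the attached file:
SELECT xml FROM raw.docs WHERE pt_id = '6TYFR';
DETACH DATABASE raw;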

(11.1) By midijohnny on 2022-10-22 13:33:50 edited from 11.0 in reply to 9

Understood. I wonder then whether you can use XSLT, called from your host language, to generate ALTER TABLE statements as well as the REPLACE INTO DML to load the data.

So: alter the XSLT to add a new mapping, then call the XSLT to generate the REPLACE statements.

I'm doing the latter (i.e. not the ALTER table, just the INSERT/REPLACE) in my app.

Here's an example (table and column names replaced here) of how I process the XSLT into SQL statements. The XSLT I have, in my case, takes the input XML and transforms it to my 'standard' format (see my ENH request), so I can load tables that way.

The point here is that I have a single method to load all the tables, having pushed the XML->column-name mapping out to the XSLT. There is additional stuff you can get rid of here: the temporary-table method is just to let me show the before/after of the load.

One thing I was thinking, and haven't done yet, is to actually hold the logic of the XSLT in SQLite and then dynamically create the XSLT. Not sure if that is worth it. (The idea being that my mapping is then also held in the DB.)

Why XSLT? Because that at least makes my app somewhat host-language independent - assuming I have an XSLT library available.

Note: this method below is called after the XSLT has transformed the input XML to my XML format for tables/columns.

# python3 -- assumes `xml` is an lxml etree and `log` is a logging.Logger;
# <my-cols>, <table1> etc. are redacted names
def load_db(con,xml):
    cur=con.cursor()
    # snapshot current rows so newly added ones can be reported after the load
    cur.execute("DROP TABLE IF EXISTS temp.new_rows")
    cur.execute("CREATE TEMPORARY TABLE new_rows AS SELECT <my-cols> FROM <table1> NATURAL JOIN <table2>;")
    sql="INSERT OR %s INTO main.%s(%s) VALUES (%s)"
    for table in xml.xpath("/tables/table"):
        tablename=table.get('name')
        sql_type = table.get('type') or "IGNORE"  # e.g. REPLACE; defaults to IGNORE
        log.debug("Processing records for table:%s"%tablename)
        cols=[c.text or None for c in table.xpath('columns/column')]
        placeholders=["?"]*len(cols)
        pstat=sql%(sql_type,tablename,",".join(cols),",".join(placeholders))
        log.debug(pstat)
        values=[]  # defined up front so the except block can always log it
        try:
            # needs the connection opened with isolation_level=None, otherwise
            # sqlite3 may already have an implicit transaction open
            cur.execute("BEGIN TRANSACTION")
            for row in table.xpath('rows/row'):
                values=[v.text for v in row.xpath('values/value')]
                cur.execute(pstat,values)
            # report what the load added (needs con.row_factory = sqlite3.Row)
            for row in cur.execute("SELECT <my-cols> FROM <table1> NATURAL JOIN <table2> EXCEPT SELECT * FROM temp.new_rows"):
                log.info("Added %s : %s"%(row['<col1>'],row['<col2>']))
            cur.execute("COMMIT")
        except Exception as e:
            cur.execute("ROLLBACK")
            log.error(e)
            log.error(pstat)
            log.error("Table=%s, Values=%s"%(tablename,values))
            con.close()
            return  # connection is closed; stop processing further tables

(10) By Karl (kbilleter) on 2022-10-22 04:30:50 in reply to 1

Another alternative is to "flatten" the XML to make it more amenable to grepping/text search. For quick jobs I often use (the orphaned?) xml2 / 2xml.
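
If I remember its output format right, xml2 turns a document into one path=value line per node, something like:

$ echo '<pt id="6TYFR"><st1 id="BH765">1st st</st1></pt>' | xml2
/pt/@id=6TYFR
/pt/st1/@id=BH765
/pt/st1=1st st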