SQLite User Forum

Faster Inserts from JSON?

(1) By SeverKetor on 2022-02-20 16:09:18 [source]

I have a couple of personal projects that make use of 100,000 - 1,000,000 row SQLite tables; both get their info from online sources, and both need that info refetched every once in a while.

I was able to speed one up to 295,000 inserts per second by compiling and using the CSV extension instead of relying on Python (as an aside, SQLite is impressively fast, especially since that rate already includes the time spent recreating the index on the table). However, the other project gets its data from an API that returns JSON as an array of objects. I've tried using json_each() and a couple of json_extract()s to insert the data, but it ended up taking 2-4x as long as parsing the data in Python with orjson and then bulk-inserting with a generator and Connection.executemany(). The latter method gets about 270,000 upserts per second with no indexes. Either method is easily fast enough for my purposes, but I'd like it to be faster anyway, if only for learning.
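
Roughly, the Python side looks like this (a simplified sketch; table, column and key names are illustrative rather than my actual schema, and the payload layout matches the structure shown in my reply further down):

    import sqlite3
    import orjson

    def load_apps(con: sqlite3.Connection, payload: bytes) -> None:
        # Parse once with orjson, then feed a generator to executemany()
        # so no intermediate list is built.
        apps = orjson.loads(payload)["applist"]["apps"]
        rows = ((a["appid"], a["name"]) for a in apps)
        # Bulk upsert; assumes appid is the table's primary key.
        con.executemany(
            "INSERT INTO apps(appid, name) VALUES (?, ?) "
            "ON CONFLICT(appid) DO UPDATE SET name = excluded.name "
            "WHERE name != excluded.name",
            rows,
        )
        con.commit()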

So, my question is: does anyone have experience with optimizing inserts from JSON, and if so, what did you do?

(2) By MBL (UserMBL) on 2022-02-20 19:24:30 in reply to 1 [link] [source]

Why not store the whole JSON structure in a single column? You can already access (read) it with the JSON functions, and beginning with 3.38.0 also with the new -> and ->> operators. Creating a simple VIEW may solve your task of reading many values out of the JSON structure this way, assuming you know the structure.

You did not give any detail about the structure of your JSON content, nor any other insights, which makes it difficult to give you more specific help.

My hint: if storage speed matters, then store your input 'as is' and spend some time and CPU power while reading.
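
For example, something along these lines (a minimal sketch; the table and view names are invented, and the JSON paths assume the applist/apps shape shown in the next message):

    import sqlite3

    con = sqlite3.connect("apps.db")
    con.executescript("""
        -- Store each fetched document 'as is' in a single column...
        CREATE TABLE IF NOT EXISTS raw_docs(doc TEXT);

        -- ...and pay the parsing cost at read time through a view.
        -- The -> and ->> operators require SQLite 3.38.0 or newer.
        CREATE VIEW IF NOT EXISTS app_names AS
            SELECT j.value->>'appid' AS appid, j.value->>'name' AS name
            FROM raw_docs, json_each(raw_docs.doc->'applist'->'apps') AS j;
    """)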

(3) By SeverKetor on 2022-02-20 20:15:52 in reply to 2 [link] [source]

"You did not give any detail about the structure of your JSON content" I kind of did, but I didn't do it very well. Fresh from the request it looks like:

{"applist":
    "apps": [
        {"appid": 123456, "name": "Some name"},
        {"appid": 123457, "name": "Some other name"},
        repeat
     ]
}
Storing the whole structure doesn't really work well when I want to use it. If I want to know what the name of app 123 is, SQLite would have to scan over the whole thing until it found it (or in this example, found nothing).
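
To illustrate with the hypothetical app_names view sketched above: the first lookup below has to re-parse and walk every stored document on every query, while the second is a plain b-tree search on the table's primary key:

    import sqlite3

    con = sqlite3.connect("apps.db")
    # Point lookup through the JSON-backed view: json_each() expands the
    # whole document each time, so cost grows with document size.
    print(con.execute("SELECT name FROM app_names WHERE appid = ?", (123,)).fetchone())
    # Same lookup against a materialized table with appid as PRIMARY KEY.
    print(con.execute("SELECT name FROM apps WHERE appid = ?", (123,)).fetchone())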

(4.2) By Harald Hanche-Olsen (hanche) on 2022-02-20 21:12:56 edited from 4.1 in reply to 3 [link] [source]

One possibility might be to convert the data to CSV format first, for example using jq, and then import the result into SQLite. Note that you can import from the output of a command, saving you the trouble of writing the result to an intermediate file.
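
For instance, something like this (a sketch only; the file and table names are made up, and the jq filter assumes the applist/apps structure shown above). A leading '|' in the .import argument tells the shell to read from a command's output instead of a file:

    $ sqlite3 apps.db
    sqlite> CREATE TABLE IF NOT EXISTS apps(appid INTEGER PRIMARY KEY, name TEXT);
    sqlite> .import --csv '|jq -r ".applist.apps[] | [.appid, .name] | @csv" apps.json' apps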

(5) By SeverKetor on 2022-02-20 22:18:49 in reply to 4.2 [link] [source]

Good idea, but it seems to lose out by a bit: it took more than a second, even for an in-memory database (.timer doesn't work for it, apparently). In comparison, json_each took 1.42 seconds and orjson + bulk inserts took 0.93 seconds, both for rebuilding the table from nothing. Doing upserts instead took those versions down to roughly 0.65 seconds and 0.4 seconds respectively.

I did figure out one obvious optimization though: I was doing "json_each(?->'applist'->'apps')". Removing the first 19 and last 2 characters from the JSON string and then using "json_each(?)" is faster, but only by one or two tenths of a second.

Just in case, the full query is:

    INSERT INTO apps
    SELECT value->>'appid', value->>'name' FROM json_each(?) WHERE 1
    ON CONFLICT DO UPDATE SET Name = excluded.Name WHERE Name != excluded.Name
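
In Python terms, the trimmed version is roughly this (the 19/2 offsets assume the compact, unindented response the API actually sends):

    import sqlite3

    def upsert_apps(con: sqlite3.Connection, raw: bytes) -> None:
        # Strip the 19-character prefix {"applist":{"apps": and the
        # 2-character suffix }} so json_each() gets the bare array.
        arr = raw[19:-2].decode()  # bind as TEXT, not BLOB
        # WHERE 1 lets the parser tell the upsert clause apart from the
        # SELECT; ->> requires SQLite 3.38.0 or newer.
        con.execute(
            "INSERT INTO apps SELECT value->>'appid', value->>'name' "
            "FROM json_each(?) WHERE 1 "
            "ON CONFLICT DO UPDATE SET Name = excluded.Name "
            "WHERE Name != excluded.Name",
            (arr,),
        )
        con.commit()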

(6) By ddevienne on 2022-02-21 11:45:15 in reply to 5 [link] [source]

SELECT value->>'appid', value->>'name' FROM json_each(?)

One reason the JSON route will be a bit slower is that the JSON is parsed three times:
once in json_each, which produces the small {"appid": ..., "name": ...} values,
and then each of those small JSON values is parsed again by each ->> operator in the SELECT clause.

I said three times because your document seems to be mostly composed of those small JSON values
produced by the json_each table-valued function, i.e. in your case it returns almost
the entire JSON doc infoset (to use an XML-world term).
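
You can see those intermediate values directly (a small demo; needs SQLite 3.38+ for the -> operator). Each row's value column is itself a little piece of JSON text, which every ->> in your real query then parses again:

    import sqlite3

    con = sqlite3.connect(":memory:")
    doc = '{"applist":{"apps":[{"appid":123456,"name":"Some name"}]}}'
    # json_each() parses `doc` once and emits one row per array element;
    # the `value` column holds each element as JSON text.
    for key, value in con.execute(
            "SELECT key, value FROM json_each(?->'applist'->'apps')", (doc,)):
        print(key, value)  # prints: 0 {"appid":123456,"name":"Some name"}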