Performance Regression After JSON Caching Change

(1) By SeverKetor on 2023-07-26 20:44:02 [source]

I saw the new JSON changes got merged into the trunk and was hoping it might improve a big upsert I do, so I updated and gave it some tests.

The JSON being parsed is essentially [{"appid": 12345, "name": "Example}, {"appid": 12346, "name": Something Else"}, <repeat for 169949 apps>] and the table is defined as CREATE TABLE app_names (AppID INTEGER PRIMARY KEY, Name TEXT). The upsert was INSERT INTO app_names SELECT Value->>'appid', Value->'name' FROM json_each(?) WHERE 1 ON CONFLICT DO UPDATE SET Name=excluded.Name WHERE Name!=excluded.Name

Testing it (with an in-memory DB and empty table, and an on-disk DB with the real table) shows about a 20% increase in time required for the query. Seems unfortunate, but I'm guessing it's because the caching doesn't get used much so it just slows the parsing down in this case?

Admittedly I'm going to get around to moving the JSON parsing out of SQL, so it's not like it'll eventually matter either way, but I decided it was worth mentioning my findings here.

(2) By Richard Hipp (drh) on 2023-07-26 20:52:01 in reply to 1 [link] [source]

The new changes do not touch the json_each() table-valued function. That is not something that has been optimized recently.

(3.2) By SeverKetor on 2023-07-26 22:35:12 edited from 3.1 in reply to 2 [link] [source]

Huh, alright.

Something is definitely slowing it down though; I repeated the tests multiple times and even had to flip-flop the versions (compiled from source in the exact same way) after I realized a mistake was adding a constant bit of time to the tests.

Should you want to test it yourself, the JSON data is just from https://api.steampowered.com/ISteamApps/GetAppList/v2/ and I was stripping off the apps and applist parts of the result in Python.

Edited to add: I used the July 21st snapshot as the pre-change version, and check-in 01d52232 for the post-change

(4) By Richard Hipp (drh) on 2023-07-26 23:49:32 in reply to 1 [link] [source]

I concur that there is about a 16% performance reduction in the particular query you are running, using cachegrind. Comparison output is shown here (temporarily - the link will be taken down at some point):

https://sqlite.org/tmp/json-performance-diff-20230726.html

You'll notice that the JSON parser is quite a bit faster. But other things are slowing it down. We are still a month away from feature-freeze so perhaps I'll be able to fix it.

Note that the performance on the query I was working to optimize, which is an production query for a clients, is about twice as fast. They are not using json_each, however.

I generated a sample database file like this:

.open --new x1.db
CREATE TABLE t1(json TEXT);
WITH RECURSIVE c(x) AS (VALUES(1) UNION ALL SELECT x+1 FROM c WHERE x<200000)
INSERT INTO t1
 SELECT json_group_array(json_object('appid',x,'name','Example-'||x))
   FROM c;

That data contains a single table with a single row that contains one huge JSON array with 200K entries, each of which is an object. The performance test is:

.open x1.db
CREATE TEMP TABLE app_names (AppID INTEGER PRIMARY KEY, Name TEXT);
INSERT INTO app_names SELECT Value->>'appid', Value->'name'
  FROM json_each((SELECT json FROM t1)) 
 WHERE 1
 ON CONFLICT DO UPDATE SET Name=excluded.Name WHERE Name!=excluded.Name;

(5) By Richard Hipp (drh) on 2023-07-27 00:25:13 in reply to 1 [link] [source]

The problem should be resolved as of check-in 837f2907e10b026f. Please report back if you find otherwise.

My (cachegrind-based) performance comparison is here:

https://sqlite.org/tmp/json-performance-diff-20230726-2.html

(6) By SeverKetor on 2023-07-27 01:04:54 in reply to 5 [link] [source]

Yep, that seems to have fixed it. Thanks!