100x jump in query time with IN operator

(1) By Alexander Prinzhorn (Prinzhorn) on 2021-10-25 08:00:52 [link] [source]

I have a query that uses WHERE id IN (1,2,3,...) where the list (1,2,3,...) is dynamically generated from an array of integers (not using parameters). Now I have a particular query that takes roughly 500ms with 26623 ids but 50s (100x slower) with 26624 ids. In a different scenario the jump in query time happens between 3071 and 3072 IDs.

This happens in Node.js via better-sqlite3 (with its default compile options https://github.com/JoshuaWise/better-sqlite3/blob/23c56aa77e44689e56c56370345e3c3cc7895128/docs/compilation.md#bundled-configuration) with SQLite 3.36.0.

This query uses a custom virtual table and I'm not sure how to reproduce it without it and if it is part of the issue. This means I can't easily provide steps to reproduce it with vanilla SQLite. But I will provide as much information as I can.

The database contains normalized HTTP traffic. What the query does is get all unique URL parameters with a list of unique values. The search_params virtual table parses a query string and yields all name/value pairs.

This query takes 500ms:

SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (1,2,3,...,26623)
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC

If I change the range of IDs up to 26624 it takes 50s. This on its own is already interesting. I must be running into some limitation and maybe it starts paging or something?

Here's what I also found, which is even more interesting. If I make search_params(search) constant, the jump happens from 3071 to 3072 (500x from 10ms to 5s).

SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params('?foo=bar') AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (1,2,3,...,3071)
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC

Without the IN condition (running the query on all 31k flows) it takes roughly as long as with 26623 flows.

Now my workaround is to use json_each like this:

SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (SELECT value FROM json_each('[1,2,3,...,26623]'))
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC

Which is just as fast but seems to scale effortlessly (and as a bonus can make use of parameters for any number of IDs).

EXPLAIN returns 80k and 9k rows respectively (that sounds like a lot). With the json_each implementation it's like 70 (not k, just 70). Maybe I'm running into problems with the generic virtual table implementation of xBestIndex or xFilter (https://github.com/JoshuaWise/better-sqlite3/blob/master/src/util/custom-table.lzz)? But why would that have any influence on how flows.id is queried?

You can download the EXPLAIN results here:

https://drive.protonmail.com/urls/MP56BT56F0#S0VWMC1hD29I
https://drive.protonmail.com/urls/QHZV7KWHV0#Yf0DvjP97B2A
https://drive.protonmail.com/urls/8VZEQ3JXFR#5P6EEle67dCF
https://drive.protonmail.com/urls/5B3D7M3WS8#5xib2GAKU03f

(2) By Gunter Hick (gunter_hick) on 2021-10-25 12:06:56 in reply to 1 [source]

SQLite NGQP is a cost based query planner. The IN () operator with a list of literal values gets implemented as a kind of temporary table; sometimes SQLite decides to create an index and do lookups, other times it decides to use that table as the outermost loop of the query.

EXPLAIN QUERY PLAN should show that in a more concise manner.

If compiled in DEBUG mode mith WHERETRACE enabled, the .wheretrace command will show how SQLite NGQP reaches its plan. Essential input is the return values from the xBestIndex method of your virtual table, especially the "number of rows" and the "estimated cost". It is paramount to deliver accurate estimates. Cost should reflect processing cost relative to SQLite native tables.

Note that you can name the IN table by making it a CTE and CROSS JOIN to force the query plan that works fast.

(3.1) By Gunter Hick (gunter_hick) on 2021-10-25 12:07:14 edited from 3.0 in reply to 1 [link] [source]

Deleted