SQLite Forum

Timeline
Login

1 event by user Prinzhorn

2021-10-25
08:00 Post: 100x jump in query time with IN operator artifact: e7b9a0d7db user: Prinzhorn

I have a query that uses WHERE id IN (1,2,3,...) where the list (1,2,3,...) is dynamically generated from an array of integers (not using parameters). Now I have a particular query that takes roughly 500ms with 26623 ids but 50s (100x slower) with 26624 ids. In a different scenario the jump in query time happens between 3071 and 3072 IDs.

This happens in Node.js via better-sqlite3 (with its default compile options https://github.com/JoshuaWise/better-sqlite3/blob/23c56aa77e44689e56c56370345e3c3cc7895128/docs/compilation.md#bundled-configuration) with SQLite 3.36.0.

This query uses a custom virtual table and I'm not sure how to reproduce it without it and if it is part of the issue. This means I can't easily provide steps to reproduce it with vanilla SQLite. But I will provide as much information as I can.

The database contains normalized HTTP traffic. What the query does is get all unique URL parameters with a list of unique values. The search_params virtual table parses a query string and yields all name/value pairs.

This query takes 500ms:

SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (1,2,3,...,26623)
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC

If I change the range of IDs up to 26624 it takes 50s. This on its own is already interesting. I must be running into some limitation and maybe it starts paging or something?

Here's what I also found, which is even more interesting. If I make search_params(search) constant, the jump happens from 3071 to 3072 (500x from 10ms to 5s).

SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params('?foo=bar') AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (1,2,3,...,3071)
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC

Without the IN condition (running the query on all 31k flows) it takes roughly as long as with 26623 flows.

Now my workaround is to use json_each like this:

SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (SELECT value FROM json_each('[1,2,3,...,26623]'))
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC

Which is just as fast but seems to scale effortlessly (and as a bonus can make use of parameters for any number of IDs).

EXPLAIN returns 80k and 9k rows respectively (that sounds like a lot). With the json_each implementation it's like 70 (not k, just 70). Maybe I'm running into problems with the generic virtual table implementation of xBestIndex or xFilter (https://github.com/JoshuaWise/better-sqlite3/blob/master/src/util/custom-table.lzz)? But why would that have any influence on how flows.id is queried?

You can download the EXPLAIN results here:

  • https://drive.protonmail.com/urls/MP56BT56F0#S0VWMC1hD29I
  • https://drive.protonmail.com/urls/QHZV7KWHV0#Yf0DvjP97B2A
  • https://drive.protonmail.com/urls/8VZEQ3JXFR#5P6EEle67dCF
  • https://drive.protonmail.com/urls/5B3D7M3WS8#5xib2GAKU03f