SQLite Forum

100x jump in query time with IN operator
Login
I have a query that uses `WHERE id IN (1,2,3,...)` where the list `(1,2,3,...)` is dynamically generated from an array of integers (_not_ using parameters). Now I have a particular query that takes roughly 500ms with 26623 ids but 50s (100x slower) with 26624 ids. In a different scenario the jump in query time happens between 3071 and 3072 IDs.

This happens in Node.js via better-sqlite3 (with its default compile options https://github.com/JoshuaWise/better-sqlite3/blob/23c56aa77e44689e56c56370345e3c3cc7895128/docs/compilation.md#bundled-configuration) with SQLite 3.36.0.

This query uses a custom virtual table and I'm not sure how to reproduce it without it and if it is part of the issue. This means I can't easily provide steps to reproduce it with vanilla SQLite. But I will provide as much information as I can.

The database contains normalized HTTP traffic. What the query does is get all unique URL parameters with a list of unique values. The `search_params` virtual table parses a query string and yields all name/value pairs.

This query takes 500ms:

```sql
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (1,2,3,...,26623)
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
```

If I change the range of IDs up to 26624 it takes 50s. This on its own is already interesting. I must be running into some limitation and maybe it starts paging or something?

Here's what I also found, which is even more interesting. If I make `search_params(search)` constant, the jump happens from 3071 to 3072 (500x from 10ms to 5s).

```sql
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params('?foo=bar') AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (1,2,3,...,3071)
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
```

Without the `IN` condition (running the query on all 31k flows) it takes roughly as long as with 26623 flows.

Now my workaround is to use `json_each` like this:

```sql
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (SELECT value FROM json_each('[1,2,3,...,26623]'))
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
```

Which is just as fast but seems to scale effortlessly (and as a bonus can make use of parameters for any number of IDs).

`EXPLAIN` returns 80k and 9k rows respectively (that sounds like a lot). With the `json_each` implementation it's like 70 (not k, just 70). Maybe I'm running into problems with the generic virtual table implementation of `xBestIndex` or `xFilter` (https://github.com/JoshuaWise/better-sqlite3/blob/master/src/util/custom-table.lzz)? But why would that have any influence on how `flows.id` is queried?

You can download the `EXPLAIN` results here:

* https://drive.protonmail.com/urls/MP56BT56F0#S0VWMC1hD29I
* https://drive.protonmail.com/urls/QHZV7KWHV0#Yf0DvjP97B2A
* https://drive.protonmail.com/urls/8VZEQ3JXFR#5P6EEle67dCF
* https://drive.protonmail.com/urls/5B3D7M3WS8#5xib2GAKU03f