100x jump in query time with IN operator

(1) By Alexander Prinzhorn (Prinzhorn) on 2021-10-25 08:00:52 [link]

I have a query that uses `WHERE id IN (1,2,3,...)` where the list `(1,2,3,...)` is dynamically generated from an array of integers (_not_ using parameters). Now I have a particular query that takes roughly 500ms with 26623 ids but 50s (100x slower) with 26624 ids. In a different scenario the jump in query time happens between 3071 and 3072 IDs.

This happens in Node.js via better-sqlite3 (with its default compile options https://github.com/JoshuaWise/better-sqlite3/blob/23c56aa77e44689e56c56370345e3c3cc7895128/docs/compilation.md#bundled-configuration) with SQLite 3.36.0.

This query uses a custom virtual table and I'm not sure how to reproduce it without it and if it is part of the issue. This means I can't easily provide steps to reproduce it with vanilla SQLite. But I will provide as much information as I can.

The database contains normalized HTTP traffic. What the query does is get all unique URL parameters with a list of unique values. The `search_params` virtual table parses a query string and yields all name/value pairs.

This query takes 500ms:

```sql
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (1,2,3,...,26623)
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
```

If I change the range of IDs up to 26624 it takes 50s. This on its own is already interesting. I must be running into some limitation and maybe it starts paging or something?

Here's what I also found, which is even more interesting. If I make `search_params(search)` constant, the jump happens from 3071 to 3072 (500x from 10ms to 5s).

```sql
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params('?foo=bar') AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (1,2,3,...,3071)
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
```

Without the `IN` condition (running the query on all 31k flows) it takes roughly as long as with 26623 flows.

Now my workaround is to use `json_each` like this:

```sql
SELECT params.name AS name, json_group_array(DISTINCT params.value) AS "values"
FROM view_requests AS req, search_params(search) AS params
JOIN flows ON flows.request_id = req.id
WHERE flows.id IN (SELECT value FROM json_each('[1,2,3,...,26623]'))
GROUP BY params.name
ORDER BY json_array_length("values") DESC, params.name ASC
```

Which is just as fast but seems to scale effortlessly (and as a bonus can make use of parameters for any number of IDs).

`EXPLAIN` returns 80k and 9k rows respectively (that sounds like a lot). With the `json_each` implementation it's like 70 (not k, just 70). Maybe I'm running into problems with the generic virtual table implementation of `xBestIndex` or `xFilter` (https://github.com/JoshuaWise/better-sqlite3/blob/master/src/util/custom-table.lzz)? But why would that have any influence on how `flows.id` is queried?

You can download the `EXPLAIN` results here:

* https://drive.protonmail.com/urls/MP56BT56F0#S0VWMC1hD29I
* https://drive.protonmail.com/urls/QHZV7KWHV0#Yf0DvjP97B2A
* https://drive.protonmail.com/urls/8VZEQ3JXFR#5P6EEle67dCF
* https://drive.protonmail.com/urls/5B3D7M3WS8#5xib2GAKU03f

(2) By Gunter Hick (gunter_hick) on 2021-10-25 12:06:56 in reply to 1

SQLite NGQP is a cost based query planner. The IN () operator with a list of literal values gets implemented as a kind of temporary table; sometimes SQLite decides to create an index and do lookups, other times it decides to use that table as the outermost loop of the query.

EXPLAIN QUERY PLAN should show that in a more concise manner.

If compiled in DEBUG mode mith WHERETRACE enabled, the .wheretrace command will show how SQLite NGQP reaches its plan. Essential input is the return values from the xBestIndex method of your virtual table, especially the "number of rows" and the "estimated cost". It is paramount to deliver accurate estimates. Cost should reflect processing cost relative to SQLite native tables.

Note that you can name the IN table by making it a CTE and CROSS JOIN to force the query plan that works fast.

(3.1) By Gunter Hick (gunter_hick) on 2021-10-25 12:07:14 edited from 3.0 in reply to 1 [link]

Deleted