Query Optimizer Changes Meaning of Query

(1.1) By Qingyao Sun (nalzok) on 2020-08-06 09:06:13 edited from 1.0 [link]

Say I have a list of natural numbers from 0 to 99

```
CREATE TABLE IF NOT EXISTS natural_numbers AS
WITH RECURSIVE natural_numbers(x) AS (
    SELECT 0
    UNION
    SELECT x + 1
    FROM natural_numbers
    LIMIT 100
)
SELECT x
FROM natural_numbers;
```

I want to draw 50 random numbers to be "lucky numbers", and count how many lucky numbers there are in the table `natural_numbers`. Here is the query

```
WITH lucky_numbers(lucky_number) AS (
    SELECT x
    FROM natural_numbers
    ORDER BY RANDOM()
    LIMIT 50
)
SELECT SUM(EXISTS(
        SELECT 1
        FROM lucky_numbers
        WHERE x = +lucky_number  -- disable automatic index on lucky_numbers(lucky_number)
))
FROM natural_numbers;
```

The result is a random variable bouncing around 50 (you may need to re-run the query multiple times to see the effect). This can be surprising to some, but it's actually documented

> An ordinary common table expression works as if it were a view that exists for the duration of a single statement.

Since `lucky_numbers` is merely a view, or a pre-packaged SELECT statement, it is naturally re-run for each row in `natural_numbers`. Consequently, each of the 100 natural numbers has an *independent* 50% chance of being lucky.

Now, here is the bug I'm reporting： if you replace the clause `WHERE x = +lucky_number` with `WHERE x = lucky_number`, the result will always be 50, which means the `AUTOMATIC COVERING INDEX (lucky_number=?)` added by the query optimizer changes the outcome of a query. I suspect this is because the common table expression `lucky_numbers` is no longer re-run for each row, but instead cached in a covering index.

(2) By Keith Medcalf (kmedcalf) on 2020-08-06 07:57:30 in reply to 1.0 [link]

The current tip of trunk appears to already do this.

(3) By Gunter Hick (gunter_hick) on 2020-08-06 07:57:39 in reply to 1.0 [link]

I find the opposite effect:

asql> explain query plan with lucky(lucky) as (select val from numbers order by random() limit 50) select sum(exists(select 1 from lucky where val = lucky)) from numbers;
id    parent         notu  deta
----  -------------  ----  ----
3     0              0     SCAN TABLE numbers
5     0              0     CORRELATED SCALAR SUBQUERY
8     5              0     CO-ROUTINE 0x19668760
12    8              0     SCAN TABLE numbers
24    8              0     USE TEMP B-TREE FOR ORDER BY
41    5              0     SEARCH SUBQUERY 0x19668760 USING AUTOMATIC COVERING INDEX (lucky=?)
asql> explain query plan with lucky(lucky) as (select val from numbers order by random() limit 50) select sum(exists(select 1 from lucky where val = +lucky)) from numbers;
id    parent         notu  deta
----  -------------  ----  ----
3     0              0     SCAN TABLE numbers
5     0              0     CORRELATED SCALAR SUBQUERY
8     5              0     CO-ROUTINE 0x19667CE0
12    8              0     SCAN TABLE numbers
24    8              0     USE TEMP B-TREE FOR ORDER BY
31    5              0     SCAN SUBQUERY 0x19667CE0

https://sqlite.org/optoverview.html states

"Terms of the WHERE clause can be manually disqualified for use with indices by prepending a unary *+* operator to the column name."

(4) By Qingyao Sun (nalzok) on 2020-08-06 09:08:29 in reply to 3

Oops, my bad! The queries didn't match their respective results because I was clueless. The original post has been edited now.

(5) By Qingyao Sun (nalzok) on 2020-08-06 09:11:24 in reply to 2 [link]

I am using

```
3.31.1 2020-01-27 19:55:54 3bfa9cc97da10598521b342961df8f5f68c7388fa117345eeb516eaa837bb4d6
```

So I'd say the current tip of trunk appears to **still** do this :)

(6) By anonymous on 2020-08-06 17:08:57 in reply to 1.1 [link]

It should be possible to make the query optimizer to automatically figure out whether or not a view is deterministic. However, this doesn't work for virtual tables, unless it is enhanced by adding a `SQLITE_INDEX_SCAN_DETERMINISTIC` flag (or a `SQLITE_VTAB_DETERMINISTIC` option).

(I think there are other messages on this forum describing proposed improvement of virtual table.)

(7) By Richard Hipp (drh) on 2020-08-06 21:48:52 in reply to 1.1 [link]

The query

~~~~
   SELECT x FROM natural_numbers ORDER BY random() LIMIT 50;
~~~~

Returns a table of 50 distinct random numbers.  The question then is,
if that query is really a subquery in a larger SQL statement, and the
result set is used multiple times within the outer query, is the
SQL engine compelled to rerun the subquery, or is it allowed to cache
the results of the first run and reuse them.

I contend that the query planner is free to do it either way.  It can
either cache the results of the first run and reuse them.  Or it can
rerun the query.  Or (if it wanted to) it could cache the results and
use them two or three times and then rerun the query and use those
results two or three times, then run the subquery again, and so forth.

In other words, the programmer cannot make any assumptions about whether
or not the results of a (non-correlated) subquery are cached and reused
or if the subquery is run multiple times.  The SQL engine is free to do
whatever it wants.  Any query that depends on whether or not subquery results
are cached returns unpredictable output.  It is akin to running
a query without an ORDER BY clause - the engine is free to return the
results in any order it wants.  Just because it returns the results in
the order you want today does not mean it will continue to do so
tomorrow.

Constructing a query like this that depends on whether or not subquery
results are cached is like writing a multiple-threaded program that
depends on the order in which the threads are executed.  The output can
vary from one run to the next.  Don't do that.

I checked with the PostgreSQL developers and am told that PostgreSQL
works the same way.  The equivalent query in PostgreSQL might cache
and reuse the subquery results, or it might rerun the subquery.  You
never know.

As currently implemented, SQLite only caches the subquery results if it
wants to build an automatic index.  But that decision might change tomorrow,
so you cannot depend on it.

(8) By Qingyao Sun (nalzok) on 2020-08-07 02:56:52 in reply to 7 [link]

> In other words, the programmer cannot make any assumptions about whether or not the results of a (non-correlated) subquery are cached and reused or if the subquery is run multiple times

Thanks for the great explanation! I actually have a follow-up question about how SQLite decides if a subquery is correlated or not, but let's discuss that in a new thread.