How SQLite decides if a subquery is correlated or not?

(1) By Qingyao Sun (nalzok) on 2020-08-07 03:30:11 [link] [source]

As a background, the question stems from my observation that a query that once took ~30s finishes in ~0.05s after I replaced an EXISTS with IN in the WHERE clause. I have constructed the following example to reproduce the issue to you. While this thread is self-contained, you are more than welcome to check out this related question on DBA.SE and this previous thread.

Assume that we have a list of natural numbers running from 0 to 999.

CREATE TABLE IF NOT EXISTS natural_numbers AS
WITH RECURSIVE natural_numbers(x) AS (
    SELECT 0
    UNION
    SELECT x + 1
    FROM natural_numbers
    LIMIT 1000
)
SELECT x
FROM natural_numbers;

Now, I want to draw 5 out of the 1000 numbers to be "lucky numbers", and count how many numbers are lucky. What I did is

WITH lucky_numbers(lucky_number) AS (
    SELECT x
    FROM natural_numbers
    ORDER BY RANDOM()
    LIMIT 5
)
SELECT SUM(x IN lucky_numbers)
FROM natural_numbers;

and the query plan is

QUERY PLAN
|--SCAN TABLE natural_numbers
`--LIST SUBQUERY 2
   |--SCAN TABLE natural_numbers
   `--USE TEMP B-TREE FOR ORDER BY

While subquery 1 lucky_numbers is non-correlated, I'm surprised to see that subquery 2 (x IN lucky_numbers) is not considered a correlated subquery. As pointed out by @drh earlier,

[T]he programmer cannot make any assumptions about whether or not the results of a (non-correlated) subquery are cached and reused or if the subquery is run multiple times.

If (x IN lucky_numbers) is non-correlated, then the optimizer might decide to return the same value for all x, which would be blatantly wrong. I'm wondering how SQLite decides if a subquery is correlated or not?

For comparison, here is an alternative query

WITH lucky_numbers(lucky_number) AS (
    SELECT x
    FROM natural_numbers
    ORDER BY RANDOM()
    LIMIT 5
)
SELECT SUM(EXISTS(
        SELECT 1
        FROM lucky_numbers
        WHERE x = lucky_number
))
FROM natural_numbers;

Its query plan correctly identifies subquery 2 (SELECT 1 FROM lucky_numbers WHERE x = lucky_number) as a correlated scalar subquery.

QUERY PLAN
|--SCAN TABLE natural_numbers
`--CORRELATED SCALAR SUBQUERY 2
   |--CO-ROUTINE 1
   |  |--SCAN TABLE natural_numbers
   |  `--USE TEMP B-TREE FOR ORDER BY
   `--SCAN SUBQUERY 1

Thus, while lucky_numbers may still be cached, it is guaranteed that (SELECT 1 FROM lucky_numbers WHERE x = lucky_number) will be executed for each x.

In addition, the bundled query explainer in PyCharm Professional indicates that with IN the subquery is only once (see red circle), whereas with EXISTS it's executed repeatedly. I'm not sure if that's related to whether the subquery is considered correlated or not.

(2) By Keith Medcalf (kmedcalf) on 2020-08-07 15:29:27 in reply to 1 [link] [source]

A subquery is correlated where it contains a reference to the parent row -- that is, it requires evaluation for each row in the outer query.

select a,
       (select b from c),
  from d;

the subquery is not correlated -- its result is independent of the parent.

select a,
       (select b from c where g == a)
  from d;

the subquery is correlated because it must be evaluated separately for each row in d because the subquery is "correlated" to the value of "a" in the outer (containing) query.

So a subquery is either independent or it is correlated and telling the difference is trivial.

(3.1) By Keith Medcalf (kmedcalf) on 2020-08-07 15:51:04 edited from 3.0 in reply to 1 [source]

x in lucky_numbers

is not a subquery. This actually means, where lucky_numbers is a table-like object:

x in (select luckynumber from lucky_numbers)

and the subquery is independent of the value of x. It is not a correlated subquery.

Similarly the other query is not correlated, even though it must be evaluated for each row of the outer table, it does not depend on a value of that outer table and is therefore independent, not correlated.

It is up to the planner to decide whether an "independent" query will be run more than once. If the results of such a query are non-deterministic, you have created yourself a problem because the planner assumes that the result of a query is deterministic and stable. If it does not hold that property then you run into the problem you have observed (not to mention that the entire concept of a "computer" is dependent on the assumption of determinism -- that doing exactly the same thing again with the same inputs will achieve the same output).

It you violate this fundamental design principal of the universe then all hell breaks loose.

(4) By Qingyao Sun (nalzok) on 2020-08-08 00:40:10 in reply to 3.1 [link] [source]

Thanks, so the takeaway is that SQL is agnostic to (non-)determinism, and its decisions are based solely on if there is any dependency. This is a little surprising to me, but I guess that's how things are specified, so I need to obey the rule :)