Bad query results involving random()
Observe the following query:
WITH RECURSIVE cnt(x) AS(values(1) UNION ALL SELECT x+1 FROM cnt WHERE x<100), ran(x) AS(SELECT random() & 3 FROM cnt) SELECT x FROM ran WHERE x = 1;
That should obviously only output '1's or nothing, according to the WHERE clause. But I'm seeing it return a list of random numbers between 0 and 3.
I'm guessing that the
random() & 3 subexpression is being inlined in the query plan, hence using different values for comparison and returning.
OS: Gentoo amd64.
$ sqlite3 --version 3.33.0 2020-08-14 13:23:32 fca8dc8b578f215a969cd899336378966156154710873e68b3d9ac5881b0alt1
That should obviously only output '1's or nothing...
It should obviously not.
When "SELECT random() & 3" gets evaluated for the WHERE clause it may well have the value 1, but when the same value moments later get evaluated again for the output queue, it can very much result in a different value this time. Random is NOT a deterministic function (luckily) and two consecutive calls to Random() should not and would not give the same answer (unless by random luck).
Do not confuse "Fields" in a RDBMS with "Variables" in software. Even though they share some properties, they are not the same thing. A field based on a calculation may (or ma not, depending on engine efficiency decisions) be evaluated multiple times, even during the same step in the same query, and may or may not result in different values every time if the field's calculation is non-deterministic.
"Random()" is precisely such a calculation.
PS: There are tricks for making the random data into a materialized set (which won't change) if you are interested, but that is not really the question here.
Er, wow, alright. My mental model of CTEs and subqueries in FROM is that they semantically behave as if the result set of those queries is materialized - even if the implementation may optimize that.
This behavior definitely surprised me, and also implies that filtering on random() is impossible save for an explicit materialization step. Another interesting implication is that the expression
x <> x can (and indeed does) evaluate to true.
If you're correct (and who am I to say you're not - I'm in no way qualified to argue the semantics of SQL), I suppose it's time to update my mental model...
Note also that this is not merely SQLite being a bit quirky, this behaviour of "remembering output from functions" is specifically unwanted and eschewed in SQL for non-deterministic functions in most cases.
Imagine for a moment that you were interested in the real time during any part of a statement, you would be really dismayed if simply every row/inserted value has exactly the same time. The same goes for Random values for most use-cases.
That said there are cases in which the same time is very much required, at least throughout a transaction (say to ensure several related transactions all carry the same Posting time) as much you needed the same Random() value. To achieve that SQLite allows adding user-defined functions marked as "Deterministic" or non-deterministic (as described in the link above). In other engines there are sometimes different time functions for deterministic times and non-deterministic times. I've not seen a deterministic Random() function yet, but I haven't seen everything. :)
Lastly, if you can force the ephemeral table (composed by the CTE) to be materialized, then your query output is remembered (obviously) and every consecutive reference will have the same value for the called function.
I am never doing it that way, because of preference only, but I have seen Keith mention that if you add an "ORDER BY" to the CTE it forces the materialization. Not sure if this is always or under special circumstances only - hopefully someone else can confirm/elaborate or correct me.
Hope that provides some clarity!