Unexpected results of random number filtering

(1) By anonymous on 2023-11-02 14:02:21 [link] [source]

This example has two problems and is trimmed down from a complex query, where random() numbers are made in a subquery and filtered in the outer query.

    sqlite> CREATE TABLE t (name VARCHAR(20));
    sqlite> INSERT INTO  t VALUES ('a'), ('b'), ('c');
    sqlite>
    sqlite> SELECT n FROM (SELECT random() n FROM t) WHERE n > 0;
    -635759060470222914
    1840456494413997070

First problem

It outputs negative numbers, which the outer query is intended to filter out.

The snippet was tested on dbfiddle.com and sqlfiddle.com; random() became rand(), which is between 0 and 1, so n > 0 became n > 0.5, but the gist does not change.

PostgreSQL filters as expected on versions 15, 9.6.

    SELECT n FROM (SELECT random() n FROM t) t2 WHERE n > 0.5;

MySQL filters as expected on version 5.6; not on 8.0, 5.7 (same problem as SQLite).

    SELECT n FROM (SELECT rand()   n FROM t) t2 WHERE n > 0.5;

MS SQL Server 2017 filters as expected; either returns nothing, or returns the same number thrice, which is less random and maybe unintended.

    SELECT n FROM (SELECT rand()   n FROM t) t2 WHERE n > 0.5;

For Oracle, the query should be something like SELECT n FROM (SELECT dbms_random.value(-50, 50) AS n FROM t) t2 WHERE n > 0; but sqlfiddle.com's Oracle seems to block dbms_random.value() for an unprivileged user.

Mechanism

random() seems to be called twice per row: the first time in the predicate condition, and (if that passes) a second number is generated for the resulting column. (Found out by Habbie and pekster, from libera.chat's ##sqlite. Thank you!)

The relevant section seems to be 2.4. "Generation of the set of result rows" (link). The inner query's result expression list random() should be evaluated and not enter the outer query. The outer query should work on the input data (numbers) and its own result expression list (n instead of random()).

But SQLite apparently does as expected, in a totally different query, where random() is called thrice explicitly: SELECT n FROM (SELECT random() n UNION ALL SELECT random() UNION ALL SELECT random()) WHERE n > 0;

Second problem

If that is the mechanism, then it is dangerous if one happens to use a function with side effects or non-idempotent, instead of random(), no?

Such a function may be created with create_function().

(2) By vorimi on 2023-11-02 14:43:44 in reply to 1 [link] [source]

here is repro with HAVING clause: dbfiddle.uk/WV3HFYjR

(3) By John Dennis (jdennis) on 2023-11-03 09:33:16 in reply to 1 [link] [source]

SELECT n FROM (SELECT random() n FROM t) WHERE CAST (n as INTEGER) > 0;
works for me. Curious, given random() returns an integer.

(4) By vorimi on 2023-11-03 18:58:55 in reply to 3 [link] [source]

The issue is present even with the query above (/w cast()): dbfiddle.uk/gsOOzWIA

(5) By anonymous on 2023-11-04 01:14:41 in reply to 1 [link] [source]

It gets weirder. Run these two statements a few times each and note the different behaviour:

    select n as n1, n as n2 from (select random() as n);

    select n as n1, n as n2 from (select random() as n from (values(1)));

Any "working as intended" explanation had better cover this case as well.

(6) By Ryan Smith (cuz) on 2023-11-04 10:13:44 in reply to 1 [link] [source]

This thread doesn't seem to get any real attention, so I thought I'd just quickly explain that the reason is that this is a very well known, well documented, and discussed to death on the forum already, situation. I'm not helping by reiterating it here, but it seems there's enough onlookers to warrant another round, so here goes....

The simplistic version:
Whenever the statement "Random()" is invoked, it gives a fully new random value (this can be the case for ANY non-deterministic function, not just Random()), if it is touched 3 times in one query, you will get 3 different results, that holds across predicates, row-values and calculations/functions/aggregates.

So most of the mentioned posts have zero surprises to anyone here.

Another thing that can happen, fully independent yet seemingly coincidental, is that the SQLite Query planner can at any time, if it sees fit or thinks that cycles will be saved, materialize a view or sub-view/sub-query. An easy to cook-up example of this is when ordering a result-set in a sub-query or use MATERIALIZE in a CTE, in which case, when selecting from the materialized view/row-set, suddenly all values seem stable and repeatable. This may even happen occasional or seemingly random and at the whim of the QP.

That too, is in no way exciting or weird to anyone here.

To answer one relatively important question:

it is dangerous if one happens to use a function with side effects or non-idempotent, instead of random(), no?

Dangerous to whom? If you create such a function, it's up to you to mark it as Deterministic or not, and if you do and lie to yourself or your users, then yes, stuff will break, as it should, good.

What I would say is that it is best to really look into how your DB engine of choice deals with such functions, they all differ. Some treat time as idempotent for instance, at least within transactions, but obviously that cannot be the case for the Random() function, if it was ever repeating the same value, even in a transaction, it would become utterly useless. You can force this by materializing views etc, but again, that mechanism works differently for different engines.

(7.1) By vorimi on 2023-11-04 14:40:42 edited from 7.0 in reply to 6 [source]

well known, well documented, and discussed to death on the forum already

Can you please cite the docs where this is documented and reasoned?

select n as n1, n as n2 from (select random() as n from (values(1)));

repro: dbfiddle.uk/MzDRrs7R

I belive the SQL standard requires n1 and n2 to be always the same, ie. any non-deterministic function must not be called more than once per row & expression.

I belive this issue should be fixed.

(8) By Ryan Smith (cuz) on 2023-11-06 12:12:28 in reply to 7.1 [link] [source]

Can you please cite the docs where this is documented and reasoned?

Not reiterating was my goal, but sure:
Documentation on determinism

I belive the SQL standard requires n1 and n2 to be always the same, ie. any non-deterministic function must not be called more than once per row & expression.

Now it's my turn to ask, can you cite where this is stated explicitly in the standard please? I have not seen such requirement in my days, but then, I have not seen everything.

Even if it is, the standard is a bit like the Pirate code, it's more suggestions than actual rules - however, if you do produce such a citation, I for one will be convinced that this is a real case to be looked into. At this point however, I am skeptical.

(9.2) By Donal Fellows (dkfellows) on 2023-11-06 12:25:25 edited from 9.1 in reply to 8 [link] [source]

In this case, I'd expect n1 and n2 to be the same because there is a single row in the (temporary, unnamed, scoped to the query) table from which they originate. The contents of that originating table are determined non-deterministically, but the steps after that are deterministic so far as I can see.

That doesn't happen in 3.44.0 (as deployed at https://sqlite.org/fiddle/)

(10) By Ryan Smith (cuz) on 2023-11-06 13:00:13 in reply to 9.2 [link] [source]

You'd expect? Based on what?

I don't disagree what the human intuitive thing would be, but there is no real basis to expect it other than human flawed intuition - at least, that's why I asked for citation from the other poster. I don't expect there to be wording to the affect he claimed or to what you profess to expect, but if it isn't stated so, or required so (which it may yet be!), then the grounds for expecting it to be so is nothing more than wishes.

The function is referenced multiple times, the fact that it is aliased at times says nothing over whether it should be run again. This is true in all programming languages (or let me rather say "most" before someone finds a language that altered this specific quirk).

This usually stems from a tendency among programmers to think of aliases as variables, which it ain't. If you call rand(x) "bobo" or "n2" and later call on "n2" what you'll find is not a stored value, but a call to rand(x) again. The only time this changes, in ANY sql engine, is when the sub-query or view gets materialized. There are good reasons for doing it, many engines have different reasons for doing it and do so at different times. I don't think the SQL standard prescribes exactly when materialization MUST happen, though I think I've seen text to the effect of saying when it may happen.

(11) By vorimi on 2023-11-08 12:50:00 in reply to 10 [link] [source]

I have consulted the answer with several people.

There was a consensus that the query from https://sqlite.org/forum/forumpost/8202db3916

select n as n1, n as n2 from (select random() as n from (values(1)));

is equivalent to:

select t.n as n1, t.n as n2 from (select random() as n from (values(1))) t;

repro: https://dbfiddle.uk/V91Wk2dd

The consensus was that the t table has (formal) evaluation priority and no internal optimization should make it volatile. I am unable to provide SQL standard citation. The feedback was this topic might be not even fully covered in the SQL standard.

Here is a specification of PostgreSQL https://www.postgresql.org/docs/current/xfunc-volatility.html . So at least PostgreSQL discusses it and guarantees single evaluation per row when a function is VOLATILE.

As the PostgreSQL spec makes results with SELECT n ... WHERE n = x much more sensual, the question is, is it feasible to implement non-DETERMINISTIC SQLite functions to be evaluated as per the PostgreSQL VOLATILE spec? If yes, are there any negatives (BC break, performance)?

(12) By anonymous on 2023-11-10 11:17:54 in reply to 6 [link] [source]

But your interpretation contradicts 2.4. Generation of the set of result rows as the function should be called once per row (not once per column).

Any DB admin and programmer is perplexed by seeing the negative numbers out of the above SELECT n FROM [unimportant] WHERE n > 0 (99% of the userbase?) but here's another example broken down.

    select n as n1, n as n2 from (select random() as n from (values(1)));

The list of expressions between the SELECT and FROM keywords is known as the result expression list.

In the inner query, it is random() as n.

Each result row is calculated by evaluating the expressions in the result expression list with respect to a single row of input data

Your interpretation:

evaluate random() as n means do nothing and keep the function call
The result row of the inner query contains code (function call).

Either the doc wording needs correction, or this interpretation is stretched to the point of being unreasonable.

Objective meaning:

evaluate random() as n means call the function (implying that side-effects run at this step, not later)
the result row of the inner query contains a number, a result row can not contain code.

(13) By Richard Hipp (drh) on 2023-11-10 15:52:58 in reply to 1 [link] [source]

A MATERIALIZED common table expression acts as an optimization barrier. So you can get the result you desire by rewriting the subquery as a materialized common-table expression:

CREATE TABLE t (name TEXT);
INSERT INTO  t VALUES ('a'), ('b'), ('c');
WITH r(n) AS MATERIALIZED (SELECT random() FROM t)
  SELECT n FROM r WHERE n>0;