Distinct subquery with UNION ALL gives unexpected result
(1.1) By jh (jimhog) on 2023-08-31 17:18:30 edited from 1.0 [source]
Hello,
CREATE TABLE IF NOT EXISTS "t1" ("id" TEXT);
INSERT INTO t1 VALUES('a');
INSERT INTO t1 VALUES('b');
CREATE TABLE IF NOT EXISTS "t2" ("id" TEXT);
INSERT INTO t2 VALUES('c');
SELECT COUNT(1) FROM (SELECT DISTINCT * FROM t1 UNION ALL SELECT * FROM t2);
SELECT * FROM (SELECT DISTINCT * FROM t1 UNION ALL SELECT * FROM t2);
SELECT COUNT(1) FROM (SELECT DISTINCT * FROM t1 UNION ALL SELECT DISTINCT * FROM t2);
SELECT COUNT(1) FROM (SELECT DISTINCT * FROM t1 UNION SELECT * FROM t2);
Running the above SQL yields the below output which I find inconsistent.
2
a
b
c
3
3
Why does the first count return 2? There are no duplicates across the two tables, yet the DISTINCT
and UNION ALL
affect the outcome of the count. The first count query and the SELECT *
query are otherwise identical yet return a different number of rows and count.
I see this behavior in SQLite 3.43.0 and 3.42.0, but not in 3.41.2.
Help understanding this would be appreciated.
(2) By Richard Hipp (drh) on 2023-08-31 18:07:40 in reply to 1.1 [link] [source]
Simplified test case:
CREATE TABLE IF NOT EXISTS t1(id TEXT); INSERT INTO t1 VALUES('a'),('b'); SELECT * FROM (SELECT DISTINCT * FROM t1 UNION ALL SELECT * FROM t1); SELECT count() FROM (SELECT DISTINCT * FROM t1 UNION ALL SELECT * FROM t1); SELECT count(1) FROM (SELECT DISTINCT * FROM t1 UNION ALL SELECT * FROM t1);
The answer from the first two SELECT statements is correct. Then answer from the third SELECT is wrong.
This problem goes back to check-in 7c2d3406000dc8a (2023-02-22) which was an optimization request. The problem also exists in version 3.42.0. All releases prior to 3.42.0 seems to be ok.
(3) By jh (jimhog) on 2023-08-31 18:42:49 in reply to 2 [link] [source]
Thank you for confirming my suspicion. Do you know if there is a bug report for this, or would you like me to submit one?
(4) By Spindrift (spindrift) on 2023-08-31 18:46:00 in reply to 3 [link] [source]
You already did, thank you.
(6) By jose isaias cabrera (jicman) on 2023-08-31 18:56:20 in reply to 2 [link] [source]
Dr. Hipp, when you say,
This problem goes back to check-in 7c2d3406000dc8a (2023-02-22) which was an optimization request.
does that mean that this optimization is no longer available?
(7) By Richard Hipp (drh) on 2023-08-31 19:09:48 in reply to 6 [link] [source]
No, the optimization still works.
The thing with optimizations is that you need to be careful not to over-optimize - to apply an optimization in a situation where it is not valid - where it might result in an incorrect answer. That's what was happening here.
The omit-unused-subquery-columns optimization tries to avoid computing columns in a subquery that are never actually used in the outer query. But that won't work if the subquery is DISTINCT, because in a DISTINCT query, all columns are intrinsically used. So the omit-unused-subquery-columns optimization has a test to ensure that it does not run on a DISTINCT subquery.
But that test was incorrect for the case of a UNION ALL subquery where only the left operand of the UNION ALL is DISTINCT but not the right operand.
(5.1) Originally by Dan Kennedy (dan) with edits by Richard Hipp (drh) on 2023-08-31 20:24:34 from 5.0 in reply to 1.1 [link] [source]
Thanks for reporting this. Should now be fixed here:
https://sqlite.org/src/info/c84d5602ac9bfb4f
Dan.
(8) By jh (jimhog) on 2023-08-31 19:33:58 in reply to 5.0 [link] [source]
Nice. Thank you for fixing it.
(9) By anonymous on 2023-08-31 20:19:08 in reply to 5.0 [link] [source]
Looks like you mis-pasted. Try https://www.sqlite.org/src/info/c84d5602ac9bfb4f instead.