Suboptimal behaviour of COUNT(DISTINCT column)

(1) By Endrju (endrju0sqlite) on 2024-01-24 21:08:19 [source]

Given the following table:

CREATE TABLE gravity
(
  domain TEXT NOT NULL,
  adlist_id INTEGER NOT NULL REFERENCES adlist (id),
  PRIMARY KEY(domain, adlist_id)
);

The following query uses index:

sqlite> explain query plan select count(*) from (select distinct domain from gravity);
QUERY PLAN
|--CO-ROUTINE (subquery-1)
|  `--SCAN gravity USING COVERING INDEX sqlite_autoindex_gravity_1
`--SCAN (subquery-1)

but this one doesn't:

sqlite> explain query plan select count(distinct domain) from gravity;
QUERY PLAN
|--USE TEMP B-TREE FOR count(DISTINCT)
`--SCAN gravity

Using the sample data:

insert into gravity select 'domain'||(value%100000),value from generate_series(1,10000000);

it gives me a big difference in execution times:

sqlite> select count(*) from (select distinct domain from gravity);
100000
Run Time: real 0.943 user 0.937500 sys 0.000000
sqlite> select count(distinct domain) from gravity;
100000
Run Time: real 7.526 user 7.500000 sys 0.000000

This is a powerful PC machine. On smaller machines it makes bigger difference even with 10 times smaller data (1sec vs 12 sec below 1 million rows).

I would expect the latter behave similar to the former.

This is SQLite version 3.45.0 2024-01-15 17:01:13 (UTF-16 console I/O)

Please consider fixing it. Thank you in advance.

(2) By Endrju (endrju0sqlite) on 2024-01-29 16:48:58 in reply to 1 [link] [source]

Friendly bump, as it seem it went unnoticed. Thank you.

(3) By Richard Hipp (drh) on 2024-01-29 16:53:18 in reply to 2 [link] [source]

It's been noticed. I just don't have a solution, and frankly I've been busy with other things and have not had a lot of time to look at it.

If you have funding available that can help expedite an enhancement that will address your query planning concerns, you can contact me off-line at drh at sqlite dot org.

(4) By Endrju (endrju0sqlite) on 2024-01-29 17:56:44 in reply to 3 [link] [source]

Thank you. It doesn't require immediate action, I'd just like you to put it on some TODO for the (maybe distant) future so it won't be forgotten. Thanks!

(6) By anonymous on 2024-01-30 09:34:34 in reply to 4 [link] [source]

Hi,

You can try the following in the mean time till the fix is made.

The reason index is not used, I guess is because it is taking Count(Distinct domain) as a function and the moment function is used, the index goes out of the window.

You can force it to use the index by using Indexed by SQLite extension.

https://www.sqlite.org/lang_indexedby.html

The INDEXED BY phrase forces the SQLite query planner to use a particular named index on a DELETE, SELECT, or UPDATE statement. The INDEXED BY phrase is an SQLite extension and is not portable to other SQL database engines.

Here is the proof, I used your given example. Since you did not provide the table details for adlist which is used in the reference, I omitted it.

CREATE TABLE gravity ( domain TEXT NOT NULL, adlist_id INTEGER NOT NULL, PRIMARY KEY(domain, adlist_id) );

Find out the index name by using the following command

.indexes gravity

sqlite_autoindex_gravity_1

Without using Indexed By extension.

explain query plan select count(distinct domain) from gravity; QUERY PLAN |--USE TEMP B-TREE FOR count(DISTINCT) --SCAN gravity QUERY PLAN |--USE TEMP B-TREE FOR count(DISTINCT) --SCAN gravity

select count(distinct domain) from gravity;

100000

Run Time: real 5.898 user 5.625000 sys 0.281250

Now with using Indexed By extension.

explain query plan select count(distinct domain) from gravity indexed by sqlite_autoindex_gravity_1; QUERY PLAN `--SCAN gravity USING COVERING INDEX sqlite_autoindex_gravity_1

select count(distinct domain) as distinct_count from gravity indexed by sqlite_autoindex_gravity_1;

100000

Run Time: real 0.926 user 0.609375 sys 0.328125

Hope this helps you out, till this is fixed.

(7) By Richard Hipp (drh) on 2024-01-30 10:43:50 in reply to 4 [link] [source]

Yesterday, Dan put in check-in ad06868807a27f0c that does what you want, I believe. Please try the latest trunk version of SQLite (you'll need to build it yourself) and see if it works for you. Report back here what you find.

(9) By Endrju (endrju0sqlite) on 2024-01-30 18:45:31 in reply to 7 [link] [source]

Hats off! You really did not need to rush with this issue. Starting tomorrow I'll be offline for a few weeks, but after coming back I'll try if it works, and report back. Hope the (database) world will be a tiny bit better, or faster :-)

Thank you!

(5) By Bernard (bssqlite) on 2024-01-30 08:17:57 in reply to 1 [link] [source]

select count(distinct domain) from gravity INDEXED BY sqlite_autoindex_gravity_1;

(8.1) By Dominik (dominik) on 2024-01-30 18:00:42 edited from 8.0 in reply to 1 [link] [source]

Adding another simple index

CREATE INDEX idx_gravity_domains_only ON gravity (domain);

makes them perform equally well.

If adding another index is what you want depends on your particular case. I did note that your particular test case have 1 mio. rows, so another index comes at substantial amounts of extra time and memory (disk space) costs.

(10) By Endrju (endrju0sqlite) on 2024-01-30 18:47:50 in reply to 1 [link] [source]

Thanks everybody for suggestions and explanations. I appreciate your insightful answers!