big performance hit when querying 2 values in a single query instead of two

(1) By anonymous on 2021-11-21 20:26:55 [source]

i have the following schema filled with data:

CREATE TABLE pro_comment( id int, post_id int, parent_id int, content text, created int, up int, down int, confidence real, name text, user_id int, mark int, item_id int ); CREATE INDEX pro_comment_id on pro_comment (id); CREATE INDEX pro_comment_post_id__created on pro_comment (post_id, created); CREATE INDEX pro_comment_name on pro_comment (name);

and the following queries and their results:

select 'max id:', max(id) from pro_comment; max id:|52888040 Run Time: real 0.000 user 0.000000 sys 0.000082

select 'count distinct:', count(distinct id) from pro_comment; count distinct:|46090875 Run Time: real 6.416 user 5.798688 sys 0.556307

select 'max id & count distinct:', max(id), count(distinct id) from pro_comment; max id & count distinct:|52888040|46090875 Run Time: real 55.189 user 52.214868 sys 2.169683

so when i select max(id) and count(distinct id) separately, it takes <7 seconds; when i select them in the same query it takes 55 seconds which seems odd to me.

i am not certain if this is a bug or already explained somewhere. if it is not a bug, i'd appreciate if someone could link me some explanation for this behavior, so i can dodge it in the future.

observerd on debian, sqlite3 version: 3.36.0 2021-06-18 18:36:39 5c9a6c06871cb9fe42814af9c039eb6da5427a6ec28f187af7ebfb62eafaalt1

(2) By Keith Medcalf (kmedcalf) on 2021-11-21 20:54:00 in reply to 1 [link] [source]

Use EXPLAIN or EXPLAIN QUERY PLAN preface to the queries to see what is happening. In the CLI you can also use .eqp on which automagically does explain query plan or .eqp full to automagic explain.

Try using this form:

select (select max(id) from pro_comment),
       (select count(distinct id) from pro_comment)
;

This allows each scalar subquery to be optimized (executed) independently. Using both max(id) and count(distinct id) in the same single query prevents the optimizer from applying some optimizations (particulary the minmax optimization).

However, even as expressed, the total time taken for the combined query will be the same. Your last select statement select 'max id & count distinct:', max(id), count(distinct id) from pro_comment; should not take more time than a mere select count(distinct id) from pro_comment; since that requires a full scan of the table -- I cannot explain what is happening on the computer that is using the additional 47 seconds of CPU or 2 seconds of system time other than to observe that some OTHER process is going full-ninja-kaboom using your CPU and I/O.

(4) By Keith Medcalf (kmedcalf) on 2021-11-21 21:50:37 in reply to 2 [link] [source]

Interesting.

For some reason the optimizer chooses to use a separate b-tree for distinct even though there is already an index. It does this even for the base query select count(distinct id) from pro_comment sometimes, even though scanning the existing index would be more efficient (lower cost).

Generating the separate b-tree requires an additional scan for no benefit.

I presume that this is because the optimizer is not considering the more direct solution of merely doing a single scan of the index to generate the result.

(3) By Simon Slavin (slavin) on 2021-11-21 21:15:33 in reply to 1 [link] [source]

One SELECT is meant to return a number of rows, and the values on each row relate to those rows. In other words, SQL does processing based on that idea: it tries to use one operation per table to get the results you asked for. Let's look at your SELECT:

select 'max id & count distinct:', max(id), count(distinct id) from pro_comment;

The max(id) only needs one row to be retrieved: the one which has the biggest value for id. Since there's an index on this value, it can do this by looking at whatever row is last in this index. Fast and easy to find.

But then in the same query you ask for count(distinct id). Which requires more than one row to be read from the table. It has to scan a whole index. Well, that's okay, it has the index it needs.

But in the combined query, SQL is meant to retrieve both these figures with the same operation. That could be a lot more complicated. Let's see what it really does. Let's look at EXPLAIN QUERY PLAN on all three of those queries:

sqlite> EXPLAIN QUERY PLAN select 'max id:', max(id) from pro_comment;
QUERY PLAN
`--SEARCH pro_comment USING COVERING INDEX pro_comment_id
sqlite> EXPLAIN QUERY PLAN select 'count distinct:', count(distinct id) from pro_comment;
QUERY PLAN
`--SCAN pro_comment USING COVERING INDEX pro_comment_id
sqlite> EXPLAIN QUERY PLAN select 'max id & count distinct:', max(id), count(distinct id) from pro_comment;
QUERY PLAN
|--USE TEMP B-TREE FOR count(DISTINCT)
`--SCAN pro_comment USING COVERING INDEX pro_comment_id

As you see, SQLite, in trying to get both figures from the same search, thinks it has to make a temporary index of the data. The solution is to accept that you want two different figures that aren't related, and ask for them in separate statements. Which is what you did yourself in your demonstration of the problem. Well done.

It could be that this particular optimisation is easy and fast to identify. In which case, the developers (who are reading your thread) might add it. But if you have SQLite check for every possible optimisation every time it does a search, every search is going to take a long time while SQLite checks to see if it qualifies for every possible optimisation. Which, for most SELECTs most of the time, will slow SQLite down.