big performance hit when querying 2 values in a single query instead of two
i have the following schema filled with data:
CREATE TABLE pro_comment(
CREATE INDEX pro_comment_id on pro_comment (id);
CREATE INDEX pro_comment_post_id__created on pro_comment (post_id, created);
CREATE INDEX pro_comment_name on pro_comment (name);
and the following queries and their results:
select 'max id:', max(id) from pro_comment;
Run Time: real 0.000 user 0.000000 sys 0.000082
select 'count distinct:', count(distinct id) from pro_comment;
Run Time: real 6.416 user 5.798688 sys 0.556307
select 'max id & count distinct:', max(id), count(distinct id) from pro_comment;
max id & count distinct:|52888040|46090875
Run Time: real 55.189 user 52.214868 sys 2.169683
so when i select max(id) and count(distinct id) separately, it takes <7 seconds; when i select them in the same query it takes 55 seconds which seems odd to me.
i am not certain if this is a bug or already explained somewhere. if it is not a bug, i'd appreciate if someone could link me some explanation for this behavior, so i can dodge it in the future.
observerd on debian, sqlite3 version: 3.36.0 2021-06-18 18:36:39 5c9a6c06871cb9fe42814af9c039eb6da5427a6ec28f187af7ebfb62eafaalt1
EXPLAIN QUERY PLAN preface to the queries to see what is happening. In the CLI you can also use
.eqp on which automagically does
explain query plan or
.eqp full to automagic
Try using this form:
select (select max(id) from pro_comment), (select count(distinct id) from pro_comment) ;
This allows each scalar subquery to be optimized (executed) independently. Using both max(id) and count(distinct id) in the same single query prevents the optimizer from applying some optimizations (particulary the minmax optimization).
However, even as expressed, the total time taken for the combined query will be the same. Your last select statement
select 'max id & count distinct:', max(id), count(distinct id) from pro_comment; should not take more time than a mere
select count(distinct id) from pro_comment; since that requires a full scan of the table -- I cannot explain what is happening on the computer that is using the additional 47 seconds of CPU or 2 seconds of system time other than to observe that some OTHER process is going full-ninja-kaboom using your CPU and I/O.
One SELECT is meant to return a number of rows, and the values on each row relate to those rows. In other words, SQL does processing based on that idea: it tries to use one operation per table to get the results you asked for. Let's look at your SELECT:
select 'max id & count distinct:', max(id), count(distinct id)
max(id) only needs one row to be retrieved: the one which has the biggest value for
id. Since there's an index on this value, it can do this by looking at whatever row is last in this index. Fast and easy to find.
But then in the same query you ask for
count(distinct id). Which requires more than one row to be read from the table. It has to scan a whole index. Well, that's okay, it has the index it needs.
But in the combined query, SQL is meant to retrieve both these figures with the same operation. That could be a lot more complicated. Let's see what it really does. Let's look at EXPLAIN QUERY PLAN on all three of those queries:
sqlite> EXPLAIN QUERY PLAN select 'max id:', max(id) from pro_comment; QUERY PLAN `--SEARCH pro_comment USING COVERING INDEX pro_comment_id sqlite> EXPLAIN QUERY PLAN select 'count distinct:', count(distinct id) from pro_comment; QUERY PLAN `--SCAN pro_comment USING COVERING INDEX pro_comment_id sqlite> EXPLAIN QUERY PLAN select 'max id & count distinct:', max(id), count(distinct id) from pro_comment; QUERY PLAN |--USE TEMP B-TREE FOR count(DISTINCT) `--SCAN pro_comment USING COVERING INDEX pro_comment_id
As you see, SQLite, in trying to get both figures from the same search, thinks it has to make a temporary index of the data. The solution is to accept that you want two different figures that aren't related, and ask for them in separate statements. Which is what you did yourself in your demonstration of the problem. Well done.
It could be that this particular optimisation is easy and fast to identify. In which case, the developers (who are reading your thread) might add it. But if you have SQLite check for every possible optimisation every time it does a search, every search is going to take a long time while SQLite checks to see if it qualifies for every possible optimisation. Which, for most SELECTs most of the time, will slow SQLite down.
For some reason the optimizer chooses to use a separate b-tree for distinct even though there is already an index. It does this even for the base query
select count(distinct id) from pro_comment sometimes, even though scanning the existing index would be more efficient (lower cost).
Generating the separate b-tree requires an additional scan for no benefit.
I presume that this is because the optimizer is not considering the more direct solution of merely doing a single scan of the index to generate the result.