SQLite Forum

big performance hit when querying 2 values in a single query instead of two
Login

big performance hit when querying 2 values in a single query instead of two

(1) By anonymous on 2021-11-21 20:26:55

i have the following schema filled with data:

<code>CREATE TABLE pro_comment(
	id         int,
	post_id    int,
	parent_id  int,
	content    text,
	created    int,
	up         int,
	down       int,
	confidence real,
	name       text,
	user_id    int,
	mark       int,
	item_id    int
);
CREATE INDEX pro_comment_id on pro_comment (id);
CREATE INDEX pro_comment_post_id__created on pro_comment (post_id, created);
CREATE INDEX pro_comment_name on pro_comment (name);</code>



and the following queries and their results:

<code>select 'max id:', max(id) from pro_comment;
max id:|52888040
Run Time: real 0.000 user 0.000000 sys 0.000082</code>

<code>select 'count distinct:', count(distinct id) from pro_comment;
count distinct:|46090875
Run Time: real 6.416 user 5.798688 sys 0.556307</code>

<code>select 'max id & count distinct:', max(id), count(distinct id) from pro_comment;
max id & count distinct:|52888040|46090875
Run Time: real 55.189 user 52.214868 sys 2.169683</code>



so when i select max(id) and count(distinct id) separately, it takes <7 seconds;
when i select them in the same query it takes 55 seconds which seems odd to me.

i am not certain if this is a bug or already explained somewhere. 
if it is not a bug, i'd appreciate if someone could link me some explanation for this behavior, so i can dodge it in the future.

observerd on debian, sqlite3 version:
3.36.0 2021-06-18 18:36:39 5c9a6c06871cb9fe42814af9c039eb6da5427a6ec28f187af7ebfb62eafaalt1

(2) By Keith Medcalf (kmedcalf) on 2021-11-21 20:54:00 in reply to 1 [link]

Use `EXPLAIN` or `EXPLAIN QUERY PLAN` preface to the queries to see what is happening.  In the CLI you can also use `.eqp on` which automagically does `explain query plan` or `.eqp full` to automagic `explain`.

Try using this form:

```
select (select max(id) from pro_comment),
       (select count(distinct id) from pro_comment)
;
```

This allows each scalar subquery to be optimized (executed) independently.  Using both max(id) and count(distinct id) in the same single query prevents the optimizer from applying some optimizations (particulary the minmax optimization).

However, even as expressed, the total time taken for the combined query will be the same.  Your last select statement `select 'max id & count distinct:', max(id), count(distinct id) from pro_comment;` should not take more time than a mere `select count(distinct id) from pro_comment;` since that requires a full scan of the table -- I cannot explain what is happening on the computer that is using the additional 47 seconds of CPU or 2 seconds of system time other than to observe that some OTHER process is going full-ninja-kaboom using your CPU and I/O.

(3) By Simon Slavin (slavin) on 2021-11-21 21:15:33 in reply to 1 [link]

One SELECT is meant to return a number of rows, and the values on each row relate to those rows.  In other words, SQL does processing based on that idea: it tries to use one operation per table to get the results you asked for.  Let's look at your SELECT:

<code>    select 'max id & count distinct:', max(id), count(distinct id)
        from pro_comment;</code>

The <code>max(id)</code> only needs one row to be retrieved: the one which has the biggest value for <code>id</code>.  Since there's an index on this value, it can do this by looking at whatever row is last in this index.  Fast and easy to find.

But then in the same query you ask for <code>count(distinct id)</code>.  Which requires more than one row to be read from the table.  It has to scan a whole index.  Well, that's okay, it has the index it needs.

But in the combined query, SQL is meant to retrieve both these figures with the same operation.  That could be a lot more complicated.  Let's see what it really does.  Let's look at EXPLAIN QUERY PLAN on all three of those queries:

<pre>sqlite> EXPLAIN QUERY PLAN select 'max id:', max(id) from pro_comment;
QUERY PLAN
`--SEARCH pro_comment USING COVERING INDEX pro_comment_id
sqlite> EXPLAIN QUERY PLAN select 'count distinct:', count(distinct id) from pro_comment;
QUERY PLAN
`--SCAN pro_comment USING COVERING INDEX pro_comment_id
sqlite> EXPLAIN QUERY PLAN select 'max id & count distinct:', max(id), count(distinct id) from pro_comment;
QUERY PLAN
|--USE TEMP B-TREE FOR count(DISTINCT)
`--SCAN pro_comment USING COVERING INDEX pro_comment_id</pre>

As you see, SQLite, in trying to get both figures from the same search, thinks it has  to make a temporary index of the data.  The solution is to accept that you want two different figures that aren't related, and ask for them in separate statements.  Which is what you did yourself in your demonstration of the problem.  Well done.

It could be that this particular optimisation is easy and fast to identify.  In which case, the developers (who are reading your thread) might add it.  But if you have SQLite check for every possible optimisation every time it does a search, every search is going to take a long time while SQLite checks to see if it qualifies for every possible optimisation.  Which, for most SELECTs most of the time, will slow SQLite down.

(4) By Keith Medcalf (kmedcalf) on 2021-11-21 21:50:37 in reply to 2 [link]

Interesting.  

For some reason the optimizer chooses to use a separate b-tree for distinct even though there is already an index.  It does this even for the base query `select count(distinct id) from pro_comment` sometimes, even though scanning the existing index would be more efficient (lower cost).  

Generating the separate b-tree requires an additional scan for no benefit.

I presume that this is because the optimizer is not considering the more direct solution of merely doing a single scan of the index to generate the result.