SQLite Forum

Simple COUNT/GROUP BY query using a temporary table

(1) By anonymous on 2020-12-23 14:11:09

I have a large table with a schema like this (no other indexes are present):

CREATE TABLE data(
    x    TEXT PRIMARY KEY,
    type INTEGER
) WITHOUT ROWID;

When I execute the query SELECT COUNT(*), type FROM data GROUP BY type, SQLite apparently creates a huge temporary table instead of doing the aggregation/grouping in one pass.

sqlite> EXPLAIN QUERY PLAN SELECT COUNT(*), type FROM data GROUP BY type;
QUERY PLAN
|--SCAN TABLE data
`--USE TEMP B-TREE FOR GROUP BY

sqlite> EXPLAIN SELECT COUNT(*), type FROM data GROUP BY type;
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     41    0                    0   Start at 41
1     SorterOpen     1     1     0     k(1,B)         0
2     Integer        0     4     0                    0   r[4]=0; clear abort flag
3     Null           0     7     7                    0   r[7..7]=NULL
4     Gosub          6     37    0                    0
5     OpenRead       2     2     0     k(1,)          0   root=2 iDb=0; sqlite_autoindex_data_1
6     Explain        6     0     0     SCAN TABLE data  0
7     Rewind         2     12    9     0              0
8       Column         2     1     9                    0   r[9]=data.type
9       MakeRecord     9     1     10                   0   r[10]=mkrec(r[9])
10      SorterInsert   1     10    0                    0   key=r[10]
11    Next           2     8     0                    1
12    OpenPseudo     3     9     1                    0   1 columns in r[9]
13    SorterSort     1     40    0                    0   GROUP BY sort
14      SorterData     1     9     3                    0   r[9]=data
15      Column         3     0     8                    0   r[8]=
16      Compare        7     8     1     k(1,B)         0   r[7] <-> r[8]
17      Jump           18    22    18                   0
18      Move           8     7     1                    0   r[7]=r[8]
19      Gosub          5     31    0                    0   output one row
20      IfPos          4     40    0                    0   if r[4]>0 then r[4]-=0, goto 40; check abort flag
21      Gosub          6     37    0                    0   reset accumulator
22      AggStep        0     0     1     count(0)       0   accum=r[1] step(r[0])
23      If             3     25    0                    0
24      Column         3     0     2                    0   r[2]=data.type
25      Integer        1     3     0                    0   r[3]=1; indicate data in accumulator
26    SorterNext     1     14    0                    0
27    Gosub          5     31    0                    0   output final row
28    Goto           0     40    0                    0
29    Integer        1     4     0                    0   r[4]=1; set abort flag
30    Return         5     0     0                    0
31    IfPos          3     33    0                    0   if r[3]>0 then r[3]-=0, goto 33; Groupby result generator entry point
32    Return         5     0     0                    0
33    AggFinal       1     0     0     count(0)       0   accum=r[1] N=0
34    Copy           1     11    1                    0   r[11..12]=r[1..2]
35    ResultRow      11    2     0                    0   output=r[11..12]
36    Return         5     0     0                    0   end groupby result generator
37    Null           0     1     2                    0   r[1..2]=NULL
38    Integer        0     3     0                    0   r[3]=0; indicate accumulator empty
39    Return         6     0     0                    0
40    Halt           0     0     0                    0
41    Transaction    0     0     3     0              1   usesStmtJournal=0
42    Goto           0     1     0                    0

This is really obvious from looking at process memory consumption when PRAGMA temp_store = MEMORY is enabled: the process indeed consumes many gigabytes of RAM (the original database is roughly the same size).
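For reference, the observation can be reproduced from the shell. This is just a sketch; the exact set of statistics printed by .stats varies between CLI versions:

sqlite> PRAGMA temp_store = MEMORY;
sqlite> .stats on
sqlite> SELECT COUNT(*), type FROM data GROUP BY type;

With .stats on, the shell prints memory statistics after each statement; the highwater figure should reflect the in-memory temp B-tree.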

Am I doing something wrong, or is this a known limitation?

(2.1) By Keith Medcalf (kmedcalf) on 2020-12-23 14:59:38 edited from 2.0 in reply to 1

There are only three possible ways to compute the results you have requested, given a completely unconstrained intelligent solver (such as a human computer programmer):

  1. Iterate the source table in order by type, outputting each aggregate as it is computed;

  2. Create a temporary table containing all the type values in sorted order, then read the sorted table, outputting each aggregate as it is computed; or,

  3. Create a temporary tabulation table, traverse the source table in whatever order is convenient while updating the appropriate tabulation row, and at the end output the tabulation table.

Method 1 cannot be used because the table is not in order by type.

Method 3 cannot be used because SQLite does not implement that method. (I do not think it is presently an available choice.)

That leaves Method 2, which is how the problem is being solved.

So by "limitation" do you mean that method 3 is not being used/available as a solution method?

(3) By anonymous on 2020-12-23 15:14:03 in reply to 2.1

Yes.

I forgot to clarify that the number of distinct type values is very small: integers from 0 to 3. That makes Method 3 even more suitable here, as its extra memory requirement is only O(number of distinct type values).

(4) By Keith Medcalf (kmedcalf) on 2020-12-23 16:05:32 in reply to 3

An outside observer cannot possibly know the cardinality of the type column.

If this column were indexed then there could be helpful index statistics available. However, if there were an index on this column, the rows could be traversed in order using Method 1 and no temporary table would be required at all.
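To illustrate (a sketch: idx_data_type is a made-up name, and the exact EXPLAIN QUERY PLAN wording varies between SQLite versions), with such an index the rows arrive already sorted by type, so the temp B-tree line disappears from the plan:

sqlite> CREATE INDEX idx_data_type ON data(type);
sqlite> EXPLAIN QUERY PLAN SELECT COUNT(*), type FROM data GROUP BY type;
QUERY PLAN
`--SCAN TABLE data USING COVERING INDEX idx_data_type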

(6) By anonymous on 2020-12-23 16:24:05 in reply to 4

Absolutely. But when no index is available, Method 3 beats Method 2 in both runtime and memory complexity regardless of the type cardinality, right? Sorting takes O(N log N) time and O(N) temporary space, while tabulating takes O(N) time and only O(K) space for K distinct values. Low cardinality is simply an extreme case where it is much, much better (not that low cardinality is artificial and never happens in real cases, though).

(7) By Keith Medcalf (kmedcalf) on 2020-12-23 16:40:33 in reply to 6

Yes, but as far as I know SQLite3 does not presently implement that solution method. Nor do I know the complexity of identifying the cases where it could be applied.

As a generalization it would probably be a good thing, but I cannot estimate the complexity of a proper implementation, or the cost of supporting it going forward, since such an optimization would have to be proven bug-free and then maintained for a number of years.

(5.1) By Keith Medcalf (kmedcalf) on 2020-12-23 16:41:46 edited from 5.0 in reply to 3

Note that if you really wanted to you could implement Method 3 as follows:

begin;
create temporary table tabulation (
    type  integer not null unique,
    count integer default 1
);
insert into tabulation (type)
    select type from data where type is not null
    on conflict (type) do update set count = count + 1;
select * from tabulation;
rollback;

NB: Dealing with NULL is problematic, so this simply does away with that problem by excluding NULL values of type.
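If the NULL bucket is actually wanted, a minimal follow-up (assuming one extra scan of the table is acceptable) is to count the NULLs on their own:

select count(*) from data where type is null;

A UNIQUE constraint treats NULLs as distinct from one another, so without the filter the upsert would insert a fresh tabulation row for every NULL instead of accumulating them. The WHERE clause also satisfies SQLite's rule that a SELECT feeding an INSERT ... ON CONFLICT must end with a WHERE clause, which resolves a parsing ambiguity with a join's ON.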

(8) By J-L Hainaut (JLHainaut) on 2020-12-25 18:01:30 in reply to 3

Under this strict condition on the value set of "type", this query should work (not checked):

select sum(case when type = 0 then 1 else 0 end) as Type0,
       sum(case when type = 1 then 1 else 0 end) as Type1,
       ...,
       sum(case when type not in (0, 1, 2, 3) then 1 else 0 end) as Other
from data;
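For what it's worth, since there is no GROUP BY here, the planner has nothing to sort, so the query should run in a single pass with constant extra memory; it is essentially Method 3 spelled out by hand for a known, fixed set of type values. The expected plan (wording varies by version) is a bare table scan with no temp B-tree line:

sqlite> EXPLAIN QUERY PLAN SELECT sum(case when type = 0 then 1 else 0 end) FROM data;
QUERY PLAN
`--SCAN TABLE data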