SUM() of index values

(1.2) By Wout Mertens (wmertens) on 2022-05-01 10:40:30 edited from 1.1 [link] [source]

Hi,

I came across a Twitter thread about summing the length of blobs stored in a DB. It was very slow because it needed to read all the blobs.

I proposed using an index on length(chunk), but EXPLAIN doesn't show that this index is used to calculate the sum.

However, in the thread he shows that it seems to work: https://twitter.com/deliberatecoder/status/1520562387979542529

How can this be?

(sqlite 3.38)

sqlite> create table f(id, chunk);
sqlite> create index l on f(id, length(chunk));
sqlite> explain query plan select sum(length(chunk)) from f group by id;
QUERY PLAN
`--SCAN f USING INDEX l
sqlite> explain select sum(length(chunk)) from f group by id;
addr  opcode         p1    p2    p3    p4             p5  comment      
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     36    0                    0   Start at 36
1     Noop           1     2     0                    0   
2     Integer        0     4     0                    0   r[4]=0; clear abort flag
3     Null           0     7     7                    0   r[7..7]=NULL
4     Gosub          6     32    0                    0   
5     OpenRead       0     2     0     2              0   root=2 iDb=0; f
6     OpenRead       2     3     0     k(3,,,)        0   root=3 iDb=0; l
7     Explain        7     0     0     SCAN f USING INDEX l  0   
8     Rewind         2     22    9     0              0   
9       DeferredSeek   2     0     0                    0   Move 0 to 2.rowid if needed
10      Column         2     0     8                    0   r[8]=f.id
11      Compare        7     8     1     k(1,B)         0   r[7] <-> r[8]
12      Jump           13    17    13                   0   
13      Move           8     7     1                    0   r[7]=r[8]
14      Gosub          5     26    0                    0   output one row
15      IfPos          4     35    0                    0   if r[4]>0 then r[4]-=0, goto 35; check abort flag
16      Gosub          6     32    0                    0   reset accumulator
17      Column         0     1     10                   64  r[10]=f.chunk
18      Function       0     10    9     length(1)      0   r[9]=func(r[10])
19      AggStep        0     9     1     sum(1)         1   accum=r[1] step(r[9])
20      Integer        1     3     0                    0   r[3]=1; indicate data in accumulator
21    Next           2     9     0                    1   
22    Gosub          5     26    0                    0   output final row
23    Goto           0     35    0                    0   
24    Integer        1     4     0                    0   r[4]=1; set abort flag
25    Return         5     0     0                    0   
26    IfPos          3     28    0                    0   if r[3]>0 then r[3]-=0, goto 28; Groupby result generator entry point
27    Return         5     0     0                    0   
28    AggFinal       1     1     0     sum(1)         0   accum=r[1] N=1
29    Copy           1     11    0                    0   r[11]=r[1]
30    ResultRow      11    1     0                    0   output=r[11]
31    Return         5     0     0                    0   end groupby result generator
32    Null           0     1     2                    0   r[1..2]=NULL
33    Integer        0     3     0                    0   r[3]=0; indicate accumulator empty
34    Return         6     0     0                    0   
35    Halt           0     0     0                    0   
36    Transaction    0     0     2     0              1   usesStmtJournal=0
37    Goto           0     1     0                    0

If I'm reading this right, in 10 it grabs the id from the index (I'm guessing that's the first value, 2, even though in the description it says f.id).

However, in 17 it definitely reads the chunk value in memory and then calculates the length.

So:

Should SUM() not use indexed values?
How can this be fast if that's the query plan?

Cheers,

Wout.

(2) By Simon Slavin (slavin) on 2022-05-01 18:19:48 in reply to 1.2 [link] [source]

I can't answer your specific question, but I would have done it using a generated column:

https://www.sqlite.org/gencol.html

something like

CREATE TABLE f( id INTEGER PRIMARY KEY, chunkLength INT GENERATED ALWAYS AS (length(chunk)) STORED, chunk BLOB );

Haven't checked it using EXPLAIN but it should be efficient, at the cost of taking a few more bytes per row.

(3) By Daniel H (dholth) on 2022-05-01 18:45:38 in reply to 2 [link] [source]

It's probably forced to read at least one page per 32kb chunk no matter if the length is generated or not. However if every column used in the query (id, chunk id, length) it needn't touch the table.

(4) By Wout Mertens (wmertens) on 2022-05-01 19:06:24 in reply to 3 [source]

Well, isn't that the case here? All the needed data is in the index and yet it uses the table?

(5) By Wout Mertens (wmertens) on 2022-05-01 19:07:37 in reply to 2 [link] [source]

I think this will cause reading all the very full rows and still being slow?

(6.1) By Daniel H (dholth) on 2022-05-01 21:11:36 edited from 6.0 in reply to 5 [link] [source]

Do you have lots of rows in your test db? The original also has id, chunk id, blob

(7) By Richard Hipp (drh) on 2022-05-01 22:22:59 in reply to 1.2 [link] [source]

On instruction 17, the P5 argument is 64 (or 0x40) which is the OPFLAG_LENGTHARG flag. According to the documentation:

"If the OPFLAG_LENGTHARG and OPFLAG_TYPEOFARG bits are set on P5 then the result is guaranteed to only be used as the argument of a length() or typeof() function, respectively. The loading of large blobs can be skipped for length() and all content loading can be skipped for typeof()."

If the value is text, then the entire content must be loaded, since the length() function returns the character length of text, not the byte length. So it has to scan for multi-byte characters. But if the content is BLOB, it does not actually read the data. It only reads the byte-length of the data out of the row header, because that is the only information that the length() function needs for a BLOB. the