blob IS NULL significantly slower than length(blob) IS NULL

(1) By anonymous on 2022-10-13 11:24:06 [link] [source]

I came across the issue mentioned in this post where checking a blob for null in a query is quite slow.

The post suggests a workaround using length(blob) IS NULL. There are many timings in the above link, but these are my timings on a small table with blob sizes around 100k (SQLite version 3.39.4):

sqlite> select count(*) from movie;
606
Run Time: real 0.016 user 0.001014 sys 0.001111
sqlite> select count(*) from movie where poster is null;
19
Run Time: real 0.044 user 0.014245 sys 0.027969
sqlite> select count(*) from movie where length(poster) is null;
19
Run Time: real 0.002 user 0.000662 sys 0.000883

In my case, the workaround is 22x faster. This is rather counterintuitive as a simple IS NULL should be just as fast.

(2) By David Raymond (dvdraymond) on 2022-10-13 12:56:01 in reply to 1 [link] [source]

So when you do an explain on the two things you can see there's a slight difference.

explain select count(*) from t where content is null;
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
...
3     Rewind         0     8     0                    0
4       Column         0     1     2                    0   r[2]=t.content
5       NotNull        2     7     0                    0   if r[2]!=NULL goto 7
6       AggStep        0     0     1     count(0)       0   accum=r[1] step(r[0])
7     Next           0     4     0                    1
...

explain select count(*) from t where length(content) is null;
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
...
3     Rewind         0     9     0                    0
4       Column         0     1     3                    64  r[3]=t.content
5       Function       0     3     2     length(1)      0   r[2]=func(r[3])
6       NotNull        2     8     0                    0   if r[2]!=NULL goto 8
7       AggStep        0     0     1     count(0)       0   accum=r[1] step(r[0])
8     Next           0     4     0                    1
...

You can see that the p5 argument to Column is different. If we look at the Column opcode information, we see this:

"If the OPFLAG_LENGTHARG and OPFLAG_TYPEOFARG bits are set on P5 then the result is guaranteed to only be used as the argument of a length() or typeof() function, respectively. The loading of large blobs can be skipped for length() and all content loading can be skipped for typeof()."

So it knows the field is only getting used in a call to length(), so it doesn't bother loading it at all, because the length is part of the field header.

Going by the note we can also test this with

select count(*) from t where typeof(content) = 'null';

and you'll find that is also fast.

Apparently that optimization only goes for when the field is used as the argument to length() or typeof() and not when it's only used with a "is null" or "is not null" postfix operator.

(3) By ddevienne on 2022-10-13 13:15:50 in reply to 2 [link] [source]

Nice analysis, thanks.

that optimization only goes for [...] length() or typeof()

Those functions only look at the record header, not the record body, thus they don't indeed need the values.
But the same goes for the IS [NOT] NULL operators, so in theory the same optimization should apply (eventually).
At least that's my understanding of this.

(5) By David Raymond (dvdraymond) on 2022-10-13 15:38:51 in reply to 3 [link] [source]

The team being the awesome group they are, it looks like "eventually" = "today".

https://www.sqlite.org/src/info/cb94350185f555c3

(4) By David Raymond (dvdraymond) on 2022-10-13 13:40:10 in reply to 2 [source]

By the way, there's one typo, and a confusing bit (or second typo) in the description of the Column Opcode

Interpret the data that cursor P1 points to as a structure built using the MakeRecord instruction. (See the MakeRecord opcode for additional information about the format of the data.) Extract the P2-th column from this record. If there are less that (P2+1) values in the record, extract a NULL.

The value extracted is stored in register P3.

If the record contains fewer than P2 fields, then extract a NULL. Or, if the P4 argument is a P4_MEM use the value of the P4 argument as the result.

If the OPFLAG_LENGTHARG and OPFLAG_TYPEOFARG bits are set on P5 then the result is guaranteed to only be used as the argument of a length() or typeof() function, respectively. The loading of large blobs can be skipped for length() and all content loading can be skipped for typeof().

"If there are less that (P2+1)" should be "less than"

And it's a little confusing that the first paragraph says if there are less than P2 +1 values extract NULL, while the third paragraph says fewer than P2 fields extract NULL. If it's 0 based I think they should both be P2+1