SQLite Forum

Requesting optimization for index expression case
Login

Requesting optimization for index expression case

(1) By Deon Brewis (deonb) on 2021-06-08 19:01:44 [link] [source]

In the following example, where I have a simple table with an expression index and query:

DROP TABLE IF EXISTS test;
CREATE TABLE test(x text);
CREATE INDEX testIndex ON test(length(x));

EXPLAIN SELECT length(x) FROM test INDEXED BY testIndex;

It uses the execution plan below, which opens both the table as well as index and do a DeferredSeek of the table. Note, it doesn't actually access anything from the main table (i.e. it doesn't retrieve the value and execute length on it - it merely returns the precomputed value from the index (good) ), but it still seeks into the main table:

addr	opcode	p1	p2	p3	p4	p5	comment
0	Init	0	9	0		0	Start at 9
1	OpenRead	0	2	0	1	0	root=2 iDb=0; test
2	OpenRead	1	3	0	k(2,,)	0	root=3 iDb=0; testIndex
3	Rewind	1	8	1	0	0	
4	DeferredSeek	1	0	0		0	Move 0 to 1.rowid if needed
5	Column	1	0	1		0	r[1]=
6	ResultRow	1	1	0		0	output=r[1]
7	Next	1	4	0		1	
8	Halt	0	0	0		0	
9	Transaction	0	0	68	0	1	usesStmtJournal=0
10	Goto	0	1	0		0	

IF I however change the index to include the column itself:

CREATE INDEX testIndex ON test(length(x), x);

It no longer does that, and instead changes the execution plan to this:

addr	opcode	p1	p2	p3	p4	p5	comment
0	Init	0	7	0		0	Start at 7
1	OpenRead	1	3	0	k(3,,,)	0	root=3 iDb=0; testIndex
2	Rewind	1	6	1	0	0	
3	Column	1	0	1		0	r[1]=
4	ResultRow	1	1	0		0	output=r[1]
5	Next	1	3	0		1	
6	Halt	0	0	0		0	
7	Transaction	0	0	71	0	1	usesStmtJournal=0
8	Goto	0	1	0		0	

This is much more ideal since it doesn't do the deferred seek and immediately return the computer result stored in the table - which it already has. However, note that it again doesn't actually read the second column I added, nor obviously recompute anything from it, it merely needs to be present in order for the index to return the first column directly.

Is it possible to remove this restriction so that the query engine can use the second execution plan without that second unused column?

Related to this, even with the 2nd column present, the optimizer doesn't recognize that testIndex is a value way to perform:

SELECT length(x) FROM test
without explicitly forcing the INDEXED BY

(2) By Richard Hipp (drh) on 2021-06-08 19:09:03 in reply to 1 [link] [source]

The whole point of OP_DeferredSeek is that it does not do a seek on the table right away, but defers that seek until it is actually needed, and if nothing is ever read from the table, the seek never occurs.

(3) By Deon Brewis (deonb) on 2021-06-08 20:14:39 in reply to 2 [source]

I see. What about the second part of the question?

Related to this, even with the 2nd column present, the optimizer doesn't recognize that testIndex is a value way to perform:

SELECT length(x) FROM test

without explicitly forcing the INDEXED BY.

(4) By Richard Hipp (drh) on 2021-06-08 20:20:35 in reply to 3 [link] [source]

Have you measured the speed to see if your preferred query plan really is faster than the query plan that SQLite actually generates?

(5) By Keith Medcalf (kmedcalf) on 2021-06-08 20:44:42 in reply to 4 [link] [source]

sqlite> create table x(x text);
sqlite> insert into x select value from wholenumber where value between 1 and 100000000;
sqlite> create index i on x(length(x));
sqlite> .eqp on
sqlite> .timer on
sqlite> select sum(length(x)) from x;
QUERY PLAN
`--SCAN x (~1048576 rows)
┌────────────────┐
│ sum(length(x)) │
├────────────────┤
│ 788888898      │
└────────────────┘
Run Time: real 9.542 user 9.531250 sys 0.000000
sqlite> select sum(length(x)) from x indexed by i;
QUERY PLAN
`--SCAN x USING INDEX i (~1048576 rows)
┌────────────────┐
│ sum(length(x)) │
├────────────────┤
│ 788888898      │
└────────────────┘
Run Time: real 12.403 user 12.375000 sys 0.000000
sqlite> create index j on x(length(x),x);
Run Time: real 30.600 user 61.453125 sys 9.109375
sqlite> select sum(length(x)) from x indexed by j;
QUERY PLAN
`--SCAN x USING COVERING INDEX j (~1048576 rows)
┌────────────────┐
│ sum(length(x)) │
├────────────────┤
│ 788888898      │
└────────────────┘
Run Time: real 9.536 user 9.531250 sys 0.000000
sqlite>

The fastest solution is to do a table scan and compute the result.

(6) By Deon Brewis (deonb) on 2021-06-08 21:45:54 in reply to 5 [link] [source]

To answer Richard - it's orders of magnitude faster with the index than table scan + recompute across the table (milliseconds vs. minutes).

Keith - in the real example, x is a 1kb blob field, and length(x) is a compute-intensive operation over that blob. This makes the index rows MUCH smaller than the table rows (200 per page instead of 4 per page), so it ends up being a lot faster.

However, I never manage to get the indexed picked up automatically on actual database queries without forcing it in. This is actually the main issue - the first part was a red herring - sorry. I figured since it repros for the small field as well I thought it was just a systemic issue rather than a miss based on analysis.

Note that if I throw a WHERE into the query it DOES indeed pick the index. E.g. This picks the index:

SELECT length(x) FROM test WHERE length(x) > 50;

But in this case I don't want to filter the query, I just want to use the index for the purpose of picking up the covering pre-computed result of the expression column, which I can't seem to get to work without forcing.

If you could confirm that the optimizer is supposed to at least sometimes be capable of automatically picking the index in such a case (no where clause, but with covering expression columns), I'll try and isolate my repro further.