index misuse?
(1.1) By wulian on 2021-04-08 07:26:02 edited from 1.0 [source]
Hi everyone,
Consider the following example:
CREATE TEMPORARY TABLE t0(c0 INT);
CREATE TEMPORARY TABLE t1(c0 INT, c1 INT);
CREATE INDEX t1_c0 ON t1(c0);
INSERT into t0(c0) VALUES (1), (-1), (2);
INSERT INTO t1(c0,c1) VALUES (1, 2), (2, -1), (-1, 1);
.wheretrace
SELECT * FROM t0 CROSS JOIN t1 ON t1.c0 == t0.c0 AND t1.c1 == t0.c0;
I enable debugging and use the ".wheretrace" command to obtain wheretrace information, part of the output is as follows:
---- WHERE clause at start of analysis:
TERM-0 56425A4276C0 .E.. left={1:0} op=802 wtFlags=0008 prob=1 prereq=3,1
'-- EQ
|-- {1:0} pTab=56425A430150 fg.af=800000.n
'-- {0:0} pTab=56425A42E260 fg.af=800000.n
TERM-1 56425A4276F8 .E.. left={1:1} op=802 wtFlags=0008 prob=1 prereq=3,1
'-- EQ
|-- {1:1} pTab=56425A430150 fg.af=800000.n
'-- {0:0} pTab=56425A42E260 fg.af=800000.n
TERM-2 56425A427730 VE.. left={0:0} op=802 wtFlags=0003 prob=1 prereq=3,2 iParent=1
'-- EQ
|-- {0:0} pTab=56425A42E260 fg.af=800000.n
'-- {1:1} pTab=56425A430150 fg.af=800000.n
TERM-3 56425A427768 VE.. left={0:0} op=802 wtFlags=0003 prob=1 prereq=3,2 iParent=0
'-- EQ
|-- {0:0} pTab=56425A42E260 fg.af=800000.n
'-- {1:0} pTab=56425A430150 fg.af=800000.n
add: * 0.01.02 t0 f 04000 N 1 cost 271,53,43
TERM-0 56425A427730 VE.. left={0:0} op=802 wtFlags=0003 prob=1 prereq=3,2 iParent=1
'-- EQ
|-- {0:0} pTab=56425A42E260 fg.af=800000.n
'-- {1:1} pTab=56425A430150 fg.af=800000.n
skip: * 0.01.02 t0 f 04000 N 1 cost 271,53,43
TERM-0 56425A427768 VE.. left={0:0} op=802 wtFlags=0003 prob=1 prereq=3,2 iParent=0
'-- EQ
|-- {0:0} pTab=56425A42E260 fg.af=800000.n
'-- {1:0} pTab=56425A430150 fg.af=800000.n
add: * 0.01.00 t0 f 00100 N 0 cost 0,216,200
BEGIN t0.addBtreeIdx(), nEq=0, nSkip=0, rRun=216
END t0.addBtreeIdx(), nEq=0, rc=0
add: * 1.02.01 t1 f 04000 N 1 cost 271,53,43
TERM-0 56425A4276C0 .E.. left={1:0} op=802 wtFlags=0008 prob=1 prereq=3,1
'-- EQ
|-- {1:0} pTab=56425A430150 fg.af=800000.n
'-- {0:0} pTab=56425A42E260 fg.af=800000.n
skip: * 1.02.01 t1 f 04000 N 1 cost 271,53,43
TERM-0 56425A4276F8 .E.. left={1:1} op=802 wtFlags=0008 prob=1 prereq=3,1
'-- EQ
|-- {1:1} pTab=56425A430150 fg.af=800000.n
'-- {0:0} pTab=56425A42E260 fg.af=800000.n
add: * 1.02.01 t1 f 00100 N 0 cost 0,216,180
BEGIN t1.addBtreeIdx(), nEq=0, nSkip=0, rRun=216
END t1.addBtreeIdx(), nEq=0, rc=0
BEGIN t1.addBtreeIdx(t1_c0), nEq=0, nSkip=0, rRun=216
replace: * 1.02.01 t1 f 04000 N 1 cost 271,53,43
TERM-0 56425A4276C0 .E.. left={1:0} op=802 wtFlags=2008 prob=1 prereq=3,1
'-- EQ
|-- {1:0} pTab=56425A430150 fg.af=800000.n
'-- {0:0} pTab=56425A42E260 fg.af=800000.n
with: * 1.02.01 t1.t1_c0 1 f 10201 N 1 cost 0,62,32
TERM-0 56425A4276C0 .E.. left={1:0} op=802 wtFlags=2008 prob=1 prereq=3,1
'-- EQ
|-- {1:0} pTab=56425A430150 fg.af=800000.n
'-- {0:0} pTab=56425A42E260 fg.af=800000.n
delete: * 1.02.01 t1 f 00100 N 0 cost 0,216,180
BEGIN t1.addBtreeIdx(t1_c0), nEq=1, nSkip=0, rRun=62
END t1.addBtreeIdx(t1_c0), nEq=1, rc=0
skip: * 1.02.01 t1.t1_c0 1 f 10201 N 1 cost 0,62,32
TERM-0 56425A4276F8 .E.. left={1:1} op=802 wtFlags=2008 prob=1 prereq=3,1
'-- EQ
|-- {1:1} pTab=56425A430150 fg.af=800000.n
'-- {0:0} pTab=56425A42E260 fg.af=800000.n
BEGIN t1.addBtreeIdx(t1_c0), nEq=1, nSkip=0, rRun=62
END t1.addBtreeIdx(t1_c0), nEq=1, rc=0
END t1.addBtreeIdx(t1_c0), nEq=0, rc=0
0 0.01.02 t0 f 04000 N 1 cost 271,53,43
TERM-0 56425A427730 VE.. left={0:0} op=802 wtFlags=0003 prob=1 prereq=3,2 iParent=1
'-- EQ
|-- {0:0} pTab=56425A42E260 fg.af=800000.n
'-- {1:1} pTab=56425A430150 fg.af=800000.n
1 0.01.00 t0 f 00100 N 0 cost 0,216,200
2 1.02.01 t1.t1_c0 1 f 10201 N 1 cost 0,62,32
TERM-0 56425A4276C0 .E.. left={1:0} op=802 wtFlags=2008 prob=1 prereq=3,1
'-- EQ
|-- {1:0} pTab=56425A430150 fg.af=800000.n
'-- {0:0} pTab=56425A42E260 fg.af=800000.n
It is weird that SQLite tries to use the index on t1.c0 for an equal search t1.c1 == t0.c0. I can not find a test case to get a mistake result caused by this whereloop object, since this whereloop object usually dropped due to the same predict cost as a previous one.
skip: * 1.02.01 t1.t1_c0 1 f 10201 N 1 cost 0,62,32
TERM-0 56425A4276F8 .E.. left={1:1} op=802 wtFlags=2008 prob=1 prereq=3,1
'-- EQ
|-- {1:1} pTab=56425A430150 fg.af=800000.n
'-- {0:0} pTab=56425A42E260 fg.af=800000.n
I suspect that this is a misunderstanding on my side, rather than a bug. Looking forward to your explanation!
(2) By Dan Kennedy (dan) on 2021-04-08 11:18:54 in reply to 1.1 [link] [source]
Isn't it proposing that index to optimize the "t1.c0 == t0.c0" term?
(3) By wulian on 2021-04-08 11:35:18 in reply to 2 [link] [source]
It seems that it also tries using that index to optimize the "t1.c1 == t0.c0" term.
(4) By Gunter Hick (gunter_hick) on 2021-04-08 12:15:00 in reply to 3 [link] [source]
The result of EXPLAIN QUERY PLAN shows that it is using the index to implement the t1.c0 constraint. I think it is checking constraints (on each table) vs indices (on the same tables) and finds that the index is useless for the t1.c1 constraint. But it has to check anyway.
(5.1) By wulian on 2021-04-08 15:46:48 edited from 5.0 in reply to 4 [link] [source]
Yes, the query plan is right.
I believe it needs to check, too. Look at these two whereloop objects.
with: * 1.02.01 t1.t1_c0 1 f 10201 N 1 cost 0,62,32
TERM-0 56425A4276C0 .E.. left={1:0} op=802 wtFlags=2008 prob=1 prereq=3,1
'-- EQ
|-- {1:0} pTab=56425A430150 fg.af=800000.n
'-- {0:0} pTab=56425A42E260 fg.af=800000.n
skip: * 1.02.01 t1.t1_c0 1 f 10201 N 1 cost 0,62,32
TERM-0 56425A4276F8 .E.. left={1:1} op=802 wtFlags=2008 prob=1 prereq=3,1
'-- EQ
|-- {1:1} pTab=56425A430150 fg.af=800000.n
'-- {0:0} pTab=56425A42E260 fg.af=800000.n
In my opinion, SQLite finds index t1_c0 is useless for the t1.c1 constraint only because the second whereloop's cost is not better than the first one. But I think SQLite should skip the second whereloop because this index shouldn't be used for the t1.c1 constraint.
Do you think we can find a test case to make the second whereloop(using index t1_c0 for the "t1.c0==t0.c0" term) has a lower cost than the first one(using index t1_c0 for the "t1.c1==t0.c0" term)? I decrease the cost of the second whereloop by modifying the source code and find it generates a mistake query plan. But I cannot provide a test case for unmodified SQLite.
I found that the function whereScanInit
has related comments, which I quote here:
/*
** Initialize a WHERE clause scanner object. Return a pointer to the
** first match. Return NULL if there are no matches.
**
** The scanner will be searching the WHERE clause pWC. It will look
** for terms of the form "X <op> <expr>" where X is column iColumn of table
** iCur. Or if pIdx!=0 then X is column iColumn of index pIdx. pIdx
** must be one of the indexes of table iCur.
**
** The <op> must be one of the operators described by opMask.
**
** If the search is for X and the WHERE clause contains terms of the
** form X=Y then this routine might also return terms of the form
** "Y <op> <expr>". The number of levels of transitivity is limited,
** but is enough to handle most commonly occurring SQL statements.
**
** If X is not the INTEGER PRIMARY KEY then X must be compatible with
** index pIdx.
*/
In this case, X is t1.c0 and Y is t0.c0. So term "t0.c0=t1.c1" will also be returned by function whereScanNext
.
Maybe that is the reason why SQLite tries using the index on t1.c0 to optimize the "t1.c1=t0.c0" term.
I am confusing about "X must be compatible with index pIdx" in the comments.
From my view, t1.c1 is not compatible with index t1_c0.
Do you think it is a bug or not? Looking forward to your thoughts!
(6.2) By Dan Kennedy (dan) on 2021-04-08 16:02:53 edited from 6.1 in reply to 3 [link] [source]
Oh. Right - it's a transitive term. Your WHERE clause is:
t1.c0 == t0.c0 AND t1.c1 == t0.c0;
which implies:
t1.c0 == t1.c1
(because they're both equal to t0.c0). Therefore it can use an index on t1.c0 for constraints on t1.c1 - like "t1.c1 == t0.c0".
Simple example - index on "y" is used for "x=?":
$ ./sqlite3
SQLite version 3.36.0 2021-04-07 18:17:53
sqlite> CREATE TABLE t1(x, y);
sqlite> CREATE INDEX t1y ON t1(y);
sqlite> explain query plan SELECT * FROM t1 WHERE y=x AND x=?;
QUERY PLAN
`--SEARCH t1 USING INDEX t1y (y=?)
(7) By Hardy on 2021-04-08 16:08:13 in reply to 6.2 [link] [source]
The simplification (optimization) is not correct if t0.c0 is NULL.
(8.2) By wulian on 2021-04-09 06:26:10 edited from 8.1 in reply to 6.2 [link] [source]
Thanks for your explaintion! I understand why function whereScanNext
also return the term "Y <op> <expr>"
.
But, if I force the query planner(by modifying source code) to choose that weird whereloop(using index t1_c0 for constrains t1.c1==t0.c0). It generates incorrect BYTECODE. It first uses the value of t0.c0 as key to search t1.c0 through index t1_c0, then it judge t1.c0 == t0.c0 again. The query result is also incorrect. I think it needs judge "t1.c0 == t1.c1" after index searching.
The query plan and incorrect BYTECODE are shown below.
QUERY PLAN
|--SCAN TABLE t0
`--SEARCH TABLE t1 USING INDEX t1_c0 (c0=?)
addr opcode p1 p2 p3 p4 p5 comment
---- ------------- ---- ---- ---- ------------- -- -------------
0 Init 0 25 0 0 Start at 25
1 OpenRead 0 2 1 1 0 root=2 iDb=1; t0
2 OpenRead 1 3 1 2 0 root=3 iDb=1; t1
3 OpenRead 2 4 1 k(2,,) 2 root=4 iDb=1; t1_c0
4 Explain 4 0 0 SCAN TABLE t0 0
5 Rewind 0 24 0 0
6 Explain 6 0 0 SEARCH TABLE t1 USING INDEX t1_c0 (c0=?) 0
7 Column 0 0 1 0 r[1]=t0.c0
8 IsNull 1 23 0 0 if r[1]==NULL goto 23
9 Affinity 1 1 0 C 0 affinity(r[1])
10 SeekGE 2 23 1 1 0 key=r[1]
11 IdxGT 2 23 1 1 0 key=r[1]
12 DeferredSeek 2 0 1 0 Move 1 to 2.rowid if needed
13 Column 2 0 2 0 r[2]=t1.c0
14 Column 0 0 3 0 r[3]=t0.c0
15 Ne 3 22 2 BINARY-8 83 if r[2]!=r[3] goto 22
16 ReleaseReg 2 1 0 0 release r[2] mask 0
17 ReleaseReg 3 1 0 0 release r[3] mask 0
18 Column 0 0 4 0 r[4]=t0.c0
19 Column 2 0 5 0 r[5]=t1.c0
20 Column 1 1 6 0 r[6]=t1.c1
21 ResultRow 4 3 0 0 output=r[4..6]
22 Next 2 11 1 0
23 Next 0 6 0 1
24 Halt 0 0 0 0
25 Transaction 1 0 3 0 1 usesStmtJournal=0
26 Goto 0 1 0 0
(9) By Dan Kennedy (dan) on 2021-04-08 17:15:44 in reply to 8.1 [link] [source]
How did you patch the SQLite code to get it to do this?
(10.2) By wulian on 2021-04-10 15:31:39 edited from 10.1 in reply to 9 [link] [source]
I insert two lines of code into the function whereLoopAddBtreeIndex
before it invokes the function ApplyCostMultiplier
, which reduces the cost when the index t1_c0 is used to optimize the constraint on t1.c1. Therefore, SQLite will use index t1_c0 to optimize the constraints on t1.c1.
function whereLoopAddBtreeIndex()
{
...
//the following two lines are add by myself
if(pProbe->aiColumn[saved_nEq] != pTerm->u.x.leftColumn)
pNew->rRun = sqlite3LogEstAdd(pNew->rRun -16, pNew->nOut - 16);
ApplyCostMultiplier(pNew->rRun, pProbe->pTable->costMult);
nOutUnadjusted = pNew->nOut;
pNew->rRun += nInMul + nIn;
pNew->nOut += nInMul + nIn;
whereLoopOutputAdjust(pBuilder->pWC, pNew, rSize);
rc = whereLoopInsert(pBuilder, pNew);
...
}
For WHERE clause "x=y and x=1", it is reasonable to use the index on "y" for constraints on x, since "x=y and x=1" is equivalent with "x=y and y=1". SQLite seems to transform "x=1" into "y=1"and keep "x=y" unchanged. However, as for WHERE clause "x=y, x=z", changing "x=z" to "x=y" and keeping "x=y" unchanged result in incorrect optimization("x=y and x=y"). Looking forward to your early reply!
(11) By anonymous on 2021-04-13 19:24:12 in reply to 10.2 [link] [source]
Can anyone construct a set of statements to trigger the incorrect query plan?