WHERE NOT IN query plan optimization

(1) By ddevienne on 2022-06-27 09:54:00 [link] [source]

In the context of correcting anomalies in the data of a legacy system,
we run many queries like this:

DELETE FROM ChildTab
WHERE ParentId NOT IN (
        SELECT DISTINCT Id FROM ParentTab
      )

Where Id is not indexed (because that's not the original DB, but an index-less CTAS copy).

I'm looking at the query plan w/ and w/o DISTINCT, with it, that's:

id	parent	notused	detail
3	0	0	SCAN TABLE ChildTab
8	0	0	LIST SUBQUERY 1
11	8	0	SCAN TABLE ParentTab
20	8	0	USE TEMP B-TREE FOR DISTINCT

and w/o that's:

id	parent	notused	detail
3	0	0	SCAN TABLE ChildTab
8	0	0	LIST SUBQUERY 1
10	8	0	SCAN TABLE ParentTab

My first question is whether in the DISTINCT case, it uses the TEMP B-TREE for the WHERE NOT IN as well?
The plan output makes it look like it only uses it for the DISTINCT, not realizing it would also speed-up the WHERE NOT IN, no?

My second question is whether there's a way to force SQLite to use a temporary index to guarantee such DML will never be too slow, if either or both tables are large, to avoid the quadratic run-away behavior?

We could also manually create a temporary index ourselves, as we do in other places, but then that's more code, so if SQLite can do it implicitly somehow (or with a SQL-level hint, not code), that would be better.

Any advice on this? Thanks, --DD

(2) By Dan Kennedy (dan) on 2022-06-27 15:11:13 in reply to 1 [source]

My first question is whether in the DISTINCT case, it uses the TEMP B-TREE for the WHERE NOT IN as well? The plan output makes it look like it only uses it for the DISTINCT, not realizing it would also speed-up the WHERE NOT IN, no?

More detail is available from the regular "EXPLAIN" command. It can seem a bit cryptic at first though.

This time, in both cases the results of the sub-select are put into a temporary b-tree. Then for each row of ChildTab, SQLite does a lookup on the temporary b-tree to see if the NOT IN constraint is satisfied.

If there is a DISTINCT in the sub-query, then SQLite uses a second temporary b-tree to filter out duplicates before inserting them into the first temp b-tree (the one in the above paragraph). Which is pointless - duplicates will be filtered as they are inserted anyway.

So if there is a problem here, it's that SQLite is failing to see that the extra processing performed for DISTINCT is redundant.

Dan.

(3) By ddevienne on 2022-06-27 15:36:29 in reply to 2 [link] [source]

Thank you for taking the time to reply Dan.

It can seem a bit cryptic at first though.

I have to confess to not being able to read those indeed...

in both cases the results of the sub-select are put into a temporary b-tree

Good to know. So quadratic behavior is avoided, in all cases, great.

SQLite is failing to see that the extra processing performed for DISTINCT is redundant

Maybe one day it will gain that optimization.
Here it sounds like we just shouldn't use DISTINCT ourselves.

On a last note, I hope the more human friendly explain query plan can one day make it more explicit that the NOT IN uses a temporary B-TREE too, not just the DISTINCT.

Thanks again, --DD