SQLite Forum

Query planner regression in recursive CTE after 3.44.0
Login

Query planner regression in recursive CTE after 3.44.0

(1.1) By Paul (greenscape) on 2024-05-22 14:31:57 edited from 1.0 [link] [source]

I have located the version when my problem starts appearing: 3.44.0

I'm not certain in only affects recursive CTEs. It's just the problem that we've faced after an upgrade. For some reason, having a specific content in sqlite_stat1, the query planer decides that it's a good idea to re-index the whole table that has the covering index just for this exact key pair.

I was unable to create a synthetic set of data to reproduce the problem. As well as I'm unable to share my real-world data with you. Fortunately it reproduces by simply saving a specific content (copied from real-world case) into sqlite_stat1, without the need to have having any data in any other table.

Here are the steps to reproduce. Unfortunately, in-memory database won't do as it won't re-read the contents of sqlite_stat1 after an update. Therefore we have to create a fresh foo.sqlite3 database:

$ sqlite3 foo.sqlite3


CREATE TABLE `foo`(
    `id` INTEGER PRIMARY KEY
);

CREATE TABLE `child_of_parent`(
    `child_id` INTEGER NOT NULL, 
    `parent_id` INTEGER NOT NULL, 
    `relation` INTEGER NOT NULL, 
    PRIMARY KEY(`child_id`, `parent_id`, `relation`)
);

CREATE INDEX `idx__child_of_parent__parent_id__relation` ON `child_of_parent`(`parent_id`, `relation`);

ANALYZE;

REPLACE INTO sqlite_stat1(tbl, idx, stat) VALUES
('child_of_parent', 'idx__child_of_parent__parent_id__relation', '500000 29 29'),
('child_of_parent','sqlite_autoindex_child_of_parent_1','500000 1 1 1');

Now close the database and open it again. And then issue:

EXPLAIN QUERY PLAN WITH RECURSIVE `children`(`id`) AS (
    SELECT `child_id` AS `id`
    FROM `child_of_parent`
    WHERE `parent_id` = ?001
    AND `relation` = ?002
    UNION 
    SELECT `child_id` AS `id`
    FROM `child_of_parent` JOIN `children`
      ON `child_of_parent`.`parent_id` = `children`.`id`
      AND `relation` = ?002
) SELECT count(`id`) FROM `children`;

On version 3.43.2 I observe:

QUERY PLAN
|--CO-ROUTINE children
|  |--SETUP
|  |  `--SEARCH child_of_parent USING INDEX idx__child_of_parent__parent_id__relation (parent_id=? AND relation=?)
|  `--RECURSIVE STEP
|     |--SCAN children
|     `--SEARCH child_of_parent USING INDEX idx__child_of_parent__parent_id__relation (parent_id=? AND relation=?)
`--SCAN children

While starting with version 3.44.0 I observe:

QUERY PLAN
|--CO-ROUTINE children
|  |--SETUP
|  |  `--SEARCH child_of_parent USING INDEX idx__child_of_parent__parent_id__relation (parent_id=? AND relation=?)
|  `--RECURSIVE STEP
|     |--SCAN children
|     |--BLOOM FILTER ON child_of_parent (parent_id=? AND relation=?)
|     `--SEARCH child_of_parent USING AUTOMATIC PARTIAL COVERING INDEX (parent_id=? AND relation=?)
`--SCAN children

Which in my case is a matter of 1ms vs 700ms of runtime which is a killer of performance.

First obvious solution is to drop sqlite_stat1. But given how useful it was so far and how it gave the real measurable benefits I'm hesitant to choose this path forward. It's safer to ask for your help and stay with the old version in the meantime.


Best regards, Paul

(2) By Richard Hipp (drh) on 2024-05-22 16:22:13 in reply to 1.0 [source]

Executive Summary

You can work around the issue by setting PRAGMA automatic_index=OFF. We will work on fixing it during the next release cycle. The fix won't make 3.46.0.

Analysis

Here is a simplified demonstration of the problem:

CREATE TABLE t1(id INTEGER PRIMARY KEY);
CREATE TABLE t2(cid INT, pid INT, rx INT, PRIMARY KEY(cid, pid, rx));
CREATE INDEX x1 ON t2(pid, rx);
ANALYZE sqlite_schema;
REPLACE INTO sqlite_stat1(tbl, idx, stat) VALUES
  ('t2', 'x1', '500000 250 250'),
  ('t2','sqlite_autoindex_t2_1','500000 1 1 1');
ANALYZE sqlite_schema;
.eqp on
WITH RECURSIVE children(id) AS (
    SELECT cid FROM t2 WHERE pid = ?1 AND rx = ?2
    UNION 
    SELECT cid FROM t2 JOIN children ON t2.pid = children.id AND rx = ?2
) SELECT count(id) FROM children;

The big change here (apart from simplifying the prolix table and column names) is that I changed the stat1 data for the index from "500000 29 29" to "500000 250 250", thus making a bad index even worse. With that change, the problem goes back to version 3.8.8 in 2014. A tweak to query planner weights in 3.44.0 is what brought the issue to the surface in this case, but the underlying issue has been around for much longer than that. The root cause of the problem is that the query planner is being over-optimistic about the efficacy the automatic index.

We will not attempt to fix this in the forthcoming 3.46.0 release Because:

  1. It isn't a "bug". SQLite still gets the correct answer, just not as quickly as you would hope. This is an optimization opportunity.

  2. We ought not be messing with the query planner this close to a release because doing so will likely create true bugs and/or other and worse performance regressions.

  3. You can easily work around this problem by setting "PRAGMA automatic_index=OFF;".

We will work on a fix as soon as the 3.46.0 release comes out.

Additional Notes:

  • Check-in f911f1c4977fbcae (which first appeared in 3.44.0) is what surfaced the problem for the OP.

  • Backing out check-in c52f7971e90cac10 (circa 2014, first appearing in version 3.8.8) makes the problem go away. But doing so feels more like a hack than a proper fix, and might cause other unrelated performance regressions.

(3) By Paul (greenscape) on 2024-05-22 18:23:30 in reply to 2 [link] [source]

Dr. Hipp,

Thank you so much, for such a quick, extensive and thorough reply.

It's completely understandable, we're not in any rush. Especially given that there is a PRAGMA workaround.

You help is greatly appreciated!

Best regards,

Paul