Avoiding recursive CTE materialisation

(1) By Pepijn Van Eeckhoudt (pepijnve) on 2021-08-15 11:06:15 [link] [source]

In a database with a simple adjacency list style table named export

CREATE TABLE IF NOT EXISTS "export"
(
    "id" INTEGER NOT NULL,
    "parentId" INTEGER NOT NULL,
    "name" TEXT NOT NULL,
    PRIMARY KEY ("id")
);

I'm trying to write a query that starts at a given entry and walks up the tree until it finds an entry that matches a certain condition. One example I'm experimenting with is this query

WITH RECURSIVE
    parents(id) AS NOT MATERIALIZED (
        SELECT parentId FROM export WHERE export.id = 60363923
        UNION ALL
        SELECT export.parentId FROM export JOIN parents ON export.id = parents.id
    )
SELECT export.id, export.parentId FROM parents JOIN export ON parents.id = export.parentId WHERE export.id > 60363923 LIMIT 1;

even though I suggested the CTE not be materialized the query plan I'm getting is

QUERY PLAN
|--MATERIALIZE parents
|  |--SETUP
|  |  `--SEARCH export USING INTEGER PRIMARY KEY (rowid=?)
|  `--RECURSIVE STEP
|     |--SCAN parents
|     `--SEARCH export USING INTEGER PRIMARY KEY (rowid=?)
|--SCAN parents
`--SEARCH export USING COVERING INDEX export_parent_id_idx (parentId=? AND rowid>?)

I'm trying to understand why the query planner decides to materialize parents rather than using a coroutine and requesting one row at a time from the coroutine. Is there a way to restructure this kind of query so that the CTE is lazily evaluated rather than eager?

(2.1) By Keith Medcalf (kmedcalf) on 2021-08-15 17:30:25 edited from 2.0 in reply to 1 [link] [source]

Re-phrase the query so as to store the results you want in the CTE and avoid redundant searches.

WITH parents(pid, cid)
  AS (
         SELECT parentId,
                id
           FROM export
          WHERE export.id = 60363923
      UNION ALL
         SELECT export.parentId,
                export.id
           FROM parents, export
          WHERE export.id = parents.pid
     )
SELECT cid,
       pid
  FROM parents
 WHERE cid > 60363923
 LIMIT 1
;

which has plan:

QUERY PLAN
|--CO-ROUTINE parents
|  |--SETUP
|  |  `--SEARCH export USING INTEGER PRIMARY KEY (rowid=?) (~1 row)
|  `--RECURSIVE STEP
|     |--SCAN parents (~1048576 rows)
|     `--SEARCH export USING INTEGER PRIMARY KEY (rowid=?) (~1 row)
`--SCAN parents (~983040 rows)

(3) By Pepijn Van Eeckhoudt (pepijnve) on 2021-08-16 10:22:25 in reply to 2.1 [link] [source]

Thanks for the answer. While this does indeed avoid materialising the CTE entirely up front, based on my reading of the VDBE trace it seems quite a bit less efficient.

If I'm interpreting the traces correctly, each recursive step gets executed in its entirety. That means that, in this example, all children with the parentId 55115879 (which is the parent id of the row with id 60363923) are first added to the ephemeral table. Once that's done then the cid > 60363923 clause is evaluated against the first row of the ephemeral table. Then the recursion step is executed with this row which ends up selecting the same rows as the first recursion step since the the parentId is identical. And so on... If the result is not found in the first parent, then I think this risks recursing infinitely. Solveable by using union instead of union all, but that will probably only add more overhead.

It remained puzzling to me why the query planner would materialise a CTE that's the outermost loop in the join, is used once in the query, and is side effect free. To make sure the loops were being nested the way I expected them to be, I tried forcing the order using CROSS JOIN and that produces exactly the VDBE trace I was hoping to get.

That leaves me with the question why the query planner decided to use the for loop nesting order it did. Is there any way to get debug/trace output for the decisions the query planner makes when choosing between various options?

Perhaps not the most useful metric, but just FYI the vdbe traces give me these results:

original query: 102 opcodes
suggested alternative: 2310 opcodes
original with cross join: 42 opcodes

(4) By David Raymond (dvdraymond) on 2021-08-16 12:49:16 in reply to 1 [link] [source]

Perhaps something like this? (Haven't tested so might not work)

with recursive parents (id, parentId, name, found_what_i_want) as (
    select *, false from export where id = 60363923
    union all
    select export.*, export.id > 60363923 as found_what_i_want
    from export inner join parents on export.id = parents.parentId
    where not found_what_i_want
)
select id, parentId, name from parents where found_what_i_want;

(5) By Keith Medcalf (kmedcalf) on 2021-08-17 03:47:22 in reply to 4 [source]

That will not work because, while it will stop searching the branch of the tree beyond the desired termination condition, other branches not containing the desired termination condition (found_what_I_want) will be searched to exhaustion. Only after a search of every branch has either found the terminating condition or been exhausted will the processing end. Basically this does something less than a full exhaustive search, but not by much.

If the CTE were not materialized then the recursive query would run returning one row by each for inspection for the terminating condition which, if found, would terminate the entire statement processing (all branches, not just the one containing the desired condition).

(6) By Pepijn Van Eeckhoudt (pepijnve) on 2021-08-17 08:43:40 in reply to 5 [link] [source]

If the CTE were not materialized then the recursive query would run returning one row by each for inspection for the terminating condition which, if found, would terminate the entire statement processing (all branches, not just the one containing the desired condition).

That's indeed precisely the intention. My original question was not really about query correctness. I was trying to figure out how I can get the engine to execute it as efficiently as possible. Some more background information might clarify things.

In this particular database, the rows are inserted in bulk in a controlled order and the database is read-only after that. The entries are added in depth-first order where each entry is assigned an incrementing id on insertion. This has some interesting properties:

For each parent P of any entry E, P.id < E.id
For each disjoint subtree with root entries S1 and S2, if S1.id < S2.id then for each child C1 of S1 C1.id < S2.id.

These properties let you perform nested set style queries on the dataset. The lower bound value for a subtree with root R is R.id itself. The query from my question is intended to find the upper bound value. It visits each parent P of R from deep to shallow and queries for the smallest id value of an entry E where E.parentId = P.id and E.id > R.id. If there is no such entry, R is itself (or is a child of) the last entry in P and you need to inspect the next parent. Once you have the lower and upper bounds, selecting an entire subtree can be done using where id > lower and id < upper without requiring recursion.

Why do I want to do this? 99% of the queries I'll be doing are single level queries for the direct children or direct parent of an entry. The adjacency list is the most practical for those purposes. For those cases where an entire subtree needs to be queried, I can do a relatively cheap recursive query which lets me do the subtree query using a simple index scan instead of a recursive query.