SQLite Forum

Proposed non-standard enhancement to common table expressions
Login

Proposed non-standard enhancement to common table expressions

(1.1) By Richard Hipp (drh) on 2021-02-15 14:29:04 edited from 1.0 [link] [source]

It is proposed to enhance the common table expression mechanism of SQLite with non-standard syntax as follows:

WITH RECURSIVE cte-table-name GENERATED AS ( select-stmt ) ,
     linerad = 10px
     linewid *= 0.5 
     $h = 0.21
     # debug_label_color=red; color=lightgray; margin=1cm

C0:  circle radius 10%
     arrow 2*arrowht
     oval "WITH" fit
     arrow right linerad then up $h then right linerad+arrowht
     oval "RECURSIVE" fit
     line right linerad then down even with WITH then right linerad
     arrow linerad+arrowht
CTN: box "cte-table-name" fit
     arrow right linerad then up 0.9*$h then right linerad+arrowht color red
     oval "GENERATED" fit color red
     arrow right linerad then down even with CTN then right linerad+arrowht \
        color red
     oval "AS" fit
     arrow 2*arrowht
     oval "(" bold fit
     arrow same
     box "select-stmt" fit
     arrow same
RP:  oval ")" bold fit
     arrow
     circle same

     arrow from CTN.e then right even with GENERATED
     line right to AS.w

C1:  oval "," bold fit at (.5<CTN.w,RP.e>,1.25*$h below AS)
     arrow from RP.e right linerad then down even with C1 then to C1.e
     line from C1.w left even with linerad+arrowht west of CTN.w \
        then up even with CTN then right linerad

     line from WITH.e to arrowht west of CTN.w

The new GENERATED keyword, shown in red, would cause SQLite to implement the CTE more like PostgreSQL:

  1. The CTE is an "optimization barrier", which I understand to mean that terms in the WHERE clause of the CTE do not leak into the query that uses the CTE, and WHERE clause terms of the query using the CTE do not leak into the CTE.

  2. The CTE is guaranteed to be evaluated no more than once. The CTE might not be evaluated at all, or the evaluation might not run to completion. But there will not be multiple evaluations of the CTE.

In the current prototype implementation of this idea, the GENERATED keyword forces the CTE to be materialized. But this is not a requirement. A more refined implementation might also implement the CTE using a co-routine, if that is a possibility.

The GENERATED keyword is a query planner hint or constraint. An equivalent answer should be generated regardless of the presence or absence of the GENERATED keyword.

"GENERATED" is used for this syntax because "GENERATED" is already a keyword in standard SQL and in SQLite. The new syntax does not require a new keyword, therefore, but merely reuses an existing keyword for a new purpose.

Comments? Objections? Concerns?

(2) By anonymous on 2021-02-15 14:56:36 in reply to 1.1 [link] [source]

I want to try this. I am using a lot of CTE's in an application.

Where can I download the source to build the experimental SQLITE?

(3.1) By Ryan Smith (cuz) on 2021-02-15 14:58:02 edited from 3.0 in reply to 1.0 [link] [source]

Thank your for this!

There are very many examples of helpful replies to questions on this forum that teach the questioners to materialize views, sub-queries or CTE's with some tricks/hacks to get it to speed up.

This will replace those with a legitimate/intuitive keyword to force it for at least CTEs, which calls for applause.

I have one question based on your point 2: If leaf optimizations and WHERE clauses are barriered out, In what circumstance might the CTE not be evaluated at all?

I assume the simplified process is:
```

  • See if first table (FROM clause) produces a record:
    • If so, see if a secondary table (JOIN) does too, if this table is a CTE:
    • start evaluating it
    • If this is a generated CTE, build it and materialize it ``` In which case I assume if the first table does not produce a row, any JOINed CTE will never be evaluated and thus never be executed/generated/materialized.

That would also imply that I can force the materialization of a CTE by making it be the first table after the FROM keyword and also GENERATED.

If this is correct, I am overjoyed.
If I am wrong or if there are other reasons why a CTE won't be evaluated, kindly say. I'm not suggesting any change based on this, just wish to fully understand the method and perhaps offering a reason to make specific note of it in the documentation.

Why do I care? - I sometimes use functions or UDFs in CTE's that I expect to be called, or at least wish to know whether I CAN expect it to be called, and perhaps force it if needed.

(4) By Richard Hipp (drh) on 2021-02-15 15:23:47 in reply to 2 [link] [source]

Where can I download the source

Use on of the "Download" links near the top of https://sqlite.org/src/info/with-generated-as.

(5) By Richard Hipp (drh) on 2021-02-15 15:27:16 in reply to 3.1 [link] [source]

In what circumstance might the CTE not be evaluated at all?

In the following query, the "cte0" common table expression is never evaluated:

    WITH cte0(a) AS (VALUES(1))
    SELECT 2;

(6) By Richard Hipp (drh) on 2021-02-15 15:29:45 in reply to 3.1 [link] [source]

That would also imply that I can force the materialization of a CTE by making it be the first table after the FROM keyword and also GENERATED.

No. The query planner can still choose to implement the CTE using a co-routine.

The current prototype implementation does always materialize the CTE, but I'm working on "fixing" that now. The GENERATED keyword prevents the CTE from being evaluated more than once, but it does not necessarily materialize the CTE.

(7) By Richard Hipp (drh) on 2021-02-15 21:22:23 in reply to 1.1 [link] [source]

I now learn that PostgreSQL added the following syntax with their version 12:

WITH RECURSIVE cte-table-name AS MATERIALIZED ( select-stmt ) ,
     linerad = 10px
     linewid *= 0.5 
     $h = 0.21
     # debug_label_color=red; color=lightgray; margin=1cm

C0:  circle radius 10%
     arrow 2*arrowht
     oval "WITH" fit
     arrow right linerad then up $h then right linerad+arrowht
     oval "RECURSIVE" fit
     line right linerad then down even with WITH then right linerad
     arrow linerad+arrowht
CTN: box "cte-table-name" fit
     arrow 2*arrowht
     oval "AS" fit
     arrow right linerad then up 0.9*$h then right linerad+arrowht color red
     oval "MATERIALIZED" fit color red
     arrow right linerad then down even with CTN then right linerad+arrowht \
        color red
LP:  oval "(" bold fit
     arrow 2*arrowht
     box "select-stmt" fit
     arrow same
RP:  oval ")" bold fit
     arrow
     circle same

     arrow from AS.e then right even with MATERIALIZED
     line right to LP.w

C1:  oval "," bold fit at (.5<CTN.w,RP.e>,1.25*$h below AS)
     arrow from RP.e right linerad then down even with C1 then to C1.e
     line from C1.w left even with linerad+arrowht west of CTN.w \
        then up even with CTN then right linerad

     line from WITH.e to arrowht west of CTN.w

That means I'll probably go with the PG syntax, which means adding a new "MATERIALIZED" keyword. PG-12+ also allows for "NOT MATERIALIZED" which I will also consider supporting.

(8) By Ryan Smith (cuz) on 2021-02-15 21:51:22 in reply to 7 [link] [source]

Would NOT MATERIALIZED imply the same as the absence of MATERIALIZED or perhaps force the CTE to never be materialized and re-evaluate on every reference?

The latter might be useful for function-centric queries (as opposed to data-centric queries) in some way. Honestly though, I am having trouble seeing a real utility for it, but always looking to learn.

Maybe someone here in the know can offer some insight on why PostGres has this and where it is useful?

(9) By Ryan Smith (cuz) on 2021-02-15 22:11:44 in reply to 8 [link] [source]

After reading the documentation for Postgres' CTE implementation, it is clear to me that the actual function of NOT MATERIALIZED there is not to force recalculation always, but in fact to indicate to the Query Planner that it should use optimizing features for the CTE (evaluate only the needed rows) for every reference to it inside the main query, because in Postgres any CTE with multiple references in the main query is materialized.

I think this folded in optimization might be SQLite's default behavior even for multiple CTE references, unless the CTE is ordered, in which case materialization would still be the most optimized way to go (and be allowed notwithstanding the NOT MATERIALIZED hint).

(10) By Keith Medcalf (kmedcalf) on 2021-02-15 22:13:47 in reply to 7 [link] [source]

Do you intend "MATERIALIZED" to be more strict than "GENERATED"?

Personally, I think that there is a difference between the requirements given the plain meaning of the words used.

MATERIALIZED would imply that the cte must be "materialized" as spoken (ie, is an optimization barrier and requires that the cte be executed only once, whether the result is a materialized table or a co-routine is irrelevant -- perhaps a new type of co-routine which "caches" already generated rows as in a half-co-routine half-materialized beast might work).

Conversely, in the real world I would more think that the real-world common requirement is best described as "GENERATED [ONCE]" which would mean that the cte is required to be generated only once (whether materialized or by co-routine or by a new cached co-routine beast) and that optimization from the outer query into the CTE (like where clause pushdown) should be allowed.

(11) By Keith Medcalf (kmedcalf) on 2021-02-15 22:32:08 in reply to 10 [source]

To be clear, that means that the unspoken default would be GENERATED which would behave as CTE's do now.

GENERATED ONCE should allow all optimizations including query flattening and pushdown optimizations provided that the CTE is only ever executed ONCE.

MATERIALIZED would require that the CTE be executed only once as spoken.

So the hierarchy would be:

GENERATED
GENERATED ONCE
MATERIALIZED

(12) By Keith Medcalf (kmedcalf) on 2021-02-15 22:34:31 in reply to 11 [link] [source]

Perhaps ALWAYS can be used instead of ONCE ...

(13.1) By Richard Hipp (drh) on 2021-02-22 14:42:07 edited from 13.0 in reply to 1.1 [link] [source]

A new revision to this idea is syntax as follows:

WITH RECURSIVE cte-table-name AS NOT MATERIALIZED ( select-stmt ) MATERIALIZED ,
     linerad = 10px
     linewid *= 0.5 
     $h = 0.21
     # debug_label_color=red; color=lightgray; margin=1cm

C0:  circle radius 10%
     arrow 2*arrowht
     oval "WITH" fit
     arrow right linerad then up $h then right linerad+arrowht
     oval "RECURSIVE" fit
     line right linerad then down even with WITH then right linerad
     arrow linerad+arrowht
CTN: box "cte-table-name" fit
     arrow 2*arrowht
     oval "AS" fit
     arrow right linerad then up 2.25*$h then right linerad+arrowht color red
     oval "NOT" fit color red
     arrow 1.3*arrowht color red
M1:  oval "MATERIALIZED" fit color red
     arrow right linerad then down even with CTN then right linerad+arrowht color red
LP:  oval "(" bold fit
     arrow 2*arrowht
     box "select-stmt" fit
     arrow same
RP:  oval ")" bold fit
     arrow
     circle same

     arrow from AS.e then right linerad then up 0.95*$h then right linerad+arrowht color red
     oval "MATERIALIZED" fit color red
     arrow right even with M1.e color red
     line right linerad then down linerad color red

     arrow from AS.e then right even with MATERIALIZED
     line right to LP.w

C1:  oval "," bold fit at (.5<CTN.w,RP.e>,1.25*$h below AS)
     arrow from RP.e right linerad then down even with C1 then to C1.e
     line from C1.w left even with linerad+arrowht west of CTN.w \
        then up even with CTN then right linerad

     line from WITH.e to arrowht west of CTN.w

This is the syntax of PostgreSQL 12 and later. The semantics are as follows:

  1. If the MATERIALIZED keyword is present, then the push-down optimization is disabled and the CTE is materialized into an ephemeral table. The ephemerial table is then used for each occurrence of the "cte-table-name".

  2. If the NOT MATERIALIZED phrase is present, then each occurrence of the "cte-table-name" in the main query is replaced by a sub-query that is the "select-stmt". Various optimizations such as flattening and push-down may come into play. The query planner might still choose to materialize the subquery, for example if the subquery occurs in a context that requires it to be scanned multiple times. So "NOT MATERIALIZED" is not quite the correct meaning here. Really, this should be "AS IF THIS WHERE AN ORDINARY VIEW OR SUBQUERY". However, the "NOT MATERIALIZED" syntax is what PostgreSQL uses, so that is what SQLite should use.

  3. If neither hint is present, then the behavior is as MATERIALIZED if the CTE is used more than once, and as NOT MATERIALIZED if the CTE is used exactly once. This is different from legacy in which the behavior was always as NOT MATERIALIZED.

I believe that this is the same behavior as PostgreSQL. I have reached out to the PostgreSQL devs for confirmation, but have not yet heard back from them.