Documentation for where query optimizations & relation algebra occurs?

(1) By MJS (mjsmjs) on 2020-10-07 03:59:30 [link] [source]

I'm building a two-tier query engine system using SQLite. In this system I have a layer that receives the SQL query and does a naive pattern match on it and if it recognizes it, the query is handled by my layer, and if it does not, it forwards the SQL to SQLite. The underlying storage is a virtual table, not a real database. It is a file format for a system I'm working on.

I'm doing this both for fun and as a side-project so I'm trying to learn techniques as well.

For example, this query is something I recently handled at my layer and don't forward to SQLite:

SELECT SUM(col1), col2 FROM Table WHERE col3 = 100 AND col4 = 200 GROUP BY col2.

The code I generate for the above query is that I recognize (very hard-coded!) that col3 and col4 are different columns, I scan them in parallel to generate two lists of rowids. I then intersect these two lists and they are my final rowis I need to extract the value of col2 for and then group by them.

I do seem to beat SQLite, but this is just one query and I've spent a ton of time on it. And frankly, I'm not sure if this is the best way, perhaps there is a better way to optimize this query.

Therefore what I'm looking for is from SQLite that given all costs are equal (which probably is a naive idea, but humor me ...) can I ask it to give me the plans opcodes then I can see what I can generate code for?

If you're asking why I'm doing this? Education ... but also I believe the virtual table infrastructure is too inefficient (i.e. has too many callbacks) if you have to scan the entire table which is often the case in a group by query.

P.S. - In addition to everything I said, I would love if there was a query planner tool (even if it was outside SQLite) that could tell me all the possible relation alegbra ways a query could be accomplished.

(2) By Clemens Ladisch (cladisch) on 2020-10-07 14:28:09 in reply to 1 [source]

The output of the EXPLAIN command gives you the opcodes.

These internals can (and do) change in any new SQLite version, so this is probably not very useful.

(3) By Gunter Hick (gunter_hick) on 2020-10-09 08:26:02 in reply to 1 [link] [source]

My hunch is that you should be doing the analysis inside of your virtual table implementation.

The xBestIndex method will be called at least twice

-) Without usable constraints (i.e. full table scan). Return the number of rows in the table.

-) With the constraints ((col3, EQ), (col4, EQ)). Return the expected number of rows that will match both constraints. E.g. if there are 10 discrete values for each column, return number of rows divided by 100.

You need to encode whatever xBestIndex comes up with in the index number nd index string return values, and in a way that your xFilter function understands. You also need to tell SQLite, which constraints are used, their order in the argument list, and if you can guarantee them (so SQLite can omit the value check)

The xFilter method will be called with the index number and index string belonging to the query plan that SQLite selected for execution, plus the comparison values.

This is where you do the magic of obtaining a "short list" of the matching records which SQLite steps through by calling the xNext method.

This way, you are letting SQLite do the heavy lifting of parsing SQL and creating a query plan, while you can concentrate on getting 
the smallest number of rows to operate on in the fastest possible way.

BTW: For the query you show, xBestIndex arguments will include the request to ORDER BY col2 to discern between keeping a single running total per group (ordered presentation), or building a temporary btree (unordered presentation).