Virtual Table Joins with Repeated calls to xFilter

(1) By Alex Rothberg (agrothberg) on 2022-04-06 16:01:41 [source]

I have two virtual tables, O and C and I am attempting to join them as such:

select 
    o.[abc]
from 
    O o join C c
on
    c.[id] = o.[c_id]

I am observing poor performance where sqlite repeatedly calls xFilter (with no constraints used) on one of the two tables for each row in the other. Note, I have built my xBestIndex to indicate that there is no advantage to using an index, i.e. I do want to perform the full table (sequential) scan. That being said, I would expect that sqlite would call the xFilter for each table once and then perform the repeated full table scans in memory rather than re-calling xFilter each time it needs to scan it. Or at the very least that there would be some option to tell sqlite that the fixed cost of calling xFilter is expensive relative manipulating the data in memory on the sqlite side.

Am I doing something wrong here / is there a way to address the performance issue?

(2) By Gunter Hick (gunter_hick) on 2022-04-07 05:27:47 in reply to 1 [link] [source]

There should be 2 calls each to xBestIndex, one for full table scan (no constraints) and one for the join constraint. I assume "no advantage to using an index" to mean that the cost you return is the same with out without constraint.

This means SQLite is going to performa full table scan of one table (lets assume o) in the outer loop and a full table scan of the other table in the inner loop. Calling xFilter on the inner table is the required method of resetting the cursor on the inner table.

You are not doing anything wrong, except forcing SQLite into a slow query plan by providing only slow access to the data and then complaining about it.

Making table c accessible by what already looks like a record number would be one way. If c.id is actually a record number in the external store, there should be a way to retrieve a record quickly. You would have to check for the (id,EQ) constraint in your xBestIndex function and use the xFilter argument to retrieve exactly that record in xFilter.

Another would be (since you are not using any fields of c other than id) to reformulate as SELECT abc FROM O where c_id IN (SELECT id FROM C); which would allow SQLite to determine that only one scan of C is required-

Another would be to specify a materialised CTE holding just the c.id field in either of the JOIN vs IN SUBSELECT variants.

(3) By Alex Rothberg (agrothberg) on 2022-04-07 15:03:55 in reply to 2 [link] [source]

There should be 2 calls each to xBestIndex, one for full table scan (no constraints) and one for the join constraint.

Yep, I see two calls to xBestIndex per table, one for each table w/ no constraints and once per table with the join constraint.

I assume "no advantage to using an index" to mean that the cost you return is the same with out without constraint.

Correct

You are not doing anything wrong, except forcing SQLite into a slow query plan by providing only slow access to the data and then complaining about it.

Well, yes / no. I actually want sqlite to perform a table scan rather than attempting row by row access. The backing store is an API call so the cost of fetching one row is almost the same as fetching the entire table.

Calling xFilter on the inner table is the required method of resetting the cursor on the inner table.

So this might be the root of my problem / misunderstanding. As the engine advances one row over the outer loop / table (in your example table o) and has to re-scan the inner table, table c in this case, I would like to avoid having the engine once again call xFilter because, as mentioned above, that will result in another call to the API. I could add my own caching layer but thought that might be handled automatically by sqlite and held just for the lifetime of the query.

Making table c accessible by what already looks like a record number would be one way. If c.id is actually a record number in the external store, there should be a way to retrieve a record quickly. You would have to check for the (id,EQ) constraint in your xBestIndex function and use the xFilter argument to retrieve exactly that record in xFilter.

Yes I already have this exact logic ;-) but as mentioned the data fetch cost of one fetching N rows is almost N times worse than just fetching the full table once.

Another would be (since you are not using any fields of c other than id) to reformulate as SELECT abc FROM O where c_id IN (SELECT id FROM C); which would allow SQLite to determine that only one scan of C is required-

Another would be to specify a materialised CTE holding just the c.id field in either of the JOIN vs IN SUBSELECT variants.

I tried to modify the query as:

with CC as materialized (select * from C), OO as materialized (select * from O)

select 
    o.[abc]
from 
    OO o join CC c
on
    c.[id] = o.[c_id]

but that did not seem to change the issue I was seeing.

I guess my question is is there anyway when returning the estimatedCost or setting up the Virtual Table to indicate to sqlite that for the lifetime of the query, the virtual table should be "materialized"?

(4) By SeverKetor on 2022-04-07 15:17:52 in reply to 3 [link] [source]

There was actually just recently some posts about materializing CTEs. Try something like with CC as materialized (select * from C order by +id), OO as materialized (select * from O order by +c_id) (the "+" might be overkill though). I'll make no promises but it might work how you want it to.

(5) By Gunter Hick (gunter_hick) on 2022-04-07 15:34:19 in reply to 3 [link] [source]

As I already mentioned, since you are only checking if the RHS table row exists, use

SELECT abc FROM o where c_id IN (SELECT id from C);

this should allow SQLite to perform the subquery exactly once.

(6) By Alex Rothberg (agrothberg) on 2022-04-07 16:07:30 in reply to 5 [link] [source]

I appreciate the guidance on how to reformulate the join for this specific circumstance. That being said, this was a minimum example that I had stripped down for the purpose of identifying the issue and then posting here. Ideally I can find a solution that is less fragile to the formulation of the query. Further, there are some limitations with the sub-select vs the join, such as not being able to pull any fields off the RHS table.

All that being said, the sub-select actually doesn't "work" (unless I tinker with the estimatedCosts). It looks like sqlite makes one call to the xFilter of C but then N calls xFilter for O (using the join constraint). This happens if I set the estimatedCost of w/ and w/o using join constraint to be the same or if I set the estimatedCost of w/ to be lower. This all "works" (where works is just two total calls to xFilter, one for each table), if I set the estimatedCost for the non constraint to be lower than w/ a constraint. Perhaps this is a lack of my understanding of this all works, but I would think the query planner would do something like and then choose which of these options has the lowest cost:

total cost filtering w/ sqlite = (estimatedCost w/o constraint) + (cost to filter in memory)
total cost of filtering w/ virtual table = estimatedRowCount * (estimateCost w/ constraint)

however that doesn't seem to be the case here.

(12) By Clemens (clemens) on 2024-02-22 19:32:27 in reply to 2 [link] [source]

I have a very similar problem for the same join query. In my case sqlite does two xBestIndex calls:

a) with a single unusable EQ constraint (for fetching the full table)
b) with a single usable EQ constraint (for fetching a single row from the join)

For case b) it would be nice to have an option to tell sqlite that I can provide a batch of rows for the join. For example, setting a batchSize = 500 value in the sqlite3_index_info struct would mean that I can provide 500 rows. In the filter method I then get an arg array with up to 500 ids that need to be fetched...

(13) By Gunter Hick (gunter_hick) on 2024-02-23 06:19:34 in reply to 12 [link] [source]

Did you read https://sqlite.org/c3ref/vtab_in.html and https://sqlite.org/c3ref/vtab_in_first.html ?
Does that not do exactly what you want?

(14) By Clemens (clemens) on 2024-02-26 23:22:48 in reply to 13 [link] [source]

Thank you that look good! Unfortunately, sqlite3_vtab_in(.., .., -1) always returns false for my simple join query. Is it supposed to work for joins? or does in only works for actual IN operators?

(15) By Gunter Hick (gunter_hick) on 2024-02-27 06:31:08 in reply to 14 [link] [source]

From the language in the docs you would have to reformulate your query to actually use IN (<subquery>); then SQLite can determine if it can produce the subquery all at once, which is a prerequisite for bulk IN processing.

(16) By Clemens (clemens) on 2024-02-28 02:20:14 in reply to 15 [link] [source]

Thank you for the clarification. However, as Alex Rothberg mentioned as well, I don't think I can assume that the user will use an IN query instead of using a JOIN...

(17) By Gunter Hick (gunter_hick) on 2024-02-28 06:37:51 in reply to 16 [link] [source]

Another possibility would be for your virtual table to cache the contents from the backing store. Use the xOpen call to read the backing store, discard cache on xClose. This allows you to handle JOIN too.

(18) By Clemens (clemens) on 2024-02-29 01:22:24 in reply to 17 [link] [source]

A bit more regarding my use-case: The GraphQL query language is sometimes a bit limited as a query language and I had the (fun) idea to query GraphQL data through SQLite so that I can do more complex queries and bring data into the right shape using SQL.

With this context, caching the dataset is something I like to avoid if possible because I don't want to always fetch the whole dataset over the network. Similarly, fetching rows one by one also takes too much time. Ideally I want to fetch all required rows in a single request.

Besides that I'm probably abusing SQLite vtables here. Is there a technical reason why there couldn't be a similar interface for JOIN as there is for IN expressions?

(19) By Gunter Hick (gunter_hick) on 2024-02-29 06:45:15 in reply to 18 [link] [source]

With IN (<list>) or IN (<subselect>) or even IN (<cte>), it is possible to generate the set of constraint tuples in advance; thus SQLite can provide the set for IN processing.

With JOIN, the set of constraint tuples for the RHS table (processed in the inner loop) is generally produced one after each; this is much more efficient for queries that do not require the full result set and does not slow down queries that do.

We have found here that querying Oracle over the network is much faster if we specify batch transfer (e.g. 1000 records at a time) instead of single records because of network round trip delays.

Virtual tables are designed for integration of non-native data stores into SQLite; there is no "abuse" in interfacing with GraphQL.

LAN speeds are now in the Gigabit range; if you are still concerned about saturating the network, then maybe fetch a bunch of records from GraphQL and hope for many cache hits, fetching a bunch more on cache misses. You would have to determine the optimum batch size and caching strategy depending on your workload.

If you expect to hit most the records at most once, then cache hits should be expunged; if you expect an unknown subset of records to be hit multiple times, an LRU algorithm may be more appropriate.

(7) By Bjoern Hoehrmann (bjoern) on 2022-04-08 17:57:00 in reply to 1 [link] [source]

An option to tell SQLite to materialize the whole virtual table (keyed on the used constraints) would be great.

I have yet to check whether the new sqlite3_vtab_in family of functions can be used here. It's supposed to give you all the values for a constraint at once, but I do not know whether it would process suitable JOINs as IN constraints.

You could of course create a table with all the data in an in-memory database and proxy from there.

(8) By anonymous on 2022-04-09 00:21:41 in reply to 7 [link] [source]

An option to tell SQLite to materialize the whole virtual table (keyed on the used constraints) would be great.

Yes, possibly to add SQLITE_INDEX_SCAN_MATERIALIZED to indicate that it should do that.

You could of course create a table with all the data in an in-memory database

Something else I had done in one program is that the xOpen method will create a temporary table with the same name as the virtual table, and will then fill in the data and then returns SQLITE_SCHEMA, and then it will access that temporary table instead. If the data is invalid, it will drop the table so that it will automatically be reloaded when it is needed.

(9) By Alex Rothberg (agrothberg) on 2022-04-18 20:07:43 in reply to 8 [link] [source]

An option to tell SQLite to materialize the whole virtual table (keyed on the used constraints) would be great.

Yes, possibly to add SQLITE_INDEX_SCAN_MATERIALIZED to indicate that it should do that.

What would be the lifetime of this materialized table? Would it just be for the lifetime of the query or would it last beyond the query? The former is good enough for my use case.

I am new to these forums. How would I go about filing a feature request for this?

(10) By anonymous on 2022-04-20 01:06:52 in reply to 9 [link] [source]

What would be the lifetime of this materialized table? Would it just be for the lifetime of the query or would it last beyond the query? The former is good enough for my use case.

I should think that it should be for the lifetime of the query only (from when it is first stepped until it is reset); lasting beyond that is sometimes inappropriate anyways.

How would I go about filing a feature request for this?

Unfortunately, I do not really know.

(11) By Bjoern Hoehrmann (bjoern) on 2023-03-02 15:53:48 in reply to 1 [link] [source]

Another option here might be to return SQLITE_CONSTRAINT from xBestIndex whenever it asks for an estimate with usable constraints. That would assume SQLite will always try a plan without constraints (and the documentation does not promise that), but if it does, it would have to pick the best plan among those where it does not filter the virtual table (which might be different from picking a plan where filtering the virtual table seems possible).