Virtual Table Joins with Repeated calls to xFilter
(1) By Alex Rothberg (agrothberg) on 2022-04-06 16:01:41
I have two virtual tables, O and C, and I am attempting to join them as such:
select
    o.[abc]
from
    O o join C c
on
    c.[id] = o.[c_id]
I am observing poor performance where sqlite repeatedly calls xFilter (with no constraints used) on one of the two tables, once for each row in the other. Note, I have built my xBestIndex to indicate that there is no advantage to using an index, i.e. I do want to perform the full table (sequential) scan. That being said, I would expect sqlite to call xFilter for each table once and then perform the repeated full table scans in memory, rather than re-calling xFilter each time it needs to scan the table. Or, at the very least, that there would be some option to tell sqlite that the fixed cost of calling xFilter is expensive relative to manipulating the data in memory on the sqlite side.
Am I doing something wrong here / is there a way to address the performance issue?
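For reference, the kind of xBestIndex I mean looks roughly like this (a simplified sketch with made-up numbers, not the actual code):

```c
#include "sqlite3.h"

/* Simplified sketch: report the same cost for every plan and never mark a
** constraint as used, so SQLite always plans a full scan of this table. */
static int flatCostBestIndex(sqlite3_vtab *pVtab, sqlite3_index_info *pInfo){
  /* pInfo->aConstraint is deliberately ignored: one API call fetches the
  ** whole table, so a constrained lookup is no cheaper than a full scan. */
  pInfo->idxNum = 0;                 /* full-scan "plan" */
  pInfo->estimatedCost = 1000000.0;  /* same with or without constraints */
  pInfo->estimatedRows = 10000;      /* rough guess at the table size */
  return SQLITE_OK;
}
```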
(2) By Gunter Hick (gunter_hick) on 2022-04-07 05:27:47 in reply to 1

There should be 2 calls each to xBestIndex, one for a full table scan (no constraints) and one for the join constraint. I assume "no advantage to using an index" means that the cost you return is the same with or without the constraint. This means SQLite is going to perform a full table scan of one table (let's assume o) in the outer loop and a full table scan of the other table in the inner loop. Calling xFilter on the inner table is the required method of resetting the cursor on the inner table.

You are not doing anything wrong, except forcing SQLite into a slow query plan by providing only slow access to the data and then complaining about it.

Making table c accessible by what already looks like a record number would be one way. If c.id is actually a record number in the external store, there should be a way to retrieve a record quickly. You would have to check for the (id,EQ) constraint in your xBestIndex function and use the xFilter argument to retrieve exactly that record in xFilter (as sketched below).

Another would be (since you are not using any fields of c other than id) to reformulate as SELECT abc FROM O WHERE c_id IN (SELECT id FROM C); which would allow SQLite to determine that only one scan of C is required.

Another would be to specify a materialised CTE holding just the c.id field in either of the JOIN vs IN SUBSELECT variants.
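In outline, checking for the (id,EQ) constraint would look something like this (a rough sketch only; column numbers, plan numbers, and the fetch helpers are hypothetical):

```c
#include "sqlite3.h"

#define COL_ID 0          /* hypothetical: "id" is column 0 of the vtab */
#define PLAN_FULL_SCAN 0
#define PLAN_ID_EQ     1

/* Hypothetical helpers standing in for the external data store. */
extern int fetch_single_record(sqlite3_vtab_cursor *pCur, sqlite3_int64 id);
extern int fetch_full_table(sqlite3_vtab_cursor *pCur);

static int idEqBestIndex(sqlite3_vtab *pVtab, sqlite3_index_info *p){
  int i;
  for(i=0; i<p->nConstraint; i++){
    if( p->aConstraint[i].usable
     && p->aConstraint[i].iColumn==COL_ID
     && p->aConstraint[i].op==SQLITE_INDEX_CONSTRAINT_EQ ){
      p->aConstraintUsage[i].argvIndex = 1;  /* pass the id value to xFilter */
      p->aConstraintUsage[i].omit = 1;       /* no need to re-check it */
      p->idxNum = PLAN_ID_EQ;
      p->estimatedCost = 10.0;               /* single-record lookup */
      p->estimatedRows = 1;
      return SQLITE_OK;
    }
  }
  p->idxNum = PLAN_FULL_SCAN;
  p->estimatedCost = 1000000.0;
  p->estimatedRows = 10000;
  return SQLITE_OK;
}

static int idEqFilter(sqlite3_vtab_cursor *pCur, int idxNum, const char *idxStr,
                      int argc, sqlite3_value **argv){
  if( idxNum==PLAN_ID_EQ && argc==1 ){
    /* exactly one lookup against the external store */
    return fetch_single_record(pCur, sqlite3_value_int64(argv[0]));
  }
  return fetch_full_table(pCur);
}
```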
(3) By Alex Rothberg (agrothberg) on 2022-04-07 15:03:55 in reply to 2

> There should be 2 calls each to xBestIndex, one for a full table scan (no constraints) and one for the join constraint.

Yep, I see two calls to xBestIndex per table: one per table with no constraints and one per table with the join constraint.
I assume "no advantage to using an index" to mean that the cost you return is the same with out without constraint.
Correct
> You are not doing anything wrong, except forcing SQLite into a slow query plan by providing only slow access to the data and then complaining about it.

Well, yes / no. I actually want sqlite to perform a table scan rather than attempting row-by-row access. The backing store is an API call, so the cost of fetching one row is almost the same as fetching the entire table.
> Calling xFilter on the inner table is the required method of resetting the cursor on the inner table.

So this might be the root of my problem / misunderstanding. As the engine advances one row in the outer loop (table o in your example) and has to re-scan the inner table (table c in this case), I would like to avoid having the engine once again call xFilter, because, as mentioned above, that will result in another call to the API. I could add my own caching layer, but I thought that might be handled automatically by sqlite and held just for the lifetime of the query.
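The kind of caching layer I have in mind would be roughly the following (a sketch only; the ApiRow type and api_fetch_all() helper are made up, and error handling is omitted):

```c
#include "sqlite3.h"

typedef struct ApiRow ApiRow;   /* hypothetical row type returned by the API layer */
extern int api_fetch_all(ApiRow **paRow);   /* hypothetical: one call fetches every row */

typedef struct CacheVtab {
  sqlite3_vtab base;
  int nRow;            /* number of cached rows, or -1 if not yet fetched */
  ApiRow *aRow;        /* rows fetched from the backing API */
} CacheVtab;

typedef struct CacheCursor {
  sqlite3_vtab_cursor base;
  int iRow;            /* current position within the cached rows */
} CacheCursor;

static int cachedFilter(sqlite3_vtab_cursor *pCur, int idxNum, const char *idxStr,
                        int argc, sqlite3_value **argv){
  CacheVtab *pTab = (CacheVtab*)pCur->pVtab;
  CacheCursor *pC = (CacheCursor*)pCur;
  if( pTab->nRow<0 ){
    /* First scan: one expensive API call fetches the whole table. */
    pTab->nRow = api_fetch_all(&pTab->aRow);
  }
  pC->iRow = 0;        /* later scans only rewind the in-memory copy */
  return SQLITE_OK;
}
```

The wrinkle is that a cache hung off the sqlite3_vtab object outlives the statement, so I would still have to decide when to invalidate it; that is exactly the part I was hoping sqlite could manage for me.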
> Making table c accessible by what already looks like a record number would be one way. If c.id is actually a record number in the external store, there should be a way to retrieve a record quickly. You would have to check for the (id,EQ) constraint in your xBestIndex function and use the xFilter argument to retrieve exactly that record in xFilter.
Yes, I already have this exact logic ;-) but, as mentioned, the cost of fetching N rows one at a time is almost N times worse than just fetching the full table once.
> Another would be (since you are not using any fields of c other than id) to reformulate as SELECT abc FROM O WHERE c_id IN (SELECT id FROM C); which would allow SQLite to determine that only one scan of C is required.
>
> Another would be to specify a materialised CTE holding just the c.id field in either of the JOIN vs IN SUBSELECT variants.
I tried to modify the query as:
with CC as materialized (select * from C), OO as materialized (select * from O)
select
    o.[abc]
from
    OO o join CC c
on
    c.[id] = o.[c_id]
but that did not seem to change the issue I was seeing.
I guess my question is: is there any way, when returning the estimatedCost or otherwise setting up the virtual table, to indicate to sqlite that, for the lifetime of the query, the virtual table should be "materialized"?
(4) By SeverKetor on 2022-04-07 15:17:52 in reply to 3

There were actually some recent posts about materializing CTEs. Try something like with CC as materialized (select * from C order by +id), OO as materialized (select * from O order by +c_id) (the "+" might be overkill, though). I'll make no promises, but it might work how you want it to.
(5) By Gunter Hick (gunter_hick) on 2022-04-07 15:34:19 in reply to 3

As I already mentioned, since you are only checking whether the RHS table row exists, use SELECT abc FROM O WHERE c_id IN (SELECT id FROM C); this should allow SQLite to perform the subquery exactly once.
(6) By Alex Rothberg (agrothberg) on 2022-04-07 16:07:30 in reply to 5

I appreciate the guidance on how to reformulate the join for this specific circumstance. That being said, this was a minimal example that I had stripped down for the purpose of identifying the issue and posting here. Ideally I can find a solution that is less fragile to the formulation of the query. Further, there are some limitations with the sub-select vs the join, such as not being able to pull any fields off the RHS table.

All that being said, the sub-select actually doesn't "work" (unless I tinker with the estimatedCosts). It looks like sqlite makes one call to the xFilter of C but then N calls to xFilter for O (using the join constraint). This happens if I set the estimatedCost with and without the join constraint to be the same, or if I set the estimatedCost with the constraint to be lower. This all "works" (where "works" means just two total calls to xFilter, one for each table) only if I set the estimatedCost for the non-constraint plan to be lower than with the constraint. Perhaps this is a gap in my understanding of how this all works, but I would have thought the query planner would compute something like the following and then choose whichever option has the lowest cost:
total cost of filtering with sqlite = (estimatedCost w/o constraint) + (cost to filter in memory)
total cost of filtering with the virtual table = estimatedRows * (estimatedCost w/ constraint)

However, that doesn't seem to be the case here.
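For what it's worth, the only cost assignment that produces two total xFilter calls looks roughly like this in my xBestIndex (sketch only; the flag and the numbers are made up):

```c
#include "sqlite3.h"

/* Sketch: the unconstrained full scan must be reported as strictly cheaper
** than the constrained per-row lookup, otherwise SQLite probes the vtab
** once per outer row. hasJoinConstraint is a hypothetical flag set while
** scanning aConstraint[]. */
static void setCosts(sqlite3_index_info *p, int hasJoinConstraint){
  if( hasJoinConstraint ){
    p->estimatedCost = 2000000.0;   /* one API round trip per probe */
    p->estimatedRows = 1;
  }else{
    p->estimatedCost = 1000000.0;   /* one API round trip for the whole table */
    p->estimatedRows = 10000;
  }
}
```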
(7) By Bjoern Hoehrmann (bjoern) on 2022-04-08 17:57:00 in reply to 1

An option to tell SQLite to materialize the whole virtual table (keyed on the used constraints) would be great.

I have yet to check whether the new sqlite3_vtab_in family of functions can be used here. It's supposed to give you all the values for a constraint at once, but I do not know whether it would process suitable JOINs as IN constraints.
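From the documentation, the intended usage of those interfaces (SQLite 3.38.0+) looks roughly like this (a sketch only; I have not tried it, and the plan number is made up):

```c
#include "sqlite3.h"

#define PLAN_ID_IN 2   /* hypothetical plan number */

/* xBestIndex excerpt: if the EQ constraint at index i is really an IN list,
** ask SQLite to hand the whole list to a single xFilter call. */
static void useInListIfPossible(sqlite3_index_info *p, int i){
  if( sqlite3_vtab_in(p, i, 1) ){      /* 1 = "process the IN list all at once" */
    p->aConstraintUsage[i].argvIndex = 1;
    p->idxNum = PLAN_ID_IN;
  }
}

/* xFilter excerpt: iterate every value of the IN list from a single argv[0]. */
static int scanInList(sqlite3_value *pList){
  sqlite3_value *pVal = 0;
  int rc;
  for(rc = sqlite3_vtab_in_first(pList, &pVal);
      rc==SQLITE_OK && pVal;
      rc = sqlite3_vtab_in_next(pList, &pVal)){
    /* look up sqlite3_value_int64(pVal) against the fetched data here */
  }
  return rc==SQLITE_DONE ? SQLITE_OK : rc;
}
```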
You could of course create a table with all the data in an in-memory database and proxy from there.
(8) By anonymous on 2022-04-09 00:21:41 in reply to 7

> An option to tell SQLite to materialize the whole virtual table (keyed on the used constraints) would be great.

Yes, possibly by adding SQLITE_INDEX_SCAN_MATERIALIZED to indicate that it should do that.
> You could of course create a table with all the data in an in-memory database

Something else I had done in one program is that the xOpen method creates a temporary table with the same name as the virtual table, fills in the data, and then returns SQLITE_SCHEMA, after which the (re-prepared) statement accesses that temporary table instead. If the data is invalid, it drops the table so that it will automatically be reloaded when it is needed.
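In outline (a very rough, untested sketch; the vtab structure fields and the copy helper are hypothetical, and error handling is omitted):

```c
#include "sqlite3.h"

typedef struct MyVtab {
  sqlite3_vtab base;
  sqlite3 *db;          /* assumed: db handle saved in xConnect */
  char *zName;          /* assumed: the virtual table's name */
} MyVtab;

extern void copy_api_data_into_temp_table(MyVtab *p);  /* hypothetical: INSERTs the data */

static int shadowOpen(sqlite3_vtab *pVtab, sqlite3_vtab_cursor **ppCur){
  MyVtab *p = (MyVtab*)pVtab;
  char *zSql = sqlite3_mprintf(
      "CREATE TEMP TABLE \"%w\"(id INTEGER, abc TEXT)", p->zName);
  sqlite3_exec(p->db, zSql, 0, 0, 0);
  sqlite3_free(zSql);

  copy_api_data_into_temp_table(p);

  /* Force a re-prepare; the re-prepared statement resolves the unqualified
  ** name to the TEMP table, which is searched before the vtab's schema. */
  return SQLITE_SCHEMA;
}
```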
(9) By Alex Rothberg (agrothberg) on 2022-04-18 20:07:43 in reply to 8

> An option to tell SQLite to materialize the whole virtual table (keyed on the used constraints) would be great.
>
> Yes, possibly by adding SQLITE_INDEX_SCAN_MATERIALIZED to indicate that it should do that.

What would be the lifetime of this materialized table? Would it exist just for the lifetime of the query, or would it last beyond the query? The former is good enough for my use case.

I am new to these forums. How would I go about filing a feature request for this?
(10) By anonymous on 2022-04-20 01:06:52 in reply to 9

> What would be the lifetime of this materialized table? Would it exist just for the lifetime of the query, or would it last beyond the query? The former is good enough for my use case.

I should think that it should be for the lifetime of the query only (from when it is first stepped until it is reset); lasting beyond that is sometimes inappropriate anyways.

> How would I go about filing a feature request for this?

Unfortunately, I do not really know.
(11) By Bjoern Hoehrmann (bjoern) on 2023-03-02 15:53:48 in reply to 1

Another option here might be to return SQLITE_CONSTRAINT from xBestIndex whenever it asks for an estimate with usable constraints. That assumes SQLite will always also try a plan without constraints (and the documentation does not promise that), but if it does, it would have to pick the best plan among those that do not filter the virtual table (which might be a different choice than when filtering the virtual table seems possible).
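A sketch of what I mean (untested):

```c
#include "sqlite3.h"

/* Reject any plan in which a constraint on this table is actually usable,
** so only unconstrained full-scan plans survive. Note that xBestIndex may
** only return SQLITE_CONSTRAINT when at least one constraint is usable. */
static int rejectConstrainedBestIndex(sqlite3_vtab *pVtab, sqlite3_index_info *p){
  int i;
  for(i=0; i<p->nConstraint; i++){
    if( p->aConstraint[i].usable ){
      return SQLITE_CONSTRAINT;   /* "this combination of inputs is unusable" */
    }
  }
  p->idxNum = 0;                  /* plain full scan */
  p->estimatedCost = 1000000.0;
  p->estimatedRows = 10000;
  return SQLITE_OK;
}
```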