SQLite Forum

Handling of DISTINCT in virtual tables?
Login

Handling of DISTINCT in virtual tables?

(1) By Joshua Randall (jrandall) on 2020-07-27 22:32:54 [source]

There does not appear to be a way for a virtual table to know that it is handling a DISTINCT query so that it can avoid returning duplicate entries, unless I have missed something in the interface. Of course, DISTINCT queries do work against virtual tables, but as far as I know, the virtual table has to generate all rows and the filtering of duplicate rows is handled by the sqlite core.

We have found that our virtual table implementations sometimes have a backing index available such that when the index covers all selected columns a DISTINCT query could easily be answered using an index-only scan. However, since there is no interface for the virtual table to find out that the query is for DISTINCT rows, we cannot optimise for this case and instead the virtual table implementation must generate many duplicate rows which are then subsequently deduplicated by the core.

In an extreme version of this, we have some virtual tables in which the set of valid values for a column are intrinsic to the vtable implementation itself (i.e. they are essentially hard-coded as enums that the implementation supports).

For example, we have defined a structured ID that is encoded as a 64-bit integer in which the high three digits represent a type ID, the middle 10 digits encode some type-specific meaning, and the low 6 digits are a generic ID. The type-encodings have canonical string representations such as "TypeA", "TypeB", etc, and we have implemented an id_mapping virtual table that can provide mappings from the IDs to its various constituent parts (and vice-versa). It has a column for the 64-bit ID, a column for the type as an int, a column for type_name as text, a column for the type-specific part, a column for the generic ID, and a column for the generic ID name which is a concatenation of the type name and the generic ID (e.g. "TypeA123456"). The id_mapping vtable theoretically has 2^63 rows, so a full table scan is not practical as it would take on the order of a century to run on today's hardware. This mapping table is nonetheless useful when constraints are set for the ID (in which case all other fields can be populated from it) or when enough of the other fields are set that a reasonably sized set of possible IDs can be generated from it. However, it is essentially impossible for this interface to support queries to indicate what types it supports as values for the type name. I wish that we could simply do SELECT DISTINCT type_name FROM id_mapping and get back ["TypeA", "TypeB"] but as it currently stands we expect that query should take at least 7 months to complete (as it has to iterate through all 2*10^16 currently valid rows, although this will get worse as support for additional types is added).

If sqlite core could indicate to our virtual table's xBestIndex method that distinct results for a set of columns are being requested, we could answer that query in nanoseconds by simply iterating over our enum values.

If I have missed some way of doing this, it would be great to know what it is!

Otherwise, might it be considered as a possible extension to the sqlite3_index_info interface to xBestIndex? It may be that all that would be required from the virtual table point of view would be some flag to indicate that the query is DISTINCT instead of ALL, if the colUsed mask can be relied on to indicate what columns are selected (its documentation says that it indicates which columns "may be required" which I guess may be to avoid dealing with complex SELECT expressions such as CASE..WHEN?).

It would probably be difficult to get the sqlite query engine to pass distinct to a virtual table in every complicated instance (e.g. involving joins, conditionals, etc) where it should be safe to do so, but it seems like it ought to be relatively straightforward to identify simple situations when it can definitely be safely passed (such as when all of the columns selected in a query are from the same virtual table.

(2) By anonymous on 2020-07-28 02:17:58 in reply to 1 [link] [source]

I also think that the virtual table mechanism could be greatly improved. Some of my things I wanted include:

  • Being able to consume LIMIT/OFFSET clauses (this is not always possible; the SQLite core will have to check if it is indeed the case)

  • Indexing on expressions, including use of non-deterministic functions (e.g. it may be able to implement ORDER BY RANDOM() efficiently)

  • "True" and "false" constraint types for virtual tables; for example, if you write WHERE X then the constraint type for the column X is the "true" constraint.

  • Support for batch updates; among other things, this can be used for implementing the truncate optimization for virtual tables

  • The possibility to configure a virtual table to minimize the colUsed values (although this would not be the default setting; the default setting would be what it currently does): if all uses are consumable (which implies that the actual value in that column isn't a result of the query, although just because that is the case doesn't necessarily mean all of its uses are consumable), then that bit will not be set. As an example, if a column name appears not in the result set but only in the ORDER BY clause, then it is consumable; if it appears only in a constraint like X = 1 then it is also consumable (but a constraint like X = Y isn't consumable, although if some of the stuff mentioned above would be implemented, then this also might be consumable in some cases)

  • Use declared UNIQUE constraints to optimize the query

  • A less klugy way to detect if the query is interrupted while the virtual table implementation (or possibly a user function) is executing, which might take a long time. This is easily enough to do in an application program, but if the function or virtual table comes from an extension, then I cannot figure out a way to do it other than preparing a query involving a recursive CTE and stepping it and checking the error code to see if it has been interrupted.

One thing I suggest is a flag input (there is already a flag output in the index info structure). Note that in this case of DISTINCT (and in other cases too), there may be cases where the SQLite core would not be able to effectively pass this to the virtual table implementation, so it has to keep track of that.