Indexes: on FTS, compound or not, EAV/CR tables & trees

(1) By Vadim Goncharov (nuclight) on 2022-05-17 22:47:24 [source]

Continuing my prior post with thinking about CBOR and trees. I was investigating alternative approach of storing trees, with pure SQL and as less JSON/CBOR as possible, and came to following variants of storing multiple versions of JSON-like objects and MongoDB-like queries: https://github.com/nuclight/tkxgram/blob/master/docs/db.md which has example of objects to store¹. On thoughts and sample tables there, I have following questions:

Indexes on FTS tables. Hash indexes.

I suppose that I will need keeping text values not in leaf_values table, but in something like:
```
CREATE VIRTUAL TABLE alltexts USING fts4(rowid INTEGER PRIMARY KEY, SipHash INTEGER content TEXT);
CREATE INDEX texts_hash ON alltexts(SipHash);
```
so that I would keep only unique copies of each text, and I will check that by calculating hash of text - if it is different, I immediately know if I may continue INSERT, and need to check only dozens of rows instead of millions in case of collision.

Why hash? Because UNIQUE could be implemented by B-Tree only, and will mean unnecessary copy of each string, probably very long, while I want to gain as much space savings as possible (it's archive data), also utilizing compress.

But there is problem, documentation states:
1. As with all virtual table types, it is not possible to create indices or triggers attached to FTS tables. Nor is it possible to use the ALTER TABLE command to add extra columns to FTS tables (although it is possible to use ALTER TABLE to rename an FTS table).
2. Data-types specified as part of the "CREATE VIRTUAL TABLE" statement used to create an FTS table are ignored completely. Instead of the normal rules for applying type affinity to inserted values, all values inserted into FTS table columns (except the special rowid column) are converted to type TEXT before being stored.
So, how should I create such table and index (BTW, any plans for hash indexes in SQLite?), given my leaf_values table definition, probably with separate table for texts?
Index B-tree depth (disk format) and FTS token lengths.

My idea of paths inside objects was to save them as text strings which will be parsed on retrieval, and to search on them using FTS, again. Now I have dilemma, save paths directly as e.g.

message media photo

or as

aa bb cc

where these are base32-encoded identifier numbers from Attribute_name table, loosing simplicity, but these are shorter & less problems with tokenizer.

From SQLite on-disk format document, I see that keys (strings here) are placed in intermediate nodes, and thus longer strings would probably lead to less fan-out factor, giving more large B-tree -> slower.. So may be we can have conclusion "short strings are better" - but how many? Is it really worth that in practice?

Also, the FTS3 doc says that

merge... command ... causes SQLite to do a limited amount of work toward merging the various inverted index b-trees of an FTS3/4 table together into one large b-tree

which contradicts with logic "B-tree should be smaller & have less levels to work faster".

Who is right here / how it really works / what variant will be better for this "path tokens" task?
Compound indexes. Many cross self-joins.

See that I need EAV/CR model and, in particular, I will need queries against one large Leaf_values table of nature "find all objects WHERE attrib1 = value1 AND attrib2 = value2 AND attr3 > val3 ..." when there are just two columns (except ID and service) attribute and value - that is, the relational division operation. As far as I understand, this could be achieved by (dynamically created) SQL query where this table is cross-joined with itself multiple times (3 in example above), each turn checking next attribute on next-named alias of table. The first question (3.1) here - is SQLite query optimizer able to efficiently utilize indexes (suppose they are created properly) on many such self-cross-joins?

Or in other words, this could be viewed as intersection of results (ID) of many small queries, on one attribute each.

Here is tightly coupled question of compound indexes (3.2), as there are other, more traditional, tables also. Suppose I have table with several columns and I can make indexes on them like

CREATE INDEX t_idx ON table(col_a, col_b);

or two distinct indexes:

CREATE INDEX t_idxa ON table(col_a); CREATE INDEX t_idxb ON table(col_b);

Logically these seem equivalent, and in some simple cases I could infer from on-disk format that for queries like "a = val1 AND b > val2" compound index will be better and save some space. For other queries it's better to have separate indexes, but will SQLite optimizer be able to utilize those separarte indexes?

But on https://sqlite.org/queryplanner.html I read:

One can see how the OR-by-UNION technique could also be leveraged to use multiple indices on queries where the WHERE clause has terms connected by AND, by using an intersect operator in place of union. Many SQL database engines will do just that. But the performance gain over using just a single index is slight and so SQLite does not implement that technique at this time. However, a future version SQLite might be enhanced to support AND-by-INTERSECT.

I don't quite understand what this means, especially in application to my task (3.3).

The only way that SQLite can know that there are many duplicates in the left-most columns of an index is if the ANALYZE command has been run on the database. Without the results of ANALYZE, SQLite has to guess at the "shape" of the data in the table, and the default guess is that there are an average of 10 duplicates for every value in the left-most column of the index. Skip-scan only becomes profitable (it only gets to be faster than a full table scan) when the number of duplicates is about 18 or more. Hence, a skip-scan is never used on a database that has not been analyzed.

So if I give something like a CHECK is_array IN (0,1) like an example was there - it is not enough for optimizer, still ANALYZE need to be run? (3.4)

Each table in the FROM clause of a query can use at most one index (except when the OR-clause optimization comes into play)

Wait, REALLY ?! Then my questions 3.1 & 3.2 are meaningless now?.. Or does it count towards physical table or rather an alias in JOIN (so could be circumvented by aliasing in joins) ?

Please help in this adding NoSQL to SQLite :)

WBR, nuclight

^{^} in Perl notation, but it is similar to JSON so should be intuitive enough - the only probably unknown thing there is bless(struct, class) which means "make this struct instance of that class"