FTS5 tables, = vs. MATCH, index 0, and "error: no such column"
(1) By David Stein (sfsdfd) on 2020-10-10 14:03:41 [link] [source]
The SQLite FTS5 Extension page states as follows:
The following three queries are equivalent.
SELECT * FROM email WHERE email MATCH 'fts5';
SELECT * FROM email WHERE email = 'fts5';
SELECT * FROM email('fts5');
This statement is correct (well, mostly: see below). However, I have encountered quite serious performance issues with using the = operator on big tables (1,000,000 records or more) with unique values, where such issues do not apply to the MATCH operator.
Performance Issues
To illustrate (and validate my understanding of) the problem, I wrote some code that generates an FTS5 table ("test") with a single column ("content") and fills it with 10,000,000 records, each with a unique integer. It then executes 100 queries to search for one of the records using the MATCH operator and displays the average time. Finally, it performs the same test using the = operator.
import os, random, sqlite3, sys, time
filename = 'fts5_test.sqlite'
if os.path.isfile(filename):
os.unlink(filename)
db = sqlite3.connect('fts5_test.sqlite')
db.execute('create virtual table test using fts5(content)')
size = 10000000; batch_size = 100000; trials = 100
for i in range(0, size, batch_size):
values = tuple((i + j,) for j in range(batch_size))
db.executemany(f'insert into test values (?)', values)
avg_time = 0
for i in range(trials):
start = time.time()
db.execute("select 1 from test where content match ?", (random.randint(1, size),)).fetchall()
avg_time += time.time() - start
print(f'\MATCH: Trial {i + 1} - average time per lookup: {avg_time / (i + 1):<1.4} seconds', end='')
print('')
avg_time = 0
for i in range(trials):
start = time.time()
db.execute("select 1 from test where content = ?", (random.randint(1, size),)).fetchall()
avg_time += time.time() - start
print(f'\r=: Trial {i + 1} - average time per lookup: {avg_time / (i + 1):<1.4} seconds', end='')
print('')
Results:
MATCH: Trial 100 - average time per lookup: 6.254e-05 seconds
=: Trial 100 - average time per lookup: 1.599 seconds
The MATCH operator performs literally 25,000 times faster than the = operator. This is not consistent with my understanding of the term "equivalent."
More Detail
This difference is baffling, because if there is any performance difference, I would expect the exact opposite behavior: = is a simple value comparison that can likely be performed in O(1) with a hashcode lookup, while MATCH is a much more sophisticated and feature-rich text comparison.
However:
sqlite> explain query plan select * from test where content match -1;
QUERY PLAN
`--SCAN TABLE test VIRTUAL TABLE INDEX 1:
sqlite> explain query plan select * from test where content = -1;
QUERY PLAN
`--SCAN TABLE test VIRTUAL TABLE INDEX 0:
Apparently, SQLite says that it's using INDEX 0 for this query, but the performance issues suggest that it's just scanning the entire table. I can't tell whether "INDEX 0" is supposed to indicate an unindexed operation, or that SQLite thinks it's using an index, and is just using it very badly.
Use Case
I have a big data set with records that include both primitive values (date, int, simple strings, etc.) and text that I'd like to be text-searchable. I understand, from the FTS5 documentation, that an FTS5 table requires all columns to be FTS5-indexed. Therefore, for each record, I store the primitive values and a primary key in a "main" table, and I store the FTS5 values in an FTS5 table. I relate the tables by including a column in the FTS5 table that contains the primary key of the "main" table.
Storing the data works fine. However, a problem arises when recomposing the record (or, worse, lots of records) from the FTS5 table, because I need to index into the FTS5 table to find each record by the primary key:
sqlite> select * from fts5_table where primary_key = '12345';
As the table fills up, this query takes forever to execute. I first noticed the problem after creating some really big tables (400 gigabytes of text in the FTS5 table) and then trying to retrieve them - an operation that should have taken a few minutes was still running after 8 hours, and just reading, endlessly, from disk.
My solution was to replace the = operator with MATCH, and the problems resolved.
I will acknowledge that my use case is perhaps a bit off-label: that using an FTS5 index for 10,000,000 records to store a column with 10,000,000 small, primitive values is atypical. However, the objective of wanting to relate records in an FTS5 table to another table through a foreign-key relation is pretty standard. Also, this could be avoided by creating the FTS5 table with a non-FTS5 column that can be normally indexed and normally queried - but the limitation that FTS5 tables can ONLY have FTS5-indexed columns is apparently fixed and cannot be circumvented.
Additional Problems with MATCH and Negative Numbers ("error: no such column")
Actually, the general statement about = and MATCH being equivalent is also incorrect. I encountered new issues with tables for which the primary key is a negative number, such as a 64-bit signed hashcode. For instance:
sqlite> select * from test where content = 1; # OK
sqlite> select * from test where content = -1; # OK
sqlite> select * from test where content match 1; # OK
sqlite> select * from test where content match -1; # error: no such column: 1
The error message threw me for a loop, since the query isn't asking for "column 1," but for value -1 in column "test." As it happens, FTS5 has this goofy behavior of interpreting dashes in the fields of a MATCH query as a column filter. See the Stack article entitled: "sqlite full text search sqlite breaks when there is a dash in input" for more info. The apparent solution is to enclose the values in TWO quoted characters:
sqlite> select * from test where content match '-1'; # error: no such column: 1
sqlite> select * from test where content match "-1"; # error: no such column: 1
sqlite> select * from test where content match '"-1"'; # OK
...which actually works, despite being EXTREMELY janky. Note that the order matters - it must be double quotes INSIDE single quotes - and that the opposite order fails:
sqlite> select * from test where content match "'-1'"; # error: syntax error near "'"
This is an extremely janky solution, but it does work. Hope this is helpful to anybody who encounters the peculiar "error: no such column" message. (Also, the error message should be changed, since it is quite misleading and even surprising.)
(2) By Dan Kennedy (dan) on 2020-10-10 14:18:01 in reply to 1 [link] [source]
The "=" and MATCH operators are only equivalent if the LHS of the expression is the table name, not a column name, as it is in the examples.
The janky query syntax is documented here:
https://sqlite.org/fts5.html#full_text_query_syntax
Dan.
(3.1) By David Stein (sfsdfd) on 2020-10-10 21:25:15 edited from 3.0 in reply to 2 [source]
Deleted(4) By David Stein (sfsdfd) on 2020-10-10 21:27:17 in reply to 2 [link] [source]
Acknowledged. I still have two questions:
(1) What's the deal with INDEX 0? Does it mean unindexed operation, or does it mean that there IS an index, but it's just not working for some reason?
(2) I understand that the UNINDEXED option allows me to exclude a column from the FTS5 indexing - but would there be any way to index the column like a normal column in a non-FTS5 table? (Because that would solve a lot of problems!)
(5) By Dan Kennedy (dan) on 2020-10-12 16:30:23 in reply to 4 [link] [source]
If it just says "INDEX 0:", it means a linear scan of the entire table. But, for example, "INDEX 0:M1", means a full-text query. Technically the bit after "INDEX" is the idxNum value, a ":" character, and the idxStr value:
https://sqlite.org/vtab.html#outputs
http://www.sqlite.org/src/artifact/b4e4931c7fcc9?ln=545
You can mess around with external content tables to get a data structure that supports full-text queries and other indexes as well, but there's no way to just add an index to a column of an fts5 table.
https://sqlite.org/fts5.html#external_content_tables
In some cases you could add a MATCH clause to prune the number of rows scanned. e.g. change:
SELECT * FROM fts5_tab WHERE c='abc';
to
SELECT * FROM fts5_tab WHERE c='abc' AND c MATCH 'abc';
I guess in theory fts5 could do this automatically. It doesn't at the moment though.
Dan