FTS5 Query times
(1) By Spaced Cowboy (SpacedCowboy) on 2020-08-11 22:19:09 [source]
I'm trying to make SQLite index a bunch of emails, and make them searchable by a bunch of criteria (full-text, flags, from header, ...)
To do this I created a 'body' table for the email body, and created an FTS index on 3 criteria - body, subject and from, thusly:
/***************************************************************************\
|* We store the actual content in a different database
|*
|* Update: No we don't. It's all in the same database, but in a different
|* table - I think there's a way to do this using a virtual table that
|* references a foreign database via a C interface, but you can't ATTACH
|* a database currently, and use a table in that attached DB as the content.
|* It'd sure be nice to be able to store the content remotely...
\***************************************************************************/
"DROP TABLE IF EXISTS body",
"CREATE TABLE body"
" (" // implicit rowid
" body BLOB," // The body
" subject BLOB," // The subject
" fromTxt BLOB" // name & email
" )",
/***************************************************************************\
|* Add the FTS index
\***************************************************************************/
"DROP TABLE IF EXISTS fts",
"CREATE VIRTUAL TABLE fts " // full-text search
"USING fts5("
" body," // from body tbl
" subject," // from body tbl
" fromTxt," // from body tbl
" content=body," // table to store in
" tokenize = 'porter unicode61'" // handle stemming
" )",
/***************************************************************************\
|* Experiment with prefix tables for speed vs space
\***************************************************************************/
"CREATE VIRTUAL TABLE fts_prefix " // prefix index
"USING fts5("
" body," // from body tbl
" subject," // from body tbl
" fromTxt," // from body tbl
" prefix='2 3 4 5'" // Prefix lengths
" )",
So, after ingesting my mailboxes, there's about 65k emails in there...
@tanelorn mdir % tsql mail.db
tsql> select count(*) from email;
+----------+
| count(*) |
+----------+
| 65992 |
+----------+
Time taken: 0.012128 secs
... and to make the search query as generic as possible, I opted to do a series of 'select id/rowid using criteria-1 INTERSECT select id/rowid using criteria-2 INTERSECT ...' so I could switch in and out any of the criteria that the user has selected. I figured that I could get an INTERSECTion of all the various criteria to get a bunch (say 25) of row-ids, then go off and get the data using just those row-ids.
This works well on standard tables, where I have an 'id' column defined as INTEGER PRIMARY KEY (Rows omitted for brevity - there's no information content here)
tsql> SELECT id, subject, createDate, size, attachments FROM email WHERE id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25);
Time taken: 0.000566 secs
but when I try it on the FTS table, I'm getting a far longer response even though it eventually returns exactly the same rows - this is what led me to try the fts_prefix table creation.
tsql> SELECT rowId FROM fts('th*') WHERE rowid IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25);
Time taken: 3.644571 secs
However, if I do:
tsql> SELECT rowId, SNIPPET(fts, 0, '<b><u>', '</u></b>', '...', 48) FROM fts('th*') limit 25;
Time taken: 0.134950 secs
... which leads me to believe it's more of a "we're not using an index" problem. I've just re-read the SQLite documentation about rowid, and it does seem to say that any table without a PRIMARY KEY defined automatically gets a 'rowid' column applied to it, which is keyed.
I can work around it - the 'body' table has the content, and it's indexed with a PRIMARY KEY, so I can implement the snippet() function myself and just pull the data from there, which is much faster:
tsql> SELECT rowId,body FROM body WHERE rowid IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25);
Time taken: 0.000536 secs
None of the examples I've seen ever apply an INTEGER PRIMARY KEY to the fts5 table either. Is the way I'm going about it just an anti-pattern, or have I missed something ? Because currently it can do the main query with a half-dozen criteria (including one to the FTS table) in the blink of an eye, and then take a subjective eternity to return snippets from the actual 25 rows from the FTS table that I'm interested in :)
Suggestions very much gratefully received :)