SQLite Forum

FTS5 Query times
Login

FTS5 Query times

(1) By Spaced Cowboy (SpacedCowboy) on 2020-08-11 22:19:09 [source]

I'm trying to make SQLite index a bunch of emails, and make them searchable by a bunch of criteria (full-text, flags, from header, ...)

To do this I created a 'body' table for the email body, and created an FTS index on 3 criteria - body, subject and from, thusly:


    /***************************************************************************\
    |* We store the actual content in a different database
    |*
    |* Update: No we don't. It's all in the same database, but in a different
    |* table - I think there's a way to do this using a virtual table that
    |* references a foreign database via a C interface, but you can't ATTACH
    |* a database currently, and use a table in that attached DB as the content.
    |* It'd sure be nice to be able to store the content remotely...
    \***************************************************************************/
    "DROP TABLE IF EXISTS body",
    
    "CREATE TABLE body"
    "	("						// implicit rowid
    "	body				BLOB,"		// The body
    "	subject				BLOB,"		// The subject
    "	fromTxt				BLOB"		// name & email
    "	)",

    /***************************************************************************\
    |* Add the FTS index
    \***************************************************************************/
    "DROP TABLE IF EXISTS fts",
	
    "CREATE VIRTUAL TABLE fts "    			 // full-text search
    "USING fts5("
    "	body,"                  	 		 // from body tbl
    "	subject,"					 // from body tbl
    "	fromTxt,"					 // from body tbl
    "	content=body,"					 // table to store in
    "	tokenize = 'porter unicode61'"			 // handle stemming
    "	)", 

    /***************************************************************************\
    |* Experiment with prefix tables for speed vs space
    \***************************************************************************/
    "CREATE VIRTUAL TABLE fts_prefix "			 // prefix index
    "USING fts5("
    "			body,"				 // from body tbl
    "			subject,"			 // from body tbl
    "			fromTxt,"			 // from body tbl
    "			prefix='2 3 4 5'"		 // Prefix lengths
    "			)",

So, after ingesting my mailboxes, there's about 65k emails in there...

@tanelorn mdir % tsql mail.db
tsql> select count(*) from email;
+----------+
| count(*) |
+----------+
| 65992    |
+----------+
Time taken: 0.012128 secs

... and to make the search query as generic as possible, I opted to do a series of 'select id/rowid using criteria-1 INTERSECT select id/rowid using criteria-2 INTERSECT ...' so I could switch in and out any of the criteria that the user has selected. I figured that I could get an INTERSECTion of all the various criteria to get a bunch (say 25) of row-ids, then go off and get the data using just those row-ids.

This works well on standard tables, where I have an 'id' column defined as INTEGER PRIMARY KEY (Rows omitted for brevity - there's no information content here)

tsql> SELECT id, subject, createDate, size, attachments FROM email WHERE id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25);
Time taken: 0.000566 secs

but when I try it on the FTS table, I'm getting a far longer response even though it eventually returns exactly the same rows - this is what led me to try the fts_prefix table creation.

tsql>  SELECT rowId FROM fts('th*') WHERE rowid IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25);
Time taken: 3.644571 secs

However, if I do:

tsql> SELECT rowId, SNIPPET(fts, 0, '<b><u>', '</u></b>', '...', 48) FROM fts('th*') limit 25;
Time taken: 0.134950 secs

... which leads me to believe it's more of a "we're not using an index" problem. I've just re-read the SQLite documentation about rowid, and it does seem to say that any table without a PRIMARY KEY defined automatically gets a 'rowid' column applied to it, which is keyed.

I can work around it - the 'body' table has the content, and it's indexed with a PRIMARY KEY, so I can implement the snippet() function myself and just pull the data from there, which is much faster:

tsql>  SELECT rowId,body FROM body WHERE rowid IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25);
Time taken: 0.000536 secs

None of the examples I've seen ever apply an INTEGER PRIMARY KEY to the fts5 table either. Is the way I'm going about it just an anti-pattern, or have I missed something ? Because currently it can do the main query with a half-dozen criteria (including one to the FTS table) in the blink of an eye, and then take a subjective eternity to return snippets from the actual 25 rows from the FTS table that I'm interested in :)

Suggestions very much gratefully received :)