Trigram indexes for SQLite
(1.1) By Simon Willison (simonw) on 2020-09-16 22:07:06 edited from 1.0 [link] [source]
One of my favourite little-known PostgreSQL features is trigram indexes. They let you speed up queries that need to match substrings within a text column - queries like this:
select * from articles where title like '%smil%'
There's a great explanation of how these work in PostgreSQL here: https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/#trigram-indexes
I would love to have a version of this functionality in SQLite. There's an experimental module for that here, but it's not seen any work in seven years: https://github.com/jonasfj/trilite
Has anyone seen a SQLite-oriented solution for this?
(2) By Simon Slavin (slavin) on 2020-09-16 23:21:13 in reply to 1.1 [link] [source]
The reason you can't find external solutions is that it's built into SQLite.
https://www.sqlite.org/fts5.html
FTS5 is a recent addition to SQLite and you may not find a lot of examples to learn from. If you prefer to stick with tried and tested solutions you may prefer
(3) By Scott Robison (casaderobison) on 2020-09-16 23:49:12 in reply to 2 [link] [source]
Except a trigram sequence can speed up LIKE regardless of the position of characters in a column. FTS5 won't (I don't think) support substrings within a longer string, only prefix or exact matches of tokens. For example:
SELECT * FROM tab WHERE col LIKE '%cot%'
Will return 'cot', 'rural cottage', 'cottage cheese', and 'Scott' (among other possibilities). I don't think FTS5 would support that sort of a search, where cot might be embedded within a word as it is in my name.
(4) By Simon Willison (simonw) on 2020-09-17 01:49:50 in reply to 3 [link] [source]
Yes exactly - FTS5 is amazing but it doesn't solve internal substrings in the same way that trigrams do.
(5) By Simon Willison (simonw) on 2020-09-17 01:52:59 in reply to 4 [link] [source]
... though now I'm wondering if one could use a custom FTS tokenizer to create a trigram index.
(6) By Scott Robison (casaderobison) on 2020-09-17 03:07:52 in reply to 5 [link] [source]
You could definitely tokenize a string into three character overlapping chunks. One problem might be tracking the overlapping of strings. For example, if you had a string 'ABCDEF' and you searched for 'DEFABC', you would find 'ABC' and 'DEF'. You wouldn't find 'EFA' or 'FAB'. So your MATCH clause would probably need to be customized based on what you were searching for. So the FTS index could have all the trigrams, but matching would be a little more complex than a simple LIKE. However, it should be quite fast to find even with the more complicated query formatting, and probably much easier than adding actual trigram indexes to SQLite.
(7) By Andreas Kupries (andreas-kupries) on 2020-09-17 03:12:01 in reply to 1.1 [link] [source]
An alternate to these trigram indices is a schema of the form
table word (id, string);
table suffix (id, string);
table link (suffix, word);
index link_s on link (suffix);
The basic idea behind this is that any substring S of a word W is the prefix of a suffix of W.
Whenever a word W is added to word
, all the suffices of W are
added to suffix
, and link
is extended also, to map from the
suffices to W.
Now a substring search for the words matching %foo%
becomes a
prefix search foo%
in suffix
, and we get the relevant word
id by joining link
.
Roughly:
SELECT DISTINCT word
FROM link
WHERE suffix IN (SELECT id
FROM suffix
WHERE suffix LIKE 'foo%')
Note, this gives us only the word ids. To get the actual words
we have to join word
as well. That exercise is left for the
reader.
(8) By Simon Willison (simonw) on 2020-09-17 14:59:01 in reply to 7 [link] [source]
That's really clever, and seems like it should be pretty easy to implement. Thanks!
(9) By Dan Kennedy (dan) on 2020-09-17 16:23:36 in reply to 5 [link] [source]
It's a fun idea. With the extension code in the other post, I created a dictionary db with:
rm dict.sql echo ".load ./ftstri" > dict.sql echo "CREATE VIRTUAL TABLE dict USING fts5(word, tokenize=tri);" >> dict.sql echo "BEGIN;" >> dict.sql cat /usr/share/dict/words | sed "s/'/''/g" | sed "y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/" | sed "s/^/INSERT INTO dict values('/" | sed "s/$/');/" >> dict.sql echo "COMMIT;" >> dict.sql rm dict.db ./sqlite3 dict.db ".load ./ftstri" ".read dict.sql"
Then run some queries using the FTS5 syntax, except query tokens can match anywhere in the record text. Tokens of any length greater than 3 work quite nicely due to the way fts5 handles phrases. The AND and OR operators seem to work ok too. I suspect NEAR would give you distance in characters, but I didn't try it. The fts5 highlight() function works too, but it seems to malfunction if two or more query tokens overlap.
sqlite> .load ./ftstri sqlite> .mode box sqlite> select * from dict('wak') LIMIT 5; ┌────────────┐ │ word │ ├────────────┤ │ arawak │ │ arawak's │ │ arawakan │ │ arawakan's │ │ kwakiutl │ └────────────┘ sqlite> select * from dict('wak AND ned') LIMIT 5; ┌────────────┐ │ word │ ├────────────┤ │ awakened │ │ reawakened │ │ wakened │ └────────────┘ sqlite> select highlight(dict, 0, '(', ')') from dict('wak AND ned') LIMIT 5; ┌──────────────────────────────┐ │ highlight(dict, 0, '(', ')') │ ├──────────────────────────────┤ │ a(wak)e(ned) │ │ rea(wak)e(ned) │ │ (wak)e(ned) │ └──────────────────────────────┘ sqlite> select highlight(dict, 0, '(', ')') from dict('wak OR ned') LIMIT 5; ┌──────────────────────────────┐ │ highlight(dict, 0, '(', ')') │ ├──────────────────────────────┤ │ ara(wak) │ │ ara(wak)'s │ │ ara(wak)an │ │ ara(wak)an's │ │ be(ned)ict │ └──────────────────────────────┘ sqlite> select highlight(dict, 0, '(', ')') from dict('waken') LIMIT 5; ┌──────────────────────────────┐ │ highlight(dict, 0, '(', ')') │ ├──────────────────────────────┤ │ a(waken) │ │ a(waken)ed │ │ a(waken)ing │ │ a(waken)ing's │ │ a(waken)ings │ └──────────────────────────────┘
Details here:
(10.1) By Andreas Kupries (andreas-kupries) on 2020-09-17 17:46:21 edited from 10.0 in reply to 8 [link] [source]
You are welcome.
I do not believe that the idea is original to me, although I do not remember having it seen anywhere else.
I should note that I expect the suffix
table to be quite a bit larger than the word
table, and the link
table to be much larger. Plus the index on link
.
So this looks to be very much trading memory/space for speed.
(11) By Dan Kennedy (dan) on 2020-09-17 16:27:00 in reply to 5 [source]
Code for the "ftstri" extension that provides the FTS5 tokenizer tri from the other post. I built it here on Linux with the usual recipe for building SQLite extensions:
gcc -I. -g -fPIC -shared ftstri.c -o ftstri.so
The C code:
/**********************************************************************/ #include "sqlite3ext.h" SQLITE_EXTENSION_INIT1 #include <assert.h> #include <string.h> #include <fts5.h> static fts5_api *fts5_api_from_db(sqlite3 *db){ fts5_api *pRet = 0; sqlite3_stmt *pStmt = 0; if( SQLITE_OK==sqlite3_prepare(db, "SELECT fts5(?1)", -1, &pStmt, 0) ){ sqlite3_bind_pointer(pStmt, 1, (void*)&pRet, "fts5_api_ptr", NULL); sqlite3_step(pStmt); } sqlite3_finalize(pStmt); return pRet; } static int dummy = 0; static int ftsTriCreate( void *pCtx, const char **azArg, int nArg, Fts5Tokenizer **ppOut ){ *ppOut = (Fts5Tokenizer*)&dummy; return SQLITE_OK; } static void ftsTriDelete(Fts5Tokenizer *p){ assert( p==(Fts5Tokenizer*)&dummy ); } static int ftsTriTokenize( Fts5Tokenizer *pUnused, void *pCtx, int flags, const char *pText, int nText, int (*xToken)(void*, int, const char*, int, int, int) ){ int i; for(i=0; i<nText-2; i++){ int rc = xToken(pCtx, 0, &pText[i], 3, i, i+3); if( rc ) return rc; } return SQLITE_OK; } static int ftsTriInstall(sqlite3 *db){ fts5_api *pApi; fts5_tokenizer tok; tok.xCreate = ftsTriCreate; tok.xDelete = ftsTriDelete; tok.xTokenize = ftsTriTokenize; pApi = fts5_api_from_db(db); if( pApi==0 ) return 0; return pApi->xCreateTokenizer(pApi, "tri", 0, &tok, 0); } #ifdef _WIN32 __declspec(dllexport) #endif int sqlite3_ftstri_init( sqlite3 *db, char **pzErrMsg, const sqlite3_api_routines *pApi ){ SQLITE_EXTENSION_INIT2(pApi); return ftsTriInstall(db); }
(12) By Simon Willison (simonw) on 2020-09-18 20:56:56 in reply to 9 [link] [source]
Whoa. I just got that working on my Mac - I had to copy ./sqlite/ext/fts5/fts5.h
into the same directory as that ftstri.c
file but once I did that the following:
gcc -I. -g -fPIC -shared ftstri.c -o ftstri.so
Did indeed produce a ftstri.so
extension which I successfully loaded into SQLite using Python!
ftstri % python3
Python 3.8.5 (default, Jul 21 2020, 10:48:26)
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>> c = sqlite3.connect(":memory:")
>>> c
<sqlite3.Connection object at 0x107e137b0>
>>> c.enable_load_extension(True)
>>> c.load_extension("ftstri.so")
>>> c
<sqlite3.Connection object at 0x107e137b0>
>>> c.execute("CREATE VIRTUAL TABLE dict USING fts5(word, tokenize=tri);")
<sqlite3.Cursor object at 0x107e9f880>
>>> c.execute('INSERT INTO dict values ("simon")')
<sqlite3.Cursor object at 0x107e9f8f0>
>>> c.execute('INSERT INTO dict values ("cleo")')
<sqlite3.Cursor object at 0x107e9f880>
>>> c.execute('INSERT INTO dict values ("natalie")')
<sqlite3.Cursor object at 0x107e9f8f0>
>>> c.execute('select * from dict(?)', ['simon']).fetchall()
[('simon',)]
>>> c.execute('select * from dict(?)', ['sim']).fetchall()
[('simon',)]
>>> c.execute('select * from dict(?)', ['imo']).fetchall()
[('simon',)]
>>>
>>>
>>>
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['sim']).fetchall()
[('(sim)on',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['simon']).fetchall()
[('(simon)',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['mon']).fetchall()
[('si(mon)',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['cl']).fetchall()
[]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['cleo']).fetchall()
[('(cleo)',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['cle']).fetchall()
[('(cle)o',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['cleop']).fetchall()
[]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['nat']).fetchall()
[('(nat)alie',)]
(13) By Simon Willison (simonw) on 2020-09-18 20:58:11 in reply to 11 [link] [source]
Do you mind if I create a GitHub repo for this, crediting you as the author of the code and adding some documentation on how to build it? Is the same license as SQLite OK?
(14) By Dan Kennedy (dan) on 2020-09-19 10:09:46 in reply to 13 [link] [source]
By all means. Please do!