SQLite User Forum

Trigram indexes for SQLite
Login

Trigram indexes for SQLite

(1.1) By Simon Willison (simonw) on 2020-09-16 22:07:06 edited from 1.0 [link] [source]

One of my favourite little-known PostgreSQL features is trigram indexes. They let you speed up queries that need to match substrings within a text column - queries like this:

select * from articles where title like '%smil%'

There's a great explanation of how these work in PostgreSQL here: https://about.gitlab.com/blog/2016/03/18/fast-search-using-postgresql-trigram-indexes/#trigram-indexes

I would love to have a version of this functionality in SQLite. There's an experimental module for that here, but it's not seen any work in seven years: https://github.com/jonasfj/trilite

Has anyone seen a SQLite-oriented solution for this?

(2) By Simon Slavin (slavin) on 2020-09-16 23:21:13 in reply to 1.1 [link] [source]

The reason you can't find external solutions is that it's built into SQLite.

https://www.sqlite.org/fts5.html

FTS5 is a recent addition to SQLite and you may not find a lot of examples to learn from. If you prefer to stick with tried and tested solutions you may prefer

https://www.sqlite.org/fts3.html

(3) By Scott Robison (casaderobison) on 2020-09-16 23:49:12 in reply to 2 [link] [source]

Except a trigram sequence can speed up LIKE regardless of the position of characters in a column. FTS5 won't (I don't think) support substrings within a longer string, only prefix or exact matches of tokens. For example:

SELECT * FROM tab WHERE col LIKE '%cot%'

Will return 'cot', 'rural cottage', 'cottage cheese', and 'Scott' (among other possibilities). I don't think FTS5 would support that sort of a search, where cot might be embedded within a word as it is in my name.

(4) By Simon Willison (simonw) on 2020-09-17 01:49:50 in reply to 3 [link] [source]

Yes exactly - FTS5 is amazing but it doesn't solve internal substrings in the same way that trigrams do.

(5) By Simon Willison (simonw) on 2020-09-17 01:52:59 in reply to 4 [link] [source]

... though now I'm wondering if one could use a custom FTS tokenizer to create a trigram index.

(6) By Scott Robison (casaderobison) on 2020-09-17 03:07:52 in reply to 5 [link] [source]

You could definitely tokenize a string into three character overlapping chunks. One problem might be tracking the overlapping of strings. For example, if you had a string 'ABCDEF' and you searched for 'DEFABC', you would find 'ABC' and 'DEF'. You wouldn't find 'EFA' or 'FAB'. So your MATCH clause would probably need to be customized based on what you were searching for. So the FTS index could have all the trigrams, but matching would be a little more complex than a simple LIKE. However, it should be quite fast to find even with the more complicated query formatting, and probably much easier than adding actual trigram indexes to SQLite.

(7) By Andreas Kupries (andreas-kupries) on 2020-09-17 03:12:01 in reply to 1.1 [link] [source]

An alternate to these trigram indices is a schema of the form

table word   (id, string);
table suffix (id, string);
table link   (suffix, word);
index link_s on link (suffix);

The basic idea behind this is that any substring S of a word W is the prefix of a suffix of W.

Whenever a word W is added to word, all the suffices of W are added to suffix, and link is extended also, to map from the suffices to W.

Now a substring search for the words matching %foo% becomes a prefix search foo% in suffix, and we get the relevant word id by joining link.

Roughly:

SELECT DISTINCT word
FROM   link
WHERE  suffix IN (SELECT id
	          FROM   suffix
	          WHERE  suffix LIKE 'foo%')

Note, this gives us only the word ids. To get the actual words we have to join word as well. That exercise is left for the reader.

(8) By Simon Willison (simonw) on 2020-09-17 14:59:01 in reply to 7 [link] [source]

That's really clever, and seems like it should be pretty easy to implement. Thanks!

(9) By Dan Kennedy (dan) on 2020-09-17 16:23:36 in reply to 5 [link] [source]

It's a fun idea. With the extension code in the other post, I created a dictionary db with:

    rm dict.sql
    echo ".load ./ftstri" > dict.sql
    echo "CREATE VIRTUAL TABLE dict USING fts5(word, tokenize=tri);" >> dict.sql
    echo "BEGIN;" >> dict.sql
    cat /usr/share/dict/words                    |
        sed "s/'/''/g"                       |
        sed "y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/" |
        sed "s/^/INSERT INTO dict values('/" |
        sed "s/$/');/"                       >> dict.sql
    echo "COMMIT;" >> dict.sql

    rm dict.db
    ./sqlite3 dict.db ".load ./ftstri" ".read dict.sql"

Then run some queries using the FTS5 syntax, except query tokens can match anywhere in the record text. Tokens of any length greater than 3 work quite nicely due to the way fts5 handles phrases. The AND and OR operators seem to work ok too. I suspect NEAR would give you distance in characters, but I didn't try it. The fts5 highlight() function works too, but it seems to malfunction if two or more query tokens overlap.

    sqlite> .load ./ftstri
    sqlite> .mode box
    sqlite> select * from dict('wak') LIMIT 5;
    ┌────────────┐
    │    word    │    
    ├────────────┤
    │ arawak     │    
    │ arawak's   │
    │ arawakan   │
    │ arawakan's │
    │ kwakiutl   │
    └────────────┘
    sqlite> select * from dict('wak AND ned') LIMIT 5;
    ┌────────────┐
    │    word    │    
    ├────────────┤
    │ awakened   │
    │ reawakened │
    │ wakened    │    
    └────────────┘
    sqlite> select highlight(dict, 0, '(', ')') from dict('wak AND ned') LIMIT 5;
    ┌──────────────────────────────┐
    │ highlight(dict, 0, '(', ')') │
    ├──────────────────────────────┤
    │ a(wak)e(ned)                 │    
    │ rea(wak)e(ned)               │    
    │ (wak)e(ned)                  │    
    └──────────────────────────────┘
    sqlite> select highlight(dict, 0, '(', ')') from dict('wak OR ned') LIMIT 5;
    ┌──────────────────────────────┐
    │ highlight(dict, 0, '(', ')') │
    ├──────────────────────────────┤
    │ ara(wak)                     │    
    │ ara(wak)'s                   │    
    │ ara(wak)an                   │    
    │ ara(wak)an's                 │    
    │ be(ned)ict                   │    
    └──────────────────────────────┘
    sqlite> select highlight(dict, 0, '(', ')') from dict('waken') LIMIT 5;
    ┌──────────────────────────────┐
    │ highlight(dict, 0, '(', ')') │
    ├──────────────────────────────┤
    │ a(waken)                     │    
    │ a(waken)ed                   │    
    │ a(waken)ing                  │    
    │ a(waken)ing's                │    
    │ a(waken)ings                 │    
    └──────────────────────────────┘

Details here:

https://sqlite.org/fts5.html

(10.1) By Andreas Kupries (andreas-kupries) on 2020-09-17 17:46:21 edited from 10.0 in reply to 8 [link] [source]

You are welcome.

I do not believe that the idea is original to me, although I do not remember having it seen anywhere else.

I should note that I expect the suffix table to be quite a bit larger than the word table, and the link table to be much larger. Plus the index on link.

So this looks to be very much trading memory/space for speed.

(11) By Dan Kennedy (dan) on 2020-09-17 16:27:00 in reply to 5 [source]

Code for the "ftstri" extension that provides the FTS5 tokenizer tri from the other post. I built it here on Linux with the usual recipe for building SQLite extensions:

    gcc -I. -g -fPIC -shared ftstri.c -o ftstri.so

The C code:

/**********************************************************************/
#include "sqlite3ext.h"
SQLITE_EXTENSION_INIT1
#include <assert.h>
#include <string.h>
#include <fts5.h>

static fts5_api *fts5_api_from_db(sqlite3 *db){
  fts5_api *pRet = 0;
  sqlite3_stmt *pStmt = 0;

  if( SQLITE_OK==sqlite3_prepare(db, "SELECT fts5(?1)", -1, &pStmt, 0) ){
    sqlite3_bind_pointer(pStmt, 1, (void*)&pRet, "fts5_api_ptr", NULL);
    sqlite3_step(pStmt);
  }
  sqlite3_finalize(pStmt);
  return pRet;
}

static int dummy = 0;

static int ftsTriCreate(
  void *pCtx,
  const char **azArg,
  int nArg,
  Fts5Tokenizer **ppOut
){
  *ppOut = (Fts5Tokenizer*)&dummy;
  return SQLITE_OK;
}

static void ftsTriDelete(Fts5Tokenizer *p){
  assert( p==(Fts5Tokenizer*)&dummy );
}

static int ftsTriTokenize(
  Fts5Tokenizer *pUnused,
  void *pCtx,
  int flags,
  const char *pText, int nText,
  int (*xToken)(void*, int, const char*, int, int, int)
){
  int i;
  for(i=0; i<nText-2; i++){
    int rc = xToken(pCtx, 0, &pText[i], 3, i, i+3);
    if( rc ) return rc;
  }
  return SQLITE_OK;
}

static int ftsTriInstall(sqlite3 *db){
  fts5_api *pApi;
  fts5_tokenizer tok;
  tok.xCreate = ftsTriCreate;
  tok.xDelete = ftsTriDelete;
  tok.xTokenize = ftsTriTokenize;

  pApi = fts5_api_from_db(db);
  if( pApi==0 ) return 0;

  return pApi->xCreateTokenizer(pApi, "tri", 0, &tok, 0);
}

#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_ftstri_init(
  sqlite3 *db,
  char **pzErrMsg,
  const sqlite3_api_routines *pApi
){
  SQLITE_EXTENSION_INIT2(pApi);
  return ftsTriInstall(db);
}

(12) By Simon Willison (simonw) on 2020-09-18 20:56:56 in reply to 9 [link] [source]

Whoa. I just got that working on my Mac - I had to copy ./sqlite/ext/fts5/fts5.h into the same directory as that ftstri.c file but once I did that the following:

gcc -I. -g -fPIC -shared ftstri.c -o ftstri.so

Did indeed produce a ftstri.so extension which I successfully loaded into SQLite using Python!

ftstri % python3
Python 3.8.5 (default, Jul 21 2020, 10:48:26) 
[Clang 11.0.3 (clang-1103.0.32.62)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>> c = sqlite3.connect(":memory:")
>>> c
<sqlite3.Connection object at 0x107e137b0>
>>> c.enable_load_extension(True)
>>> c.load_extension("ftstri.so")
>>> c
<sqlite3.Connection object at 0x107e137b0>
>>> c.execute("CREATE VIRTUAL TABLE dict USING fts5(word, tokenize=tri);")
<sqlite3.Cursor object at 0x107e9f880>
>>> c.execute('INSERT INTO dict values ("simon")')
<sqlite3.Cursor object at 0x107e9f8f0>
>>> c.execute('INSERT INTO dict values ("cleo")')
<sqlite3.Cursor object at 0x107e9f880>
>>> c.execute('INSERT INTO dict values ("natalie")')
<sqlite3.Cursor object at 0x107e9f8f0>
>>> c.execute('select * from dict(?)', ['simon']).fetchall()
[('simon',)]
>>> c.execute('select * from dict(?)', ['sim']).fetchall()
[('simon',)]
>>> c.execute('select * from dict(?)', ['imo']).fetchall()
[('simon',)]
>>> 
>>> 
>>> 
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['sim']).fetchall()
[('(sim)on',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['simon']).fetchall()
[('(simon)',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['mon']).fetchall()
[('si(mon)',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['cl']).fetchall()
[]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['cleo']).fetchall()
[('(cleo)',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['cle']).fetchall()
[('(cle)o',)]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['cleop']).fetchall()
[]
>>> c.execute("select highlight(dict, 0, '(', ')') from dict(?)", ['nat']).fetchall()
[('(nat)alie',)]

(13) By Simon Willison (simonw) on 2020-09-18 20:58:11 in reply to 11 [link] [source]

Do you mind if I create a GitHub repo for this, crediting you as the author of the code and adding some documentation on how to build it? Is the same license as SQLite OK?

(14) By Dan Kennedy (dan) on 2020-09-19 10:09:46 in reply to 13 [link] [source]

By all means. Please do!