SQLite Forum

List of available FTS tokenizers

(1) By Karl Bartel (karl42) on 2022-01-06 20:24:05 [link] [source]

Since both FTS3/4 and FTS5 provide ways to supply custom tokenizers, I was expecting that people would have developed a multitude of different tokenizers over the years. However, I have trouble finding tokenizers for languages with non-Latin scripts (e.g. Greek or Japanese) apart from the built-in ICU tokenizer, and I can hardly find any mature tokenizers in general (outside of the built-in ones).

Is there an overview of existing tokenizers somewhere? Can someone point me to any interesting related resources? Is someone working on an ICU tokenizer for FTS5?

Thanks,
Karl

(2) By Vadim Goncharov (nuclight) on 2022-01-09 20:59:12 in reply to 1 [source]

I don't know of a good (exhaustive) list, but some tokenizers are available in libraries for various programming languages. For Perl, for example, there are https://metacpan.org/pod/Lingua::Stem::Snowball and https://metacpan.org/pod/Lingua::Stem (see Russian for an example of a non-Latin script), along with the Search::Tokenizer module for writing your own tokenizers, which is referenced from https://metacpan.org/pod/DBD::SQLite::Fulltext_search (it covers only European languages, though).

(3) By Karl Bartel (karl42) on 2022-01-17 07:00:22 in reply to 2 [link] [source]

Thanks for the answer. After looking around a bit more, https://github.com/abiliojr/fts5-snowball looks interesting too, since it works at the C level and is thus available to most SQLite users, not just those using Perl.

Snowball has the limitation that it is only a stemmer and does not do word splitting for languages where this is non-trivial (i.e. word boundaries are not indicated by spaces).
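
To illustrate that limitation, here is a rough sketch in plain Python (not SQLite or Snowball code). It assumes a naive splitter that treats every run of word characters as one token, and `toy_stem` is a hypothetical stand-in for a real stemmer that only strips a trailing "s". The point is just that stemming happens after splitting, so it cannot help when the splitter returns a whole sentence as a single token.

```python
import re

def split_on_word_chars(text):
    """Naive splitter: each run of Unicode word characters becomes one token."""
    return re.findall(r"\w+", text)

def toy_stem(token):
    """Hypothetical stand-in for a stemmer: strips one trailing 's'."""
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token

def tokenize(text):
    """Split first, then stem each token (the order a stemmer-only setup relies on)."""
    return [toy_stem(t) for t in split_on_word_chars(text)]

english = "tokenizers split words"
japanese = "東京都に住んでいます"  # no spaces between the words

print(tokenize(english))   # ['tokenizer', 'split', 'word']
print(tokenize(japanese))  # ['東京都に住んでいます']
```

The English input comes out as three stemmed tokens, while the Japanese sentence stays one unsplit token, so a search for a single word inside it would not match. A tokenizer for such languages needs a real word segmentation step (dictionary or ICU based, for example) before any stemming.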