SQLite Forum

fts5 tokenizer not able to detect end of parent token stream

(1) By Michael Gauthier (gauthierm) on 2021-09-21 00:35:54

I'm building a tokenizer that lives in the tokenizer chain after unicode61. In my tokenizer I'd really like to know when the end of the token stream has been reached. I can almost do this by comparing iEnd (the byte offset just past the end of the current token) to nText (the length of the original input string).

It doesn't work when the string ends in a character that would be stripped out by unicode61 (like a space or a . character).
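To make the failure concrete, here is a minimal, self-contained sketch of the iEnd == nText check. It does not link against SQLite: `toy_parent_tokenize` is a hypothetical stand-in for the parent tokenizer that only splits on ASCII non-alphanumerics (it is not unicode61), and `EndCheckCtx`, `end_check_cb` and `detects_final_token` are names made up for illustration. Only the callback's parameter list mirrors the FTS5 xToken callback.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Same parameter list as the FTS5 xToken callback. */
typedef int (*token_cb)(void *pCtx, int tflags, const char *pToken,
                        int nToken, int iStart, int iEnd);

/* Hypothetical stand-in for the parent tokenizer (NOT unicode61):
** emits each run of ASCII alphanumerics and skips everything else. */
static int toy_parent_tokenize(void *pCtx, const char *pText, int nText,
                               token_cb xToken) {
  int i = 0;
  assert(pText != NULL && nText >= 0);
  while (i < nText) {
    int iStart;
    while (i < nText && !isalnum((unsigned char)pText[i])) i++;
    iStart = i;
    while (i < nText && isalnum((unsigned char)pText[i])) i++;
    if (i > iStart) {
      int rc = xToken(pCtx, 0, pText + iStart, i - iStart, iStart, i);
      if (rc != 0) return rc;
    }
  }
  return 0;
}

/* Callback context for the wrapping tokenizer (illustrative only). */
typedef struct {
  int nText;     /* length of the original input in bytes */
  int sawFinal;  /* set to 1 once a token with iEnd == nText is seen */
} EndCheckCtx;

/* The check described above: a token is treated as the last one
** only if it ends exactly at the end of the input. */
static int end_check_cb(void *pCtx, int tflags, const char *pToken,
                        int nToken, int iStart, int iEnd) {
  EndCheckCtx *p = (EndCheckCtx *)pCtx;
  (void)tflags; (void)pToken; (void)nToken; (void)iStart;
  if (iEnd == p->nText) p->sawFinal = 1;
  return 0;
}

/* Returns 1 if the end of the token stream was detected for zText. */
int detects_final_token(const char *zText) {
  EndCheckCtx ctx;
  ctx.nText = (int)strlen(zText);
  ctx.sawFinal = 0;
  toy_parent_tokenize(&ctx, zText, ctx.nText, end_check_cb);
  return ctx.sawFinal;
}
```

With this sketch, `detects_final_token("hello world")` reports the end of the stream, while `"hello world."` and `"hello world "` do not, because the last token ends at offset 11 and nText is 12.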

Similarly, if I implement a stopwords tokenizer and the last word in the input string is a stopword, it's no longer possible to detect the end of the stream.

It would be nice to have a new tflags bit, FTS5_TOKEN_FINAL, that tokenizers could set when calling xToken to indicate that the token is the last token in the input stream.

My current workaround is to search for Unicode characters at the end of the string in my xTokenize and save a modified nText value in my callback context. This means my tokenizer re-implements a lot of the unicode61 tokenizer (or any other tokenizer that would change the input string length).
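A rough sketch of that workaround, under a strong simplifying assumption: only ASCII non-alphanumeric bytes are treated as characters the parent tokenizer strips, and every byte >= 0x80 is assumed to belong to a token. The real unicode61 rules are broader, which is exactly the duplication complained about above. `effective_length` and `is_final_token` are hypothetical helper names, not part of the FTS5 API.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns nText reduced by the trailing bytes the
** parent tokenizer is assumed to skip. ASCII-only approximation; bytes
** >= 0x80 (UTF-8 multi-byte sequences) are assumed to be token text. */
int effective_length(const char *pText, int nText) {
  assert(pText != NULL && nText >= 0);
  while (nText > 0) {
    unsigned char c = (unsigned char)pText[nText - 1];
    if (c >= 0x80 || isalnum(c)) break;  /* last byte of a token */
    nText--;                             /* trailing separator, drop it */
  }
  return nText;
}

/* In the xToken callback, compare iEnd against the adjusted length
** saved in the callback context instead of the raw nText. */
int is_final_token(int iEnd, int nEffective) {
  return iEnd == nEffective;
}
```

For the input `"hello world."` the adjusted length is 11, so the token ending at offset 11 is recognised as the last one. This does not help with the stopword case described earlier, where the final token is dropped by an earlier tokenizer rather than by trailing separators.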