fts5 tokenizer not able to detect end of parent token stream
(1) By Michael Gauthier (gauthierm) on 2021-09-21 00:35:54
I'm building a tokenizer that lives in the tokenizer chain after unicode61. In my tokenizer I'd really like to know when the end of the token stream has been reached. I can almost do this by comparing iEnd (end of current token string) to nText (original input string length). It doesn't work when the string ends in a character that would be stripped out by unicode61 (like a space or a . character).
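To make the failure concrete, here is a minimal sketch of the iEnd == nText check. It does not use the real FTS5 API: parent_tokenize is a hypothetical ASCII-only stand-in for unicode61, and EndCheckCtx, end_check_cb and detects_final_token are illustrative names. Only the callback signature is meant to match the shape of the FTS5 xToken callback.

```c
#include <ctype.h>
#include <string.h>

/* Same parameter list as the FTS5 xToken callback. */
typedef int (*token_cb)(void *pCtx, int tflags, const char *pToken,
                        int nToken, int iStart, int iEnd);

/* Hypothetical stand-in for the parent tokenizer (NOT unicode61 itself):
** splits on any byte that is not an ASCII letter or digit. */
int parent_tokenize(void *pCtx, const char *pText, int nText, token_cb xToken) {
  int i = 0;
  while (i < nText) {
    int iStart;
    /* Skip separators, then consume one token. */
    while (i < nText && !isalnum((unsigned char)pText[i])) i++;
    iStart = i;
    while (i < nText && isalnum((unsigned char)pText[i])) i++;
    if (i > iStart) {
      int rc = xToken(pCtx, 0, pText + iStart, i - iStart, iStart, i);
      if (rc != 0) return rc;
    }
  }
  return 0;
}

/* Context for the downstream tokenizer's callback (illustrative). */
typedef struct EndCheckCtx {
  int nText;     /* length of the original input */
  int sawFinal;  /* set to 1 if some token ended exactly at nText */
} EndCheckCtx;

/* The naive end-of-stream test: the token is final if iEnd == nText. */
int end_check_cb(void *pCtx, int tflags, const char *pToken,
                 int nToken, int iStart, int iEnd) {
  EndCheckCtx *p = (EndCheckCtx *)pCtx;
  (void)tflags; (void)pToken; (void)nToken; (void)iStart;
  if (iEnd == p->nText) p->sawFinal = 1;
  return 0;
}

/* Returns 1 if the naive test recognised a final token, else 0. */
int detects_final_token(const char *zText) {
  EndCheckCtx ctx;
  ctx.nText = (int)strlen(zText);
  ctx.sawFinal = 0;
  parent_tokenize(&ctx, zText, ctx.nText, end_check_cb);
  return ctx.sawFinal;
}
```

With this sketch, detects_final_token("hello world") returns 1, but detects_final_token("hello world.") returns 0: the last token ends at offset 11 while nText is 12, so the check never fires.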
Similarly, if I implement a stopwords tokenizer and the last word in the input string is a stopword, it's no longer possible to detect the end of the stream.
It would be nice to have a new FTS5_TOKEN_FINAL flag that tokenizers could pass to xToken to indicate that the token is the last token in the input stream.
My current workaround is to search for unicode characters at the end of the string in my xTokenize and save a modified nText value in my callback context. This means my tokenizer re-implements a lot of the unicode61 tokenizer (or any other tokenizer that would change the input string length).
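A rough sketch of that workaround, under simplifying assumptions: it trims trailing bytes that are not ASCII letters or digits, whereas a real version would have to mirror unicode61's full separator rules. parent_tokenize is again a hypothetical stand-in for the parent tokenizer, and effective_length, FinalCtx, final_cb and final_token_index are illustrative names, not part of FTS5.

```c
#include <ctype.h>
#include <string.h>

/* Same parameter list as the FTS5 xToken callback. */
typedef int (*token_cb)(void *pCtx, int tflags, const char *pToken,
                        int nToken, int iStart, int iEnd);

/* Hypothetical ASCII-only stand-in for the parent tokenizer. */
int parent_tokenize(void *pCtx, const char *pText, int nText, token_cb xToken) {
  int i = 0;
  while (i < nText) {
    int iStart;
    while (i < nText && !isalnum((unsigned char)pText[i])) i++;
    iStart = i;
    while (i < nText && isalnum((unsigned char)pText[i])) i++;
    if (i > iStart) {
      int rc = xToken(pCtx, 0, pText + iStart, i - iStart, iStart, i);
      if (rc != 0) return rc;
    }
  }
  return 0;
}

/* Duplicate the parent's notion of "separator" to find where the last
** token can end. This duplication is the cost of the workaround. */
int effective_length(const char *pText, int nText) {
  while (nText > 0 && !isalnum((unsigned char)pText[nText - 1])) nText--;
  return nText;
}

typedef struct FinalCtx {
  int nEffective;  /* nText with trailing separators trimmed */
  int nSeen;       /* number of tokens received so far */
  int iFinal;      /* index of the token detected as final, or -1 */
} FinalCtx;

/* Compare iEnd against the trimmed length instead of the raw nText. */
int final_cb(void *pCtx, int tflags, const char *pToken,
             int nToken, int iStart, int iEnd) {
  FinalCtx *p = (FinalCtx *)pCtx;
  (void)tflags; (void)pToken; (void)nToken; (void)iStart;
  if (iEnd == p->nEffective) p->iFinal = p->nSeen;
  p->nSeen++;
  return 0;
}

/* Returns the 0-based index of the final token, or -1 if none was found. */
int final_token_index(const char *zText) {
  FinalCtx ctx;
  int nText = (int)strlen(zText);
  ctx.nEffective = effective_length(zText, nText);
  ctx.nSeen = 0;
  ctx.iFinal = -1;
  parent_tokenize(&ctx, zText, nText, final_cb);
  return ctx.iFinal;
}
```

Here final_token_index("hello world.") returns 1, because iEnd is compared against the trimmed length 11 rather than 12. The sketch does not help with the stopword case, where the parent drops the last word entirely.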