SQLite Forum

fts5 tokenizer not able to detect end of parent token stream
I'm building a tokenizer that lives in the tokenizer chain after `unicode61`. In my tokenizer I'd really like to know when the end of the token stream has been reached. I can almost do this by comparing `iEnd` (end of current token string) to `nText` (original input string length).

This comparison fails when the string ends in a character that `unicode61` strips out (like a space or a `.` character), because the last token's `iEnd` then falls short of `nText`.
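To make the failure concrete, here is a minimal, self-contained sketch of the `iEnd == nText` check. It does not use the real FTS5 API: `mock_parent_tokenize` is a hypothetical ASCII-only stand-in for `unicode61` (tokens are runs of alphanumerics), and `EndCtx`, `end_detect_cb` and `heuristic_fires` are names made up for this example. Only the callback signature is meant to mirror FTS5's `xToken`.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Same parameter list as the FTS5 xToken callback. */
typedef int (*token_cb)(void *pCtx, int tflags, const char *pToken,
                        int nToken, int iStart, int iEnd);

/* Hypothetical per-call state kept by the child tokenizer. */
typedef struct {
    int nText;     /* length of the original input, in bytes */
    int nTokens;   /* number of tokens seen so far */
    int sawFinal;  /* set when some token ends exactly at nText */
} EndCtx;

/* Child callback implementing the iEnd == nText heuristic. */
static int end_detect_cb(void *pCtx, int tflags, const char *pToken,
                         int nToken, int iStart, int iEnd) {
    EndCtx *p = (EndCtx *)pCtx;
    (void)tflags; (void)pToken; (void)nToken; (void)iStart;
    p->nTokens++;
    if (iEnd == p->nText) p->sawFinal = 1;
    return 0;  /* 0 stands in for SQLITE_OK */
}

/* ASCII-only stand-in for the parent tokenizer. This is NOT unicode61:
 * it simply treats every non-alphanumeric byte as a separator. */
static int mock_parent_tokenize(const char *pText, int nText,
                                void *pCtx, token_cb xToken) {
    int i = 0;
    assert(pText != NULL);
    while (i < nText) {
        int iStart;
        /* Skip separators, then consume one run of token characters. */
        while (i < nText && !isalnum((unsigned char)pText[i])) i++;
        iStart = i;
        while (i < nText && isalnum((unsigned char)pText[i])) i++;
        if (i > iStart) {
            int rc = xToken(pCtx, 0, pText + iStart, i - iStart, iStart, i);
            if (rc != 0) return rc;
        }
    }
    return 0;
}

/* Returns 1 if the heuristic flagged a final token for zText, else 0. */
static int heuristic_fires(const char *zText) {
    EndCtx ctx = { (int)strlen(zText), 0, 0 };
    mock_parent_tokenize(zText, ctx.nText, &ctx, end_detect_cb);
    return ctx.sawFinal;
}
```

With this sketch, `heuristic_fires("hello world")` returns 1, while `heuristic_fires("hello world.")` returns 0: the last token ends at byte 11 but `nText` is 12, so the check never matches.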

Similarly, if I implement a stopwords tokenizer and the last word in the input string is a stopword, it's no longer possible to detect the end of the stream.

It would be nice to have a new `tflags` bit, `FTS5_TOKEN_FINAL`, passed to `xToken`, that tokenizers could set to indicate that the token is the last token in the input stream.
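As a rough illustration of the requested semantics, the sketch below emulates such a flag by holding back one token: each token is forwarded only when the next one arrives, and whatever is still held once the parent's `xTokenize` has returned is forwarded with the flag set. Everything here is an assumption made for the example: `DEMO_TOKEN_FINAL` is a made-up bit (FTS5 defines no such flag), and `Delay`, `delay_token`, `delay_flush`, `Seen` and `record_token` are hypothetical names. The token text is copied because the sketch does not assume the parent's buffer stays valid after the callback returns.

```c
#include <assert.h>
#include <string.h>

#define DEMO_TOKEN_FINAL 0x0100  /* hypothetical bit; not defined by FTS5 */

/* Same parameter list as the FTS5 xToken callback. */
typedef int (*token_cb)(void *pCtx, int tflags, const char *pToken,
                        int nToken, int iStart, int iEnd);

/* Holds back one token so that the last one can be flagged. */
typedef struct {
    token_cb xNext;    /* downstream callback */
    void *pNextCtx;    /* context for the downstream callback */
    int havePending;   /* 1 if a token is currently held back */
    int tflags, nToken, iStart, iEnd;
    char aToken[64];   /* demo-sized buffer; longer tokens are truncated */
} Delay;

/* Given to the parent as its xToken: forwards the previously held
 * token (if any), then stores the current one. */
static int delay_token(void *pCtx, int tflags, const char *pToken,
                       int nToken, int iStart, int iEnd) {
    Delay *p = (Delay *)pCtx;
    int rc = 0;
    if (p->havePending) {
        rc = p->xNext(p->pNextCtx, p->tflags, p->aToken, p->nToken,
                      p->iStart, p->iEnd);
    }
    if (nToken > (int)sizeof(p->aToken)) nToken = (int)sizeof(p->aToken);
    memcpy(p->aToken, pToken, (size_t)nToken);
    p->tflags = tflags; p->nToken = nToken;
    p->iStart = iStart; p->iEnd = iEnd;
    p->havePending = 1;
    return rc;
}

/* Called once the parent has finished: the held token is the last one. */
static int delay_flush(Delay *p) {
    if (!p->havePending) return 0;
    p->havePending = 0;
    return p->xNext(p->pNextCtx, p->tflags | DEMO_TOKEN_FINAL, p->aToken,
                    p->nToken, p->iStart, p->iEnd);
}

/* Example downstream consumer: records which token carried the flag. */
typedef struct { int nSeen; int iFinal; } Seen;  /* iFinal is -1 if none */

static int record_token(void *pCtx, int tflags, const char *pToken,
                        int nToken, int iStart, int iEnd) {
    Seen *s = (Seen *)pCtx;
    (void)pToken; (void)nToken; (void)iStart; (void)iEnd;
    if (tflags & DEMO_TOKEN_FINAL) s->iFinal = s->nSeen;
    s->nSeen++;
    return 0;
}
```

Because the decision is made only after the parent has finished, this does not depend on `iEnd` or `nText`, so trailing punctuation or a dropped trailing stopword would not matter. The cost is one token copy per callback and a fixed-size buffer that a real tokenizer would need to size properly.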

My current workaround is to search for Unicode characters at the end of the string in my `xTokenize` and save a modified `nText` value in my callback context. This means my tokenizer re-implements a lot of the `unicode61` tokenizer (or of any other tokenizer that would change the input string length).