Regression in snippet() function in 3.44.0
(1) By Kovid Goyal (kovidgoyal) on 2023-11-03 08:19:10
The snippet function is marking incorrect highlight extents in sqlite 3.44.0.
Sample sqlite script:
CREATE VIRTUAL TABLE fts_table USING fts5(t, tokenize = 'unicode61 remove_diacritics 2');
CREATE VIRTUAL TABLE fts_row USING fts5vocab(fts_table, row);
INSERT INTO fts_table(t) VALUES ('你dont叫mess');
SELECT term,doc FROM fts_row;
SELECT snippet(fts_table, 0, '>', '<', '...', 4) FROM fts_table WHERE fts_table MATCH '叫';
Output with sqlite < 3.44.0
dont|1
mess|1
你|1
叫|1
你dont>叫<mess
Output with sqlite 3.44.0
dont|1
mess|1
你|1
叫|1
你dont>叫mess<
Notice that the trailing < is in the wrong position: it should come right after 叫, not after mess. Note that this script is not sufficient to reproduce the problem on its own, as it uses a custom tokenizer (unicode61 here is overridden in a custom sqlite extension). However, the fts5vocab output is identical in both versions, which indicates that tokenization is correct, so the issue must be in the snippet function.
The code of the tokenizer is here: https://github.com/kovidgoyal/calibre/blob/master/src/calibre/db/sqlite_extension.cpp
However, it's not standalone and depends on ICU, the snowball stemmer, etc. But since the tokenization is correct, it shouldn't matter.
If there is some more information I can provide, please ask.
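For anyone who wants to check snippet() output mechanically, here is a minimal Python sketch that extracts the highlighted spans from the two output strings quoted above. The helper name marked_spans is hypothetical, and it assumes the '>' and '<' markers do not occur in the indexed text itself. It only inspects strings and does not run sqlite or the custom tokenizer.

```python
from typing import List


def marked_spans(snippet: str, open_mark: str = ">", close_mark: str = "<") -> List[str]:
    """Return the substrings enclosed by open_mark ... close_mark, in order.

    Hypothetical helper: assumes the markers never appear in the document text.
    """
    spans = []
    pos = 0
    while True:
        start = snippet.find(open_mark, pos)
        if start == -1:
            break
        start += len(open_mark)
        end = snippet.find(close_mark, start)
        if end == -1:
            break  # unterminated highlight; stop scanning
        spans.append(snippet[start:end])
        pos = end + len(close_mark)
    return spans


# The two snippet() results quoted in the report above.
before_3_44 = "你dont>叫<mess"
with_3_44_0 = "你dont>叫mess<"

print(marked_spans(before_3_44))  # ['叫']
print(marked_spans(with_3_44_0))  # ['叫mess']
```

The second result shows the regression directly: the highlight covers the following token (mess) as well as the matched term.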
(2) By Dan Kennedy (dan) on 2023-11-03 17:21:28 in reply to 1
Thanks for reporting this. Does it work after this change?
https://sqlite.org/src/info/8f5e9c192ff2820d
Thanks,
Dan.
(3) By Kovid Goyal (kovidgoyal) on 2023-11-04 06:27:53 in reply to 2
Yes, it does, thanks.
(4) By Kovid Goyal (kovidgoyal) on 2023-11-24 07:24:21 in reply to 2
This bug is still present in 3.44.1, and I don't see a mention of it being fixed in the release notes.
(5) By Dan Kennedy (dan) on 2023-11-24 11:09:44 in reply to 4
I don't think that one made the patch releases. It will be in 3.45.0 though.
Dan.
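Going by this thread, an application could guard against the regression by checking the linked sqlite version at runtime. A rough Python sketch follows; the function name snippet_regression_possible is hypothetical, and the affected range (3.44.0 up to, but not including, 3.45.0) is an assumption drawn only from the messages above, not from the release notes.

```python
import sqlite3


def snippet_regression_possible(version=sqlite3.sqlite_version_info):
    """Return True if `version` falls in the range this thread reports as affected.

    Assumption: every 3.44.x release has the bug and 3.45.0 contains the fix.
    """
    return (3, 44, 0) <= tuple(version) < (3, 45, 0)


# Report the sqlite library that Python's sqlite3 module is linked against.
print(sqlite3.sqlite_version, snippet_regression_possible())
```

A caller might use this to fall back to its own highlighting when the check returns True.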