ext/fts5/fts5_tokenize: does not handle tokens that contain embedded nul characters
(1) By Xiaohui Zhang (zxh0420) on 2020-10-26 01:27:21 [source]
In commit 95dca8d0c, fts5TriTokenize() in ext/fts5/fts5_tokenize.c was patched to prevent the trigram tokenizer from returning tokens that contain embedded nul characters. There is similar logic in fts5UnicodeTokenize(), so I think there should be a check on iCode after READ_UTF8() too.
    while( 1 ){
      if( zCsr>=zTerm ) goto tokenize_done;
      if( *zCsr & 0x80 ) {
        /* A character outside of the ascii range. Skip past it if it is
        ** a separator character. Or break out of the loop if it is not. */
        is = zCsr - (unsigned char*)pText;
        READ_UTF8(zCsr, zTerm, iCode);
        if( fts5UnicodeIsAlnum(p, iCode) ){
          goto non_ascii_tokenchar;
        }
      }else{
        if( a[*zCsr] ){
          is = zCsr - (unsigned char*)pText;
          goto ascii_tokenchar;
        }
        zCsr++;
      }
    }
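
To make the suggestion concrete, here is a minimal standalone sketch of the kind of guard I have in mind. It is not SQLite code and not a proposed patch: sketch_read_utf8(), sketch_is_token_char() and sketch_count_tokens() are hypothetical helpers written only for this illustration, the decoder is a simplified stand-in for READ_UTF8(), and the bCcIsToken flag loosely models a tokenizer configured to accept control characters. The one point it tries to show is that code point 0 is classified as a separator no matter how it was encoded or how the tokenizer is configured.

```c
#include <stddef.h>

/* Hypothetical, simplified UTF-8 decoder (NOT SQLite's READ_UTF8 macro).
** Decodes one code point starting at *pz, never reading at or past zEnd,
** and advances *pz past the bytes consumed. Malformed input is not
** rejected; in particular the overlong form 0xC0 0x80 decodes to 0. */
static unsigned int sketch_read_utf8(
  const unsigned char **pz,
  const unsigned char *zEnd
){
  const unsigned char *z = *pz;
  unsigned int c = *z++;
  if( c>=0xC0 ){
    int nTrail = (c>=0xF0) ? 3 : (c>=0xE0) ? 2 : 1;
    c &= (0xFFu >> (nTrail+2));       /* keep only the lead byte payload */
    while( nTrail-- > 0 && z<zEnd && (*z & 0xC0)==0x80 ){
      c = (c<<6) | (*z++ & 0x3F);
    }
  }
  *pz = z;
  return c;
}

/* Hypothetical classifier. bCcIsToken models a configuration in which
** ASCII control characters count as token characters. For simplicity,
** every code point >= 0x80 is treated as a token character. */
static int sketch_is_token_char(unsigned int code, int bCcIsToken){
  if( code==0 ) return 0;             /* the guard: nul is never a token char */
  if( code>=0x80 ) return 1;
  if( (code>='0' && code<='9')
   || (code>='a' && code<='z')
   || (code>='A' && code<='Z') ) return 1;
  if( bCcIsToken && (code<0x20 || code==0x7F) ) return 1;
  return 0;
}

/* Count the tokens in the nText bytes at pText. A token is a maximal run
** of token characters; everything else is a separator. */
static int sketch_count_tokens(const char *pText, int nText, int bCcIsToken){
  const unsigned char *z = (const unsigned char*)pText;
  const unsigned char *zEnd = z + nText;
  int nToken = 0;
  int inToken = 0;
  while( z<zEnd ){
    unsigned int code = sketch_read_utf8(&z, zEnd);
    if( sketch_is_token_char(code, bCcIsToken) ){
      if( !inToken ){ nToken++; inToken = 1; }
    }else{
      inToken = 0;
    }
  }
  return nToken;
}
```

With this sketch, sketch_count_tokens("ab\0cd", 5, 1) splits the input at the nul even though control characters are enabled, whereas an ordinary control character such as 0x01 in the same position would be absorbed into a single token.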
(2) By Dan Kennedy (dan) on 2020-10-26 13:28:27 in reply to 1 [link] [source]
Thanks for reporting this. A unicode61 tokenizer configured to treat unicode "control-characters" (class Cc) as token characters was treating embedded nul characters as tokens, which causes all manner of problems. Now fixed here: