SQLite Forum

ext/fts5/fts5_tokenize: does not handle tokens that contain embedded nul characters

(1) By Xiaohui Zhang (zxh0420) on 2020-10-26 01:27:21

In commit 95dca8d0c, fts5TriTokenize() in ext/fts5/fts5_tokenize.c was patched to prevent the trigram tokenizer from returning tokens that contain embedded nul characters. There is similar logic in fts5UnicodeTokenize(), so I think there should be a check on iCode after READ_UTF8() too.

    while( 1 ){
      if( zCsr>=zTerm ) goto tokenize_done;
      if( *zCsr & 0x80 ) {
        /* A character outside of the ascii range. Skip past it if it is
        ** a separator character. Or break out of the loop if it is not. */
        is = zCsr - (unsigned char*)pText;
        READ_UTF8(zCsr, zTerm, iCode);
        if( fts5UnicodeIsAlnum(p, iCode) ){
          goto non_ascii_tokenchar;
        }
      }else{
        if( a[*zCsr] ){
          is = zCsr - (unsigned char*)pText;
          goto ascii_tokenchar;
        }
        zCsr++;
      }
    }

(2) By Dan Kennedy (dan) on 2020-10-26 13:28:27 in reply to 1

Thanks for reporting this. A unicode61 tokenizer configured to treat Unicode control characters (class Cc) as token characters was treating embedded nul characters as tokens, which causes all manner of problems. Now fixed here:

https://sqlite.org/src/info/b7b7bde9b7a03665
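
For readers following along, here is a minimal, self-contained C sketch of the problem described above. It is not the code from the linked check-in and not the FTS5 implementation. It only imitates the table lookup `a[*zCsr]` from the loop quoted in the first post, where a raw 0x00 byte has no high bit set and so takes the ASCII branch rather than READ_UTF8(). The names `SketchTokenizer`, `sketchInit()` and `sketchTokenize()` are hypothetical, and bytes >= 0x80 are treated as separators purely to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 128-entry table in the spirit of the a[] lookup quoted in
** the first post: non-zero means "this ASCII byte is a token character". */
typedef struct SketchTokenizer {
  unsigned char aTokenChar[128];
} SketchTokenizer;

/* Mark ASCII alphanumerics as token characters. If bCc is non-zero, also
** mark the ASCII control characters (0x00-0x1F and 0x7F), imitating a
** tokenizer asked to treat class Cc as token characters. If bNulGuard is
** non-zero, byte 0x00 is then forced back to "separator". */
static void sketchInit(SketchTokenizer *p, int bCc, int bNulGuard){
  int i;
  memset(p->aTokenChar, 0, sizeof(p->aTokenChar));
  for(i='0'; i<='9'; i++) p->aTokenChar[i] = 1;
  for(i='A'; i<='Z'; i++) p->aTokenChar[i] = 1;
  for(i='a'; i<='z'; i++) p->aTokenChar[i] = 1;
  if( bCc ){
    for(i=0; i<0x20; i++) p->aTokenChar[i] = 1;
    p->aTokenChar[0x7F] = 1;
  }
  if( bNulGuard ) p->aTokenChar[0] = 0;
}

/* Scan nText bytes of input. Returns the number of tokens found and sets
** *pbNulInToken to 1 if any token contains a 0x00 byte. */
static int sketchTokenize(
  const SketchTokenizer *p,
  const char *pText, int nText,
  int *pbNulInToken
){
  const unsigned char *z = (const unsigned char*)pText;
  const unsigned char *zTerm;
  int nToken = 0;

  assert( pText!=NULL && nText>=0 && pbNulInToken!=NULL );
  zTerm = z + nText;
  *pbNulInToken = 0;

  while( z<zTerm ){
    /* Skip separators. Bytes with the high bit set are simply skipped
    ** in this sketch; a real tokenizer would decode them as UTF-8. */
    while( z<zTerm && ((*z & 0x80) || p->aTokenChar[*z]==0) ) z++;
    if( z>=zTerm ) break;

    /* Consume one token, noting whether it swallows a nul byte. */
    nToken++;
    while( z<zTerm && !(*z & 0x80) && p->aTokenChar[*z] ){
      if( *z==0 ) *pbNulInToken = 1;
      z++;
    }
  }
  return nToken;
}
```

With the 5-byte input `"ab\0cd"`, calling `sketchInit(&t, 1, 0)` makes the sketch return a single token that contains the nul byte, while `sketchInit(&t, 1, 1)` makes the nul act as a separator and yields two tokens. Whether the real fix clears a table entry, checks iCode, or does something else entirely is not stated in this thread; see the check-in for the actual change.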