SQLite Forum

ext/fts5/fts5_tokenize: does not handle tokens that contain embedded nul characters
In commit 95dca8d0c, `fts5TriTokenize()` in ext/fts5/fts5_tokenize.c was patched to prevent the trigram tokenizer from returning tokens that contain embedded nul characters. `fts5UnicodeTokenize()` has similar logic, so I think it also needs a check on `iCode` after `READ_UTF8()`. The relevant loop currently reads:

```c
    while( 1 ){
      if( zCsr>=zTerm ) goto tokenize_done;
      if( *zCsr & 0x80 ) {
        /* A character outside of the ascii range. Skip past it if it is
        ** a separator character. Or break out of the loop if it is not. */
        is = zCsr - (unsigned char*)pText;
        READ_UTF8(zCsr, zTerm, iCode);
        if( fts5UnicodeIsAlnum(p, iCode) ){
          goto non_ascii_tokenchar;
        }
      }else{
        if( a[*zCsr] ){
          is = zCsr - (unsigned char*)pText;
          goto ascii_tokenchar;
        }
        zCsr++;
      }
    }
```
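
To make the suggestion concrete, here is a minimal standalone sketch of the kind of check I have in mind. It is not the actual FTS5 code: `demo_read_utf8()`, `demo_is_alnum()`, `demo_is_ascii_tokenchar()` and `demo_token_start()` are hypothetical stand-ins for `READ_UTF8()`, `fts5UnicodeIsAlnum()`, the `a[]` lookup and the loop above. The stand-in decoder does no overlong-encoding validation, so it yields 0 for the byte pair 0xC0 0x80; whether the real `READ_UTF8()` can produce 0 on this path is an assumption I have not verified.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for READ_UTF8(): decode one code point from
** [*pz, zEnd) and advance *pz. Simplified on purpose: no validation of
** overlong encodings, surrogates or stray continuation bytes. */
static unsigned int demo_read_utf8(const unsigned char **pz,
                                   const unsigned char *zEnd){
  const unsigned char *z = *pz;
  unsigned int c;
  assert( z<zEnd );
  c = *z++;
  if( c>=0xC0 ){
    int nTrail = (c>=0xF0) ? 3 : (c>=0xE0) ? 2 : 1;
    c &= (0x3Fu >> nTrail);          /* keep the payload bits of the lead byte */
    while( nTrail-- > 0 && z<zEnd && (*z & 0xC0)==0x80 ){
      c = (c<<6) | (unsigned int)(*z++ & 0x3F);
    }
  }
  *pz = z;
  return c;
}

/* Hypothetical stand-in for fts5UnicodeIsAlnum(). It accepts every code
** point, which is the worst case for this demo. */
static int demo_is_alnum(unsigned int c){
  (void)c;
  return 1;
}

/* Hypothetical stand-in for the a[] table: ASCII letters and digits only. */
static int demo_is_ascii_tokenchar(unsigned char c){
  return (c>='0' && c<='9') || (c>='a' && c<='z') || (c>='A' && c<='Z');
}

/* Return the byte offset of the first token character in pText[0..nText),
** or -1 if there is none. Mirrors the shape of the loop quoted above. */
static int demo_token_start(const char *pText, int nText){
  const unsigned char *zCsr = (const unsigned char*)pText;
  const unsigned char *zTerm = zCsr + nText;
  while( zCsr<zTerm ){
    if( *zCsr & 0x80 ){
      int is = (int)(zCsr - (const unsigned char*)pText);
      unsigned int iCode = demo_read_utf8(&zCsr, zTerm);
      if( iCode==0 ) continue;       /* suggested check: nul is a separator */
      if( demo_is_alnum(iCode) ) return is;
    }else{
      if( demo_is_ascii_tokenchar(*zCsr) ){
        return (int)(zCsr - (const unsigned char*)pText);
      }
      zCsr++;
    }
  }
  return -1;
}
```

With the `iCode==0` check in place, `demo_token_start("\xC0\x80" "ab", 4)` skips the two-byte sequence that decodes to nul and reports the token as starting at the `a`. Without the check, the permissive classifier would start a token at offset 0.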