SQLite Forum

ext/fts3/fts3_unicode: does not handle tokens that contain embedded nul characters

(1) By Xiaohui Zhang (zxh0420) on 2021-01-04 02:31:51

In commit 95dca8d0c, fts5TriTokenize() in ext/fts5/fts5_tokenize.c was patched to prevent the trigram tokenizer from returning tokens that contain embedded nul characters. unicodeNext() in ext/fts3/fts3_unicode.c contains similar logic, so I think a corresponding check on iCode after READ_UTF8() is needed there too.

static int unicodeNext(
  sqlite3_tokenizer_cursor *pC,   /* Cursor returned by simpleOpen */
  const char **paToken,           /* OUT: Token text */
  int *pnToken,                   /* OUT: Number of bytes at *paToken */
  int *piStart,                   /* OUT: Starting offset of token */
  int *piEnd,                     /* OUT: Ending offset of token */
  int *piPos                      /* OUT: Position integer of token */
){
  ...
  while( z<zTerm ){
    READ_UTF8(z, zTerm, iCode);
    if( unicodeIsAlnum(p, (int)iCode) ) break;
    zStart = z;
  }
  if( zStart>=zTerm ) return SQLITE_DONE;
  zOut = pCsr->zToken;
  do {
    int iOut;
    /* Grow the output buffer if required. */
    if( (zOut-pCsr->zToken)>=(pCsr->nAlloc-4) ){
      char *zNew = sqlite3_realloc64(pCsr->zToken, pCsr->nAlloc+64);
      if( !zNew ) return SQLITE_NOMEM;
      zOut = &zNew[zOut - pCsr->zToken];
      pCsr->zToken = zNew;
      pCsr->nAlloc += 64;
    }
    /* Write the folded case of the last character read to the output */
    zEnd = z;
    iOut = sqlite3FtsUnicodeFold((int)iCode, p->eRemoveDiacritic);
    if( iOut ){
      WRITE_UTF8(zOut, iOut);
    }
    /* If the cursor is not at EOF, read the next character */
    if( z>=zTerm ) break;
    READ_UTF8(z, zTerm, iCode);
  }while( unicodeIsAlnum(p, (int)iCode) 
       || sqlite3FtsUnicodeIsdiacritic((int)iCode)
  );
  ...
}

(2) By Dan Kennedy (dan) on 2021-01-04 18:30:58 in reply to 1

Hi,

Does this actually cause a problem? As far as I can tell, the fts4 unicode61 tokenizer simply stops tokenizing at the first 0x00 byte in its input. That might not be ideal in some cases, but it seems internally consistent. The problem with the trigram tokenizer was that embedded 0x00 bytes caused integrity-check failures.

Thanks, Dan.