fts5TriTokenize sometimes reads past the end of non-null terminated input strings
(1) By jnosh (sqlite_jnosh) on 2024-10-13 22:34:56 [link] [source]
Quoting from the xTokenize documentation:

This function is expected to tokenize the nText byte string indicated by argument pText. pText may or may not be nul-terminated.
fts5TriTokenize doesn't always handle non-null-terminated input strings correctly, reading past the end of the string and tokenizing the memory contents beyond the input string in some cases.
A very simple example is an empty string (nText=0, pText=<random pointer>), but it can also happen with non-empty strings under some circumstances.
From a cursory look at the source, fts5TriTokenize seems to rely on READ_UTF8 to check if it has hit the end of the input string, which isn't sufficient: it needs to verify that it is already at the end of the input before using READ_UTF8.
READ_UTF8 has a check for zIn!=zTerm, where zTerm points to the end of the string and zIn points to the next byte to read. However, READ_UTF8 always unconditionally reads a byte from zIn and increments zIn when used. So if zIn == zTerm when READ_UTF8 is used, it will read past the end of the input and only stop once it hits a null terminator.
#define READ_UTF8(zIn, zTerm, c)                     \
  c = *(zIn++);                                      \
  if( c>=0xc0 ){                                     \
    c = sqlite3Utf8Trans1[c-0xc0];                   \
    while( zIn!=zTerm && (*zIn & 0xc0)==0x80 ){      \
      c = (c<<6) + (0x3f & *(zIn++));                \
    }                                                \
    if( c<0x80                                       \
        || (c&0xFFFFF800)==0xD800                    \
        || (c&0xFFFFFFFE)==0xFFFE ){ c = 0xFFFD; }   \
  }
(2.1) By Stephan Beal (stephan) on 2024-10-13 22:40:00 edited from 2.0 in reply to 1 [link] [source]
From a cursory look at the source,
For future reference: please always refer to a specific version. We often get reports from folks using arbitrarily old versions of source code which no longer match reality. The canonical source code is that hosted at src:/, not the (read-only) github clone (which none of this project's members use).
when READ_UTF8 is used, it will read past the end of the input and only stop once it hits a null-terminator.
Can you confirm that it is ever called in such a way that that can happen? i.e. is this a genuine issue or a hypothetical one which requires that READ_UTF8 be used in an unexpected way? (Edit: i ask that because we often get reports of "security vulnerabilities" based on isolated analysis (sometimes LLM-driven) of functions which are never actually used in ways which could trigger those vulnerabilities.)
(3) By jnosh (sqlite_jnosh) on 2024-10-13 22:59:57 in reply to 2.0 [source]
refer to a specific version
Apologies. I encountered the issue with version 3.46.1, and the source I get from downloading the trunk archive seems to be identical in the affected area.
Can you confirm that it is ever called in such a way that that can happen? i.e. is this a genuine issue or a hypothetical one which requires that READ_UTF8 be used in an unexpected way?
I do not know if SQLite itself currently ever calls tokenizers with non-null-terminated strings. But since the API contract for xTokenize allows non-null-terminated strings it might in theory hit this case in the future even if it does not do so now.
xTokenize can also be called by application code which is where I ran into this. Specifically when writing test cases for tokenization in a higher level library that wraps SQLite. The tokenizer is invoked manually in that case to verify some tokenization test cases. Another use case where I could see this happen is when writing a custom wrapper tokenizer, where the wrapped tokenizer is also invoked directly by the client code and not SQLite.
(4) By Roger Binns (rogerbinns) on 2024-10-14 14:03:54 in reply to 3 [link] [source]
The xTokenize API can indeed easily be called by other code, and can be done so in my code base. The xTokenize API is passed bytes and there is no guarantee that they are valid UTF-8 or have any kind of termination.
The trigram tokenizer also stops at the first \0, which is a bit strange - for example unicode61 doesn't.
I agree with your analysis of how the read beyond the end can happen, and it would indeed keep going until a \0 is encountered. At the very least the READ_UTF8 should use < not != to check for end of buffer:

zIn < zTerm
(5) By Stephan Beal (stephan) on 2024-10-15 13:16:38 in reply to 4 [link] [source]
At the very least the READ_UTF8 should use < not != to check for end of buffer.
FYI: Richard checked in two instances of that change last night.
(6) By jnosh (sqlite_jnosh) on 2024-10-26 11:02:22 in reply to 5 [link] [source]
Great, thank you, that should prevent the unbounded read past the end of the string!
Unless I missed another change when looking at trunk, I think this can still read one byte past the end of a string that is not null terminated.
fts5TriTokenize calls READ_UTF8 unconditionally, and READ_UTF8 currently always reads the first byte of the target string. For example, with an empty, non-null-terminated string, fts5TriTokenize would call READ_UTF8, which would then read one byte past the end of the string.
So I think there also needs to be an initial check for whether the string is already at the end, so READ_UTF8 doesn't dereference at all in that case.
(7) By Stephan Beal (stephan) on 2024-11-12 07:56:37 in reply to 6 [link] [source]
So I think there also needs to be an initial check if the string is already at the end so READ_UTF8 doesn't dereference at all in that case.
FYI: Dan checked in a change for that yesterday.
(8) By jnosh (sqlite_jnosh) on 2024-11-12 10:41:01 in reply to 7 [link] [source]
Much appreciated!