Query results change after adding index to UTF-16be database

(1) By Rashad Sookram (rsookram) on 2025-01-31 03:30:31 [source]

PRAGMA encoding = 'UTF-16be';

CREATE TABLE Example(word TEXT NOT NULL);
CREATE INDEX Example_word on Example(word);

INSERT INTO Example VALUES('み');

SELECT * FROM Example WHERE word GLOB 'み*';

In the example above, no rows were returned for the query, but I expected the 'み' row to be returned. I thought that the query would be equivalent to the following:

SELECT * FROM Example WHERE word >= 'み' AND word < 'む';

This query returned the row as expected however, so it looks like something went wrong for the GLOB query. This problem only happens when using UTF-16be and having an index on the column being searched. If either a different encoding is used for the database, or the index isn't added, the GLOB works as expected.

Seen in SQLite version 3.46.1 on Linux. I also tried on a recent commit and it was the same there.

(2) By Bo Lindbergh (_blgl_) on 2025-01-31 12:44:09 in reply to 1 [link] [source]

Here's the bug: SQLite generates the upper bound of the range by incrementing the last byte of the UTF-8 representation of the prefix. This generates garbage when that last byte is x'BF', but the garbage still works as long as the database uses UTF-8.

'み' in UTF-8 is x'E381BF'.

(3) By Dan Kennedy (dan) on 2025-01-31 15:56:28 in reply to 2 [link] [source]

That's it alright. Hopefully now fixed here:

https://sqlite.org/src/info/4b4f33d7

Dan.

(4) By Rashad Sookram (rsookram) on 2025-01-31 16:08:57 in reply to 3 [link] [source]

Thanks! I built the latest code and it works for me too.