SQLite

View Ticket
Login
2020-03-05
16:26 Fixed ticket [1b8d7264]: MAX yields unexpected result for UTF-16 plus 6 other changes (artifact: bf4e521e user: drh)
16:13
Change the sqlite3.pDfltColl (the default collating sequence for the database connection) so that it is the collating sequence appropriate for the database encoding, not the UTF8 collating sequence. This helps to ensure that the database encoding collation is always used, even for expressions that do not have an defined collating sequence. Ticket [1b8d7264567eb6fc]. (check-in: 4a585189 user: drh tags: trunk)
2020-03-04
22:56 New ticket [1b8d7264] MAX yields unexpected result for UTF-16. (artifact: f1ec275c user: mrigger)

Ticket Hash: 1b8d7264567eb6fc9a6149757a94d58e365ee704
Title: MAX yields unexpected result for UTF-16
Status: Fixed Type: Code_Defect
Severity: Minor Priority: Low
Subsystem: Unknown Resolution: Fixed
Last Modified: 2020-03-05 16:26:30
Version Found In: 3.32.0
User Comments:
mrigger added on 2020-03-04 22:56:29:

Consider the following statements:

PRAGMA encoding = 'UTF-16';
CREATE TABLE t0(c0 TEXT);
INSERT INTO t0(c0) VALUES ('윆'), (1);
SELECT MAX(CASE 1 WHEN 1 THEN t0.c0 END) FROM t0; -- 윆
SELECT MAX(t0.c0) FROM t0; -- 1

I would expect both queries to produce the same result, however, the first returns '윆', while the second returns 1. When using UTF-8, both queries return '윆'.


drh added on 2020-03-05 16:26:30:

The default sort order for text is whatever order the text is rendered in using memcmp() as a comparison function. That order is different depending on the text encoding. UTF-8 is mostly the same as UTF-16BE, except for a few corner cases involving surrogate pairs. But UTF-16LE is quite different. (This surprising fact about text sort order needs to be discussed on the "quirks" documentation page.)

The fact that the sort order of text is different depending on the text encoding is less of a problem than it might seem at first because:

  1. Almost all SQLite databases use the UTF8 encoding. The UTF16 encodings are seldom used. Even on Android which uses UTF16LE encoding for the Java applications, the text is usually converted to UTF8 prior to being inserted into SQLite databases.
  1. In western countries, at least, most text is ASCII, and ASCII text always sorts in the same order regardless of the encoding.
  1. Most text comparisons are for equality, and memcmp() always works for equality comparisons regardless of the text encoding.

The problem reported by this ticket arose because SQLite was using a UTF8 collating sequence for some, but not all, expressions in a database that had a UTF16LE text encoding. The fix was to ensure that the same collating sequence is used for *all* comparisons within the same database.

This ticket is labeled "minor" in severity because of the observation that it impacts hardly any databases in the wild.