SQLite Forum

How to handle collation version changes
Login

How to handle collation version changes

(1) By David Matson (dmatson) on 2021-07-08 03:08:51 [link] [source]

According to the collation documentation, collating functions must obey some important constraints for SQLite to work correctly.

The collating function must obey the following properties for all strings A, B, and C:

If A==B then B==A.
If A==B and B==C then A==C.
If A<B THEN B>A.
If A<B and B<C then A<C.
If a collating function fails any of the above constraints and that collating function is registered and used, then the behavior of SQLite is undefined.

Do these properties have to hold for the life of the persisted .db file, or only during the process's lifetime?

Specifically, what if a collation needs to release a new version? For example, suppose Unicode defines a new code point, and the collation needs to sort it in the new-correct position for that now-defined code point. Is there way the collation can go through an upgrade process that is guaranteed not to break anything? If some users had started using the code point immediately after it was defined by the Unicode folks but before the collation upgrade was published, might the database file become unusable when the collation is upgraded?

Or, to be more concrete, suppose a collation uses Windows APIs such as CompareStringEx, which very occasionally changes (for example, on some Windows major version upgrades; whenever GetNLSVersionEx changes). Related documentation suggests re-indexing database strings to handle any changes in the new localization version. Is there a good way to do that with SQLite? If it's just an index, I suspect dropping and re-creating the index would work, but what if it's part of the table definition, such as:

CREATE TABLE UsesWindowsLocale(Name TEXT PRIMARY KEY COLLATE Unicode_Windows_en_US);

Where the collation does something like:

return CompareStringEx(
        L"en-US",
        0,
        (PCWCH)left,
        left_size / sizeof(wchar_t),
        (PCWCH)right,
        right_size / sizeof(wchar_t),
        NULL,
        NULL,
        0) - 2;

Is there a way to handle a Windows version upgrade that includes locale changes without dropping and re-creating the table?

If not, could/should there be? Perhaps some mechanism could allow telling "you had a valid collation before, and you have a new one now; please re-index" or even "here's the old and here's the new; please update the internal b-tree sorting accordingly."

Thanks,

David

(2) By Keith Medcalf (kmedcalf) on 2021-07-08 04:37:10 in reply to 1 [link] [source]

Do these properties have to hold for the life of the persisted .db file,

They have to hold for the duration of the existence of the index which uses that collation. A collation can change at any time that it is not in use. An index uses a collation sequence from the time it is created until the time it is dropped and during the interval the index exists the collation sequence is in use (whether or not anything is running) and the collation must not change.

For example, suppose Unicode defines a new code point, and the collation needs to sort it in the new-correct position for that now-defined code point. Is there way the collation can go through an upgrade process that is guaranteed not to break anything?

Yes, there is a command for that:
REINDEX

If some users had started using the code point immediately after it was defined by the Unicode folks but before the collation upgrade was published, might the database file become unusable when the collation is upgraded?

Yes.

Is there a good way to do that with SQLite?

Yes, there is a command for that.
REINDEX

Is there a way to handle a Windows version upgrade that includes locale changes without dropping and re-creating the table?

Yes, there is a command for that.
REINDEX

Perhaps some mechanism could allow telling "you had a valid collation before, and you have a new one now; please re-index" or even "here's the old and here's the new; please update the internal b-tree sorting accordingly."

There is a command to test the integrity of an index:
PRAGMA INTEGRITY_CHECK;

There is a command to rebuild defective indexes:
REINDEX

(3) By David Matson (dmatson) on 2021-07-08 23:06:19 in reply to 2 [link] [source]

Thanks, Keith! REINDEX sounds like exactly what I'm looking for here.

Just to confirm a couple of details:

There is a command to test the integrity of an index:

PRAGMA INTEGRITY_CHECK;

Compared to PRAGMA integrity_check, does PRAGMA quick_check skip checking the validity of the sort order per the collation? (Is skipping this step one of the things that makes it "quick"?)

They have to hold for the duration of the existence of the index which uses that collation.

I assume "index" here also includes "table" (if using a collation directly); is that correct?

A collation can change at any time that it is not in use.

...

Yes, there is a command for that.

REINDEX

I think this means that it's OK for a collation to change if and only if the very first thing done with the new collation is a REINDEX command - the data affected by this collation must not be queried or modified with the new collation before that command completes. I think it also means that it's OK for the collation to change completely - for example, it could even be backwards or otherwise entirely different from before - as long as the very first thing done with the DB file after loading the new collation is to issue the REINDEX command. And for the REINDEX command, the old version of the collation is not needed/completely irrelevant - it just needs to have the new version loaded. Is that understanding all correct?

Thanks,

David

(4) By Bill Wade (billwade) on 2021-07-09 13:18:25 in reply to 3 [link] [source]

I suspect that REINDEX can fail. Because of a collation change, a table that was previously valid for a UNIQUE index may have records that violate that constraint.

For a WITHOUT ROWID table, changing the collation of the primary key might involve shuffling the location on disk of almost every record. That might be much more expensive (time and space) than other REINDEX operations.

(5) By David Raymond (dvdraymond) on 2021-07-09 13:23:10 in reply to 3 [source]

Compared to PRAGMA integrity_check, does PRAGMA quick_check skip checking the validity of the sort order per the collation? (Is skipping this step one of the things that makes it "quick"?)

Yes. From the docs https://www.sqlite.org/pragma.html#pragma_quick_check

The pragma is like integrity_check except that it does not verify UNIQUE constraints and does not verify that index content matches table content. By skipping UNIQUE and index consistency checks, quick_check is able to run faster. PRAGMA quick_check runs in O(N) time whereas PRAGMA integrity_check requires O(NlogN) time where N is the total number of rows in the database. Otherwise the two pragmas are the same.

I tested it by inserting into a text field all the capitalization variants of 'ABC', then using pragma writable_schema to change it to text collate nocase. After closing and re-opening the database quick_check returned ok, but integrity_check found errors.

(6) By David Matson (dmatson) on 2021-07-09 21:27:58 in reply to 3 [link] [source]

I think it also means that it's OK for the collation to change completely - for example, it could even be backwards or otherwise entirely different from before... And for the REINDEX command, the old version of the collation is not needed/completely irrelevant - it just needs to have the new version loaded. Is that understanding all correct?

I confirmed this understanding offline.

Also, as Bill mentioned, a collation change could mean that two things that used to be distinct are now the same, no any UNIQUE constraint on a column with a collation could be violated by REINDEX, causing the REINDEX to fail.

So I think the only two points to be aware of here are:

  1. When changing the implementation of a collation, call REINDEX right after loading the new collation and before touching any data impacted by the new collation.
  2. Be aware that REINDEX with a UNIQUE constraint may fail if the collation change results in uniqueness changes.

(7) By Keith Medcalf (kmedcalf) on 2021-07-09 22:05:14 in reply to 3 [link] [source]

You will note that the REINDEX command can rebuild indexes for (1) the entire database (2) all indexes on a specific table (3) a specific index (4) all indexes that use the specified collation sequence.

https://sqlite.org/lang_reindex.html