How to avoid 'row X missing from index' when using a custom collation?

(1) By BohwaZ (bohwaz) on 2021-06-01 19:22:58 [link] [source]

My app is used in France, and so we are using accents in French. My users want to be able to order a list of names without using the accents or the case, so that this list of users:

Mederic B
Méderic A
MÉdÉric C

is sorted like that:

Méderic A
Mederic B
MÉdÉric C

and when you are searching for mederic it will return all of the 3 entries.

My first solution was to use a custom function, eg. transliterate_to_ascii:

SELECT * FROM names WHERE transliterate_to_ascii(name) LIKE transliterate_to_ascii('%médéric%') ORDER BY transliterate_to_ascii(name);

But obviously this doesn't scale very well as that function would be called a lot, as I can't use indexes.

So my second solution was to use a custom collation.

But because my users must be able to download the database and explore it with any third-party SQLite tool, I couldn't use a custom-named collation, eg. french_unicode, and I used the "trick" of overriding the default NOCASE collation:

$db->createCollation('NOCASE', 'unicode_collation');

So now I can create indexes:

CREATE INDEX names_name ON names (names COLLATE NOCASE);

All went fine: I can give users of my app the ability to search regardless of accents and case, and I can have indexes so performance is good, and they can also use the database with third-party tools.

But when I'm doing a PRAGMA integrity_check; from any third-party app, I get error messages:

row 4 missing from index names_name
row 6 missing from index names_name
row 7 missing from index names_name

and so on. This does not happen when using my own app with the correct collation.

Not sure why this is happening, but most importantly is there a better way to do a "portable" database using custom collations that also handles indexes?

Thanks.

(2) By Keith Medcalf (kmedcalf) on 2021-06-01 19:47:06 in reply to 1 [link] [source]

Integrity check works (for this purpose) by scanning the table and then doing a B-Tree traversal of the index to find the entry for that row. If the row is not found by the time the traversal is complete, then you get an error message that the row is missing from the index (it was not found where it was supposed to be -- it may indeed be in the index, just not where it was expected to be found though this is not checked).

This is because the "collation function" is used to traverse the BTree. That is, assuming a very simple B-Tree, the "collation function" is used to determine if the sought key is "greater than" or "less than" the value stored at the current node and thus whether you "go left" or "go right" or the "current node" is the one sought. If the "collation function" indicates that the "current node" is supposed to be the one sought, then the key and the rowid are checked against the key and rowid in the original table. If these do not match, then an error message is thrown.

If the function used to do the comparison when "searching" the B-Tree does not provide consistent results with the function used to construct the B-Tree in the first place, then you will go right when you should go left and will fail to find the entry sought and an error message will be produced when you eventually run out of nodes without finding the one for which you are looking.

There is no solution to this problem other than to make the necessary collation sequence consistently available when accessing the B-Tree.

(5) By BohwaZ (bohwaz) on 2021-06-01 23:02:34 in reply to 2 [link] [source]

Thanks for the details, that explains a lot and makes a lot of sense :)

(3) By Keith Medcalf (kmedcalf) on 2021-06-01 19:56:12 in reply to 1 [link] [source]

My first solution was to use a custom function, eg. transliterate_to_ascii:

SELECT * FROM names WHERE transliterate_to_ascii(name) LIKE transliterate_to_ascii('%médéric%') ORDER BY transliterate_to_ascii(name);

But obviously this doesn't scale very well as that function would be called a lot, as I can't use indexes.

Well, actually you can. You can create an index on the function:

create index i0 on names(transliterate_to_ascii(name));

SQLite3 libraries compiled with SQLITE_ENABLE_UNKNOWN_SQL_FUNCTION=1 will be able to open and read the database. However, they will not be allowed to update it because that requires that the unknown SQL function be actually present.

(4) By Keith Medcalf (kmedcalf) on 2021-06-01 20:20:10 in reply to 3 [link] [source]

It should, of course, be pointed out that your query can use the index for the purpose of the order by only. The LIKE expression is a substring search so a full table/index scan is required (it begins with a wildcard).

The index could only be used if you changed the operation of LIKE to be case sensitive OR declared the index as case insensitive and repaired your query:

create index i0 on names(transliterate_to_ascii(name) collate nocase);

SELECT * FROM names WHERE transliterate_to_ascii(name) LIKE transliterate_to_ascii('médéric%') ORDER BY transliterate_to_ascii(name) collate nocase;

(6) By BohwaZ (bohwaz) on 2021-06-01 23:22:26 in reply to 4 [link] [source]

Yes of course my example was bad, the index can only be used in the order in that query, but that's the main use anyhow.

The idea of using a function in the index is nice, but I'm wondering what's the "least bad" thing between using the custom NOCASE collation and the function index:

NOCASE custom collation: integrity_check is broken, but database can be read in all instances, the index ordering actually works. BUT if you insert data to tables, you cannot match indexed columns using the custom collation (eg. INSERT INTO names (name) VALUES ('Élodie TEST'); SELECT * FROM names WHERE name = 'Élodie TEST'; will return nothing but SELECT * FROM names WHERE name = 'Élodie TEST' COLLATE BINARY; will work).
function index: integrity_check works, but the database may not be read in some instances, and data cannot be inserted unless indexes are dropped.

Thanks for your help :)

(8) By BohwaZ (bohwaz) on 2022-01-11 18:57:49 in reply to 6 [link] [source]

Thanks for the help, I finally found the issue, I could not understand how the ICU collator of PHP could return something that is no coherent.

And finally I found it: users had imported CSV files in the database, and these had non-UTF-8 characters, eg. Window-1252 and so on.

Because this was not valid UTF-8, the ICU collator did return false, and as a result the custom collation function returned incoherent results.

Fixing it with PHP was easy:

$str = !preg_match('//u', $str) ? utf8_encode($str) : $str;

on both strings before doing the collator comparison.

Thanks everyone :)

(9) By BohwaZ (bohwaz) on 2022-01-13 00:04:09 in reply to 8 [link] [source]

Also another advice: don't override a default collation (like NOCASE), as the result may lead to index corruption, even if you think that your custom collation is always available.

(7) By Simon Slavin (slavin) on 2021-06-01 23:35:23 in reply to 1 [source]

Add to your existing technique the. idea that each name should be stored twice: once with accents as input, and once in normalised form, in this case, transformed to ASCII and uppercased. Then you can do searching and sorting on the normalised column.