Searching in text columns with Unicode folding

(1) By Wolfgang Oertl (w.oertl) on 2021-07-27 22:44:32 [link] [source]

I'm still struggling to achieve Unicode aware, case insensitive LIKE searching using an index. This isn't exotic and should be easy! Think about names containing letters with diaeresis in languages like Spanish, German, French, Turkish etc. The FAQ entry isn't quite satisfying.

A COLLATE NOCASE index will only handle ASCII. What I really would like to see is a COLLATE UNICODE (or something similar) that will simply do the right thing. Is there any chance of this happening (in the foreseeable future)? The necessary machinery seems to be already available in the FTS5 extension in the unicode61 tokenizer, see below.

The icu extension doesn't really help, because it replaces the SQL like() function which automatically disables the use of any index for the LIKE operator, regardless of the collation, so this will slow down with large data sets. As it affects every LIKE operator, not just specific ones, it might have unwanted side effects.

The FTS5 extension does Unicode folding. So I can create an FTS5 virtual table and use that for searching. I'm not sure about the overhead (less with the options detail=column and content/content_rowid), but at least it is usable right now. It makes the queries a little more complicated, though, as both the original table and the virtual table are involved.

Another option I see is to create "shadow" columns for the text fields in the same table and automatically fill them by generating and indexing it like this:

CREATE TABLE t1 (
  name TEXT NOT NULL,
  name2 TEXT NOT NULL AS (fold_unicode(name)) STORED
);
CREATE INDEX t1_name2 ON t1(name2 COLLATE NOCASE);
SELECT rowid, name FROM t1 WHERE name2 LIKE ?;

The missing piece is the folding function, which exists buried somewhere in the FTS5 extension. Unfortunately, it is not exposed as an SQL function. How about that? I could implement it myself, but it won't be as complete and the database won't be usable without loading the custom extension.

Are there any (simpler) other options I have overlooked?

(2) By Chris Brody (chrisbrody) on 2021-07-27 23:11:36 in reply to 1 [source]

FYI I made a very simple library for a good customer to add EU accent-enabled UPPER and LOWER functions: https://github.com/brodybits/sqlite3-eu

It is possible to use the library to either add separate EU-enabled UPPER & LOWER functions, like UPPER_EU & LOWER_EU, or overwrite the existing UPPER & LOWER functions.

For the sake of Unicode folding beginners like myself, you should be able to do the case-insensitive LIKE condition in a statement like this:

SELECT rowid, name FROM t1 WHERE LOWER_EU(name2) LIKE LOWER_EU(?)

(3) By Wolfgang Oertl (w.oertl) on 2021-07-28 16:15:32 in reply to 2 [link] [source]

Thanks for sharing. Unfortunately this query won't use an index, but perform a full table scan.