fts5 does not treat left half ring character as token

(1) By abu.battah (abu_battah) on 2020-08-06 02:37:54

Hello, I was reading the FTS5 documentation.

It says this about the categories option of the unicode61 tokenizer:

The default value is "L* N* Co"

Note that the left half-ring character (ʿ, U+02BF) is in the Lm category, which should be covered by the default L* setting.
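One quick way to confirm the category is Python's standard unicodedata module. This is only a minimal sketch, and it assumes the half-ring in the names below is U+02BF; the helper name describe is made up for illustration.

```python
import unicodedata

# Assumption: the half-ring used in the sample names is U+02BF.
HALF_RING = "\u02bf"

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

info = describe(HALF_RING)
print(info)  # the last field is the general category, 'Lm' (Letter, modifier)
```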

However, when I run the following commands, the match fails:

SQLite version 3.32.3 2020-06-18 14:00:33
Enter ".help" for usage hints.
sqlite> CREATE TABLE "entities" (id INTEGER PRIMARY KEY, display_name TEXT);
sqlite> INSERT INTO entities (display_name) VALUES ('ʿAbd al-Muḥsin al-ʿAbbād');
sqlite> INSERT INTO entities (display_name) VALUES ('ʿAbd al-Raḥmān ʿAjjāl al-Lībī');
sqlite> CREATE VIRTUAL TABLE fts_idx USING fts5(display_name, content='entities', content_rowid='id');
sqlite> INSERT INTO fts_idx(fts_idx) VALUES('rebuild');
sqlite> SELECT * FROM fts_idx WHERE fts_idx MATCH 'Ajjal';

However, using the half-ring character in the search seems to find it:

sqlite> SELECT * FROM fts_idx WHERE fts_idx MATCH 'ʿAjjal';
ʿAbd al-Raḥmān ʿAjjāl al-Lībī
sqlite>

This seems like a bug. Am I doing something wrong?

(2) By Dan Kennedy (dan) on 2020-08-06 14:24:36 in reply to 1

If it's in category "Lm", then fts5 treats it as a token character (i.e. a letter). So the search term must include it as well.
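To see which terms actually end up in the index, one option is an fts5vocab table. The sketch below uses Python's sqlite3 module and assumes it is linked against an SQLite build with FTS5 enabled; the table name fts_vocab is arbitrary.

```python
import sqlite3

# ʿAbd al-Raḥmān ʿAjjāl al-Lībī, written with escapes to avoid encoding surprises.
NAME = "\u02bfAbd al-Ra\u1e25m\u0101n \u02bfAjj\u0101l al-L\u012bb\u012b"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, display_name TEXT)")
conn.execute("INSERT INTO entities (display_name) VALUES (?)", (NAME,))
conn.execute(
    "CREATE VIRTUAL TABLE fts_idx USING fts5("
    "display_name, content='entities', content_rowid='id')"
)
conn.execute("INSERT INTO fts_idx(fts_idx) VALUES('rebuild')")

# fts5vocab exposes the terms stored in the full-text index, one row per term.
conn.execute("CREATE VIRTUAL TABLE fts_vocab USING fts5vocab(fts_idx, 'row')")
terms = [row[0] for row in conn.execute("SELECT term FROM fts_vocab")]
print(terms)
```

The list should contain the term with the half-ring still attached and no bare "ajjal", which is why the query without the half-ring finds nothing.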

I'm not literate in whichever language this is, but at first glance that does look sub-optimal. Maybe it should be stripped out before tokenization or something. It would be good to create a tokenizer that does better. The tricky bit is where to get the data: there are thousands of characters in Unicode and we need to categorize them all. Suggestions welcome!

(3) By abu.battah (abu_battah) on 2020-08-06 14:54:18 in reply to 2

The content is a transliteration of Arabic names, so there are general conventions that are followed:

This page shows a mapping of some conventions used.

I agree that it would be great to create a tokenizer for this. How about a tokenizer that follows a "whitelist" approach, where only the listed ranges are valid token characters and everything else is ignored?

> Maybe it should be stripped out before tokenization or something

Doesn't this mean that we would have to duplicate the content into another field and build the fts5 virtual table off of that column if I wanted to retain the original symbols? This sounds like it would be more space-consuming.
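One possibility that avoids a second column is the unicode61 tokenizer's separators option, which marks extra characters as separators. Nobody in this thread has confirmed it for this data, so treat the following as a sketch; it assumes the half-ring is U+02BF and that Python's sqlite3 module has FTS5 available.

```python
import sqlite3

HALF_RING = "\u02bf"  # assumption: the half-ring in the names is U+02BF
NAMES = [
    "\u02bfAbd al-Mu\u1e25sin al-\u02bfAbb\u0101d",               # ʿAbd al-Muḥsin al-ʿAbbād
    "\u02bfAbd al-Ra\u1e25m\u0101n \u02bfAjj\u0101l al-L\u012bb\u012b",  # ʿAbd al-Raḥmān ʿAjjāl al-Lībī
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, display_name TEXT)")
conn.executemany(
    "INSERT INTO entities (display_name) VALUES (?)", [(n,) for n in NAMES]
)

# Same external-content table as before, but the half-ring is now a separator
# instead of a token character.
create_sql = f"""
    CREATE VIRTUAL TABLE fts_idx USING fts5(
        display_name, content='entities', content_rowid='id',
        tokenize = "unicode61 separators '{HALF_RING}'"
    )
"""
conn.execute(create_sql)
conn.execute("INSERT INTO fts_idx(fts_idx) VALUES('rebuild')")

rows = conn.execute(
    "SELECT display_name FROM fts_idx WHERE fts_idx MATCH ?", ("Ajjal",)
).fetchall()
print(rows)
```

The original text stays in entities untouched. One caveat: a half-ring in the middle of a word would split that word into two tokens, which may or may not be acceptable for transliterated names.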

(4) By anonymous on 2020-08-06 15:22:56 in reply to 1

I think the unicode61 tokenizer just strips marks from letters. The left half-ring character is a letter, not a mark, so it is left as part of the token. I have no idea if Unicode has it right or wrong, but it is what it is. You may want to process your special cases before feeding the text to FTS5. Do you know how this works in Solr, for example?
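The difference between a mark and a letter can be illustrated outside SQLite. The sketch below is not how unicode61 is implemented; it only approximates mark stripping with NFD normalization from Python's unicodedata module, and strip_marks is a made-up helper name.

```python
import unicodedata

def strip_marks(text):
    """Decompose to NFD, then drop combining marks (categories Mn, Mc, Me)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed if not unicodedata.category(ch).startswith("M")
    )

word = "\u02bfAjj\u0101l"  # ʿAjjāl
stripped = strip_marks(word)

# The macron on the "a" is a combining mark once decomposed, so it is removed.
# The half-ring is a standalone modifier letter (Lm), so it survives.
print(stripped)
```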

Martin

(5) By abu.battah (abu_battah) on 2021-01-26 22:14:39 in reply to 4

Sorry for the late reply. I don't know how this works in Solr, but having to strip out the special cases before feeding the text to FTS5 would require a custom function at compile time, right? Otherwise, I fear a lot of space would be wasted on duplicated data if the "sanitized" text had to be stored in another column.
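For what it's worth, application-defined SQL functions are registered on a connection at run time, not at compile time. With an external-content table it also appears possible to index the sanitized text while keeping only the original in the content table. This is a rough sketch under those assumptions; strip_lm and strip_modifier_letters are hypothetical names, and FTS5 must be available in the linked SQLite.

```python
import sqlite3
import unicodedata

NAMES = [
    "\u02bfAbd al-Mu\u1e25sin al-\u02bfAbb\u0101d",               # ʿAbd al-Muḥsin al-ʿAbbād
    "\u02bfAbd al-Ra\u1e25m\u0101n \u02bfAjj\u0101l al-L\u012bb\u012b",  # ʿAbd al-Raḥmān ʿAjjāl al-Lībī
]

def strip_modifier_letters(text):
    """Hypothetical helper: drop every character in Unicode category Lm."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Lm")

conn = sqlite3.connect(":memory:")
# Registered per connection at run time; no custom SQLite build is involved.
conn.create_function("strip_lm", 1, strip_modifier_letters)

conn.execute("CREATE TABLE entities (id INTEGER PRIMARY KEY, display_name TEXT)")
conn.executemany(
    "INSERT INTO entities (display_name) VALUES (?)", [(n,) for n in NAMES]
)
conn.execute(
    "CREATE VIRTUAL TABLE fts_idx USING fts5("
    "display_name, content='entities', content_rowid='id')"
)

# Index the sanitized text. Only the index is written here; the stored text
# remains the original value in entities.
conn.execute(
    "INSERT INTO fts_idx(rowid, display_name) "
    "SELECT id, strip_lm(display_name) FROM entities"
)

rows = conn.execute(
    "SELECT display_name FROM fts_idx WHERE fts_idx MATCH ?", ("Ajjal",)
).fetchall()
print(rows)  # values come from entities, so the half-ring is still present
```

The trade-off is that the index no longer matches the content table exactly: a 'rebuild' would re-index the unsanitized text, later deletes from the index have to supply the sanitized value, and functions such as highlight() may report shifted offsets.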