Bug: FTS5 Unicode61 Tokenizer doesn't recognize "ł" and "Ł" characters (Polish language)

(1.1) By ilya (ilya.andreyuk) on 2020-05-11 22:15:51 edited from 1.0 [link] [source]

Hello!

I'm using 3.30.1. I think it's the only letter that's not recognized correctly in polish. So for example I added "Główna" to the index, I expect it to be found by "glow*" search term.

Thank you!

(2.1) By kierownik on 2021-01-10 23:32:01 edited from 2.0 in reply to 1.1 [source]

And FTS4/FTS5 tokenize = "unicode61 remove_diacritics 1 or 2" does not help at all. It makes a horrible problem in querying.

(5) By anonymous on 2021-02-23 08:55:27 in reply to 2.1 [link] [source]

Well it's not the SQLite version. I am using 3.35 and got this from the FTS5 table: Samoczynne hamowanie poci?gu (polnisches ZugBesy)

Which should have been: Samoczynne hamowanie pociągu (SHP)

Not only the Ł and ł are effected. But also ɠ and ɠ.

(6) By anonymous on 2021-02-23 11:45:16 in reply to 2.1 [link] [source]

I did another test. There is a normal SQLite table with a lot of town names.

Here is the query: select * from LOCATION where loc_town REGEXP '[A-ZŁa-zł]';

Here is a fragment of the 6,000 rows in the result set:
52.534353      13.329043      80354118  8089118   BBEU       Berlin Beusselstraße                    berlin beußelstrasse                    80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
47.811484      7.562759       80144154  8089119   RNBG       Neuenburg Baden                         neuenburg baden                         80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
50.880221      6.085867       80151936  8089120   KXH        Herzogenrath (grenze)                   herzogenrath grenze                     80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
51.736372      14.661882      80098558  8089122   BXFO       Forst (grenze)                          forst grenze                            80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
53.415809      14.3734        80098509  8089123   WXG        Grambow (grenze)                        grambow grenze                          80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.321174      14.576404      80098533  8089124   BXF        Frankfurt Oder (grenze)                 frankfurt oder grenze                   80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
50.860058      14.22213       80098731  8089126   DXS        Schöna (grenze)                         schöna grenze                           80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
53.325755      14.414598      80098517  8089127   WXT        Tantow (grenze)                         tantow grenze                           80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
50.892428      14.829094      80098707  8089129   DXZ        Zittau (grenze)                         zittau grenze                           80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.543253      13.368173      80319988  8089131   BWED       Berlin Wedding                          berlin wedding                          80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.473101      13.455853      80354134  8089327   BSO        Berlin Sonnenallee                      berlin sonnenallee                      80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.498738      13.269867      80354142  8089328   BMS        Berlin Messe Süd Eichkamp               berlin messe süd eichkamp               80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.508123      13.259376      80030619  8089329   BHST       Berlin Heerstraße                       berlin heerstraße                       80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.511026      13.241578      80030676  8089330   BOLS       Berlin Olympiastadion                   berlin olympiastadion                   80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.510388      13.227132      80354159  8089331   BPIC       Berlin Pichelsberg                      berlin pichelsberg                      80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.410761      13.308628      80354183  8089472   BLIS       Berlin Lichterfelde Süd                 berlin lichterfelde süd                 80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.418842      13.314291      80198994  8089473   BOSS       Berlin Osdorfer Straße                  berlin osdorfer straße                  80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.479375      13.352064      80030148  8089474   BSGR       Berlin Schöneberg                       berlin schöneberg                       80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.486189      13.360927      80887562  8089537   BJLB       Berlin Julius Leber Brücke              berlin julius leber brücke              80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
48.8659        8.509827       80117192  8090001   FFDF       Ittersbach Rathaus                      ittersbach rathaus                      80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
48.608575      9.345716       80512285  8090021   TNU R      Nürtingen Roßdorf    

There are a lot of rows in this set that should not have been part of the set. So lets try a different approach with a query that looks like this:
select * from location where loc_town regexp '([A-ZŁ][a-zł])\w+';

Alas, the set still contains characters not defined by the regular expression. To avoid any discussion about Unicode and SQLite the database was defined with these settings:
-- v0.21.01 New
PRAGMA temp_store = MEMORY;
PRAGMA synchronous = false;

-- character encoding
PRAGMA encoding = 'UTF-16le';

(8.1) By Keith Medcalf (kmedcalf) on 2021-02-23 20:18:58 edited from 8.0 in reply to 6 [link] [source]

The results are correct.

The expression loc_town REGEXP '[A-ZŁa-zł]' will return true for any value of loc_town which contains the specified character in any position within it.

The expression loc_town REGEXP '([A-ZŁ][a-zł])\w+' will return true for any value of loc_town which contains the specified characters followed by one or more words.

The appropriate REGEXP pattern to match (return true) for strings composed exclusively of [A-ZŁa-zł] characters and spaces (and is at least 1 character long) would be '^[A-ZŁa-zł ]+$'.

^ means start of field and $ means end of field

(9) By Keith Medcalf (kmedcalf) on 2021-02-23 20:38:22 in reply to 8.1 [link] [source]

If you want to use generic whitespace rather than ASCII spaces, you would use '^([A-ZŁa-zł]|\s)+$'

(3) By Dan Kennedy (dan) on 2021-01-11 13:32:04 in reply to 1.1 [link] [source]

The problem is in the unicode definitions that we use to construct the tokenizer. The UnicodeData.txt entries for the upper and lower case version of that character are:

    0141;LATIN CAPITAL LETTER L WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER L SLASH;;;0142;
    0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER L SLASH;;0141;;0141

For other such characters used by European languages, unicode includes a mapping to the codepoints for the base character and diacritic. e.g.

    013F;LATIN CAPITAL LETTER L WITH MIDDLE DOT;Lu;0;L;<compat> 004C 00B7;;;;N;;;;0140;
    0140;LATIN SMALL LETTER L WITH MIDDLE DOT;Ll;0;L;<compat> 006C 00B7;;;;N;;;013F;;013F

(the base characters for these two are 004C and 006C).

You could create your own tokenizer:

https://sqlite.org/fts5.html#custom_tokenizers

Dan.

(4) By kierownik on 2021-02-20 01:35:11 in reply to 3 [link] [source]

Thank you, Dan

This must be some horrrrrrible error, a monster bug in Unicode L Ł definition. Is there any workaround for Polish letter Ł (any case) on shared servers without "own tokenizer" support?

(7) By anonymous on 2021-02-23 16:47:26 in reply to 4 [link] [source]

If you think this is a bug in the Unicode standard you can report it at https://corp.unicode.org/reporting.html