Bug: FTS5 Unicode61 Tokenizer doesn't recognize "ł" and "Ł" characters (Polish language)

(1.1) By ilya (ilya.andreyuk) on 2020-05-11 22:15:51 edited from 1.0 [link]

Hello!

I'm using 3.30.1. I think "ł"/"Ł" is the only Polish letter that isn't recognized correctly. For example, I added "Główna" to the index, and I expect it to be found by the search term "glow*".

Thank you!

(2.1) By kierownik on 2021-01-10 23:32:01 edited from 2.0 in reply to 1.1 [link]

And setting tokenize = "unicode61 remove_diacritics 1" (or 2) in FTS4/FTS5 does not help at all. This causes serious problems when querying.

(3) By Dan Kennedy (dan) on 2021-01-11 13:32:04 in reply to 1.1

The problem is in the Unicode definitions that we use to construct the tokenizer. The UnicodeData.txt entries for the upper and lower case versions of that character are:

        0141;LATIN CAPITAL LETTER L WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER L SLASH;;;0142;
        0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER L SLASH;;0141;;0141

For other such characters used by European languages, Unicode includes a mapping to the codepoints for the base character and diacritic. For example:

        013F;LATIN CAPITAL LETTER L WITH MIDDLE DOT;Lu;0;L;<compat> 004C 00B7;;;;N;;;;0140;
        0140;LATIN SMALL LETTER L WITH MIDDLE DOT;Ll;0;L;<compat> 006C 00B7;;;;N;;;013F;;013F

(the base characters for these two are 004C and 006C).
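The difference between the two pairs of entries can be checked from a script. The following is a minimal sketch using Python's standard `unicodedata` module, which exposes the decomposition field of UnicodeData.txt; the helper name `describe` is made up for this illustration and has nothing to do with how SQLite builds its tokenizer tables.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the decomposition field recorded for ch ('' if there is none)."""
    return unicodedata.decomposition(ch)

# Ł/ł have an empty decomposition field, while Ŀ/ŀ and ó map to a base letter
# plus a second codepoint.
for ch in "ŁłĿŀó":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {describe(ch)!r}")
```

Because the field is empty for U+0141 and U+0142, any diacritic removal that relies only on these mappings has no base letter to fall back to.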

You could create your own tokenizer:

[https://sqlite.org/fts5.html#custom_tokenizers](https://sqlite.org/fts5.html#custom_tokenizers)

Dan.

(4) By kierownik on 2021-02-20 01:35:11 in reply to 3 [link]

Thank you, Dan

This must be some horrible error, a monster bug in the Unicode definition of L/Ł.
Is there any workaround for the Polish letter Ł (in either case) on shared servers without "own tokenizer" support?
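One possible workaround that needs no custom tokenizer is to fold the text in the application before it reaches SQLite, and to fold every search term the same way. The sketch below is only an illustration under that assumption: the names `EXTRA_FOLDS`, `fold`, `search_demo` and the table `docs` are hypothetical, and the FTS5 part requires an SQLite build that includes FTS5.

```python
import sqlite3
import unicodedata

# Letters with no Unicode decomposition must be mapped by hand.
# This table is an assumption for illustration; extend it as needed.
EXTRA_FOLDS = str.maketrans({"ł": "l", "Ł": "L"})

def fold(text: str) -> str:
    """Strip diacritics: hand-map ł/Ł, then drop combining marks after NFD."""
    decomposed = unicodedata.normalize("NFD", text.translate(EXTRA_FOLDS))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search_demo() -> list:
    """Index folded text next to the original (needs an FTS5-enabled SQLite)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE VIRTUAL TABLE docs USING fts5(folded, original UNINDEXED)")
    for name in ["Główna", "Łódź"]:
        con.execute("INSERT INTO docs VALUES (?, ?)", (fold(name), name))
    # Fold the search term the same way before adding the prefix operator.
    term = fold("głów") + "*"
    rows = con.execute("SELECT original FROM docs WHERE docs MATCH ?", (term,))
    return [r[0] for r in rows]
```

The original spelling is kept in an unindexed column so it can still be displayed, while matching happens on the folded column only. The cost is that every writer and every query has to go through `fold`.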

(5) By anonymous on 2021-02-23 08:55:27 in reply to 2.1 [link]

Well it's not the SQLite version. I am using 3.35 and got this from the FTS5 table: Samoczynne hamowanie poci?gu (polnisches ZugBesy)

Which should have been: Samoczynne hamowanie pociągu (SHP)

Not only Ł and ł are affected, but also ɠ and ɠ.

(6) By anonymous on 2021-02-23 11:45:16 in reply to 2.1 [link]

I did another test. There is a normal SQLite table with a lot of town names.

Here is the query: select * from LOCATION where loc_town REGEXP '[A-ZŁa-zł]';

Here is a fragment of the 6,000 rows in the result set:
52.534353      13.329043      80354118  8089118   BBEU       Berlin Beusselstraße                    berlin beußelstrasse                    80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
47.811484      7.562759       80144154  8089119   RNBG       Neuenburg Baden                         neuenburg baden                         80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
50.880221      6.085867       80151936  8089120   KXH        Herzogenrath (grenze)                   herzogenrath grenze                     80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
51.736372      14.661882      80098558  8089122   BXFO       Forst (grenze)                          forst grenze                            80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
53.415809      14.3734        80098509  8089123   WXG        Grambow (grenze)                        grambow grenze                          80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.321174      14.576404      80098533  8089124   BXF        Frankfurt Oder (grenze)                 frankfurt oder grenze                   80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
50.860058      14.22213       80098731  8089126   DXS        Schöna (grenze)                         schöna grenze                           80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
53.325755      14.414598      80098517  8089127   WXT        Tantow (grenze)                         tantow grenze                           80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
50.892428      14.829094      80098707  8089129   DXZ        Zittau (grenze)                         zittau grenze                           80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.543253      13.368173      80319988  8089131   BWED       Berlin Wedding                          berlin wedding                          80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.473101      13.455853      80354134  8089327   BSO        Berlin Sonnenallee                      berlin sonnenallee                      80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.498738      13.269867      80354142  8089328   BMS        Berlin Messe Süd Eichkamp               berlin messe süd eichkamp               80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.508123      13.259376      80030619  8089329   BHST       Berlin Heerstraße                       berlin heerstraße                       80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.511026      13.241578      80030676  8089330   BOLS       Berlin Olympiastadion                   berlin olympiastadion                   80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.510388      13.227132      80354159  8089331   BPIC       Berlin Pichelsberg                      berlin pichelsberg                      80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.410761      13.308628      80354183  8089472   BLIS       Berlin Lichterfelde Süd                 berlin lichterfelde süd                 80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.418842      13.314291      80198994  8089473   BOSS       Berlin Osdorfer Straße                  berlin osdorfer straße                  80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.479375      13.352064      80030148  8089474   BSGR       Berlin Schöneberg                       berlin schöneberg                       80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.486189      13.360927      80887562  8089537   BJLB       Berlin Julius Leber Brücke              berlin julius leber brücke              80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
48.8659        8.509827       80117192  8090001   FFDF       Ittersbach Rathaus                      ittersbach rathaus                      80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
48.608575      9.345716       80512285  8090021   TNU R      Nürtingen Roßdorf    

There are a lot of rows in this set that should not have been part of the set. So let's try a different approach with a query that looks like this:
select * from location where loc_town regexp '([A-ZŁ][a-zł])\w+';

Alas, the set still contains characters not defined by the regular expression. To avoid any discussion about Unicode and SQLite, the database was defined with these settings:
-- v0.21.01 New
PRAGMA temp_store = MEMORY;
PRAGMA synchronous = false;

-- character encoding
PRAGMA encoding = 'UTF-16le';

(7) By anonymous on 2021-02-23 16:47:26 in reply to 4 [link]

If you think this is a bug in the Unicode standard you can report it at https://corp.unicode.org/reporting.html

(8.1) By Keith Medcalf (kmedcalf) on 2021-02-23 20:18:58 edited from 8.0 in reply to 6 [link]

The results are correct.

The expression `loc_town REGEXP '[A-ZŁa-zł]'` will return *true* for any value of `loc_town` which contains at least one of the specified characters in any position within it.

The expression `loc_town REGEXP '([A-ZŁ][a-zł])\w+'` will return *true* for any value of `loc_town` which contains, in any position, a character from `[A-ZŁ]` followed by a character from `[a-zł]` followed by one or more word characters (`\w+`).

The appropriate REGEXP pattern to match (return *true*) for strings composed exclusively of `[A-ZŁa-zł]` characters and spaces (and at least 1 character long) would be `'^[A-ZŁa-zł ]+$'`.

`^` means *start of field* and `$` means *end of field*.

(9) By Keith Medcalf (kmedcalf) on 2021-02-23 20:38:22 in reply to 8.1 [link]

If you want to use *generic whitespace* rather than ASCII spaces, you would use `'^([A-ZŁa-zł]|\s)+$'`
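The difference between the unanchored and the anchored patterns can be tried outside the database. This is a small sketch using Python's `re` module as a stand-in. SQLite's `REGEXP` depends on whichever regexp function is loaded, so details such as what `\w` and `\s` cover may differ. The sample values are illustrative only.

```python
import re

towns = ["Berlin Wedding", "Schöna (grenze)", "Łódź", "Wrocław", "Berlin\tWedding"]

unanchored = re.compile(r"[A-ZŁa-zł]")           # one listed character anywhere
spaces_only = re.compile(r"^[A-ZŁa-zł ]+$")      # whole value, ASCII space allowed
generic_ws = re.compile(r"^([A-ZŁa-zł]|\s)+$")   # whole value, any whitespace allowed

for town in towns:
    # search() scans the whole string, like REGEXP with an unanchored pattern.
    print(
        repr(town),
        bool(unanchored.search(town)),
        bool(spaces_only.search(town)),
        bool(generic_ws.search(town)),
    )
```

Every sample value passes the unanchored pattern because each contains at least one listed letter. Only values made up entirely of the listed letters and whitespace pass the anchored ones.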