Bug: FTS5 Unicode61 Tokenizer doesn't recognize "ł" and "Ł" characters (Polish language)
(1.1) By ilya (ilya.andreyuk) on 2020-05-11 22:15:51 edited from 1.0 [link] [source]
Hello!
I'm using 3.30.1. I think it's the only letter that's not recognized correctly in polish. So for example I added "Główna" to the index, I expect it to be found by "glow*" search term.
Thank you!
(2.1) By kierownik on 2021-01-10 23:32:01 edited from 2.0 in reply to 1.1 [link] [source]
And FTS4/FTS5 tokenize = "unicode61 remove_diacritics 1 or 2" does not help at all. It makes a horrible problem in querying.
(5) By anonymous on 2021-02-23 08:55:27 in reply to 2.1 [link] [source]
Well it's not the SQLite version. I am using 3.35 and got this from the FTS5 table: Samoczynne hamowanie poci?gu (polnisches ZugBesy) Which should have been: Samoczynne hamowanie pociągu (SHP) Not only the Ł and ł are effected. But also ɠ and ɠ.
(6) By anonymous on 2021-02-23 11:45:16 in reply to 2.1 [source]
I did another test. There is a normal SQLite table with a lot of town names. Here is the query: select * from LOCATION where loc_town REGEXP '[A-ZŁa-zł]'; Here is a fragment of the 6,000 rows in the result set: 52.534353 13.329043 80354118 8089118 BBEU Berlin Beusselstraße berlin beußelstrasse 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 47.811484 7.562759 80144154 8089119 RNBG Neuenburg Baden neuenburg baden 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 50.880221 6.085867 80151936 8089120 KXH Herzogenrath (grenze) herzogenrath grenze 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 51.736372 14.661882 80098558 8089122 BXFO Forst (grenze) forst grenze 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 53.415809 14.3734 80098509 8089123 WXG Grambow (grenze) grambow grenze 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 52.321174 14.576404 80098533 8089124 BXF Frankfurt Oder (grenze) frankfurt oder grenze 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 50.860058 14.22213 80098731 8089126 DXS Schöna (grenze) schöna grenze 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 53.325755 14.414598 80098517 8089127 WXT Tantow (grenze) tantow grenze 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 50.892428 14.829094 80098707 8089129 DXZ Zittau (grenze) zittau grenze 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 52.543253 13.368173 80319988 8089131 BWED Berlin Wedding berlin wedding 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 52.473101 13.455853 80354134 8089327 BSO Berlin Sonnenallee berlin sonnenallee 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 52.498738 13.269867 80354142 8089328 BMS Berlin Messe Süd Eichkamp berlin messe süd eichkamp 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 52.508123 13.259376 80030619 8089329 BHST Berlin Heerstraße berlin heerstraße 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 52.511026 13.241578 80030676 8089330 BOLS Berlin Olympiastadion berlin olympiastadion 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 52.510388 13.227132 80354159 8089331 BPIC Berlin Pichelsberg berlin pichelsberg 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 52.410761 13.308628 80354183 8089472 BLIS Berlin Lichterfelde Süd berlin lichterfelde süd 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 52.418842 13.314291 80198994 8089473 BOSS Berlin Osdorfer Straße berlin osdorfer straße 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 52.479375 13.352064 80030148 8089474 BSGR Berlin Schöneberg berlin schöneberg 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 52.486189 13.360927 80887562 8089537 BJLB Berlin Julius Leber Brücke berlin julius leber brücke 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 48.8659 8.509827 80117192 8090001 FFDF Ittersbach Rathaus ittersbach rathaus 80 0 2020-02-18 14:42:30 2020-02-18 14:42:30 48.608575 9.345716 80512285 8090021 TNU R Nürtingen Roßdorf There are a lot of rows in this set that should not have been part of the set. So lets try a different approach with a query that looks like this: select * from location where loc_town regexp '([A-ZŁ][a-zł])\w+'; Alas, the set still contains characters not defined by the regular expression. To avoid any discussion about Unicode and SQLite the database was defined with these settings: -- v0.21.01 New PRAGMA temp_store = MEMORY; PRAGMA synchronous = false; -- character encoding PRAGMA encoding = 'UTF-16le';
(8.1) By Keith Medcalf (kmedcalf) on 2021-02-23 20:18:58 edited from 8.0 in reply to 6 [link] [source]
The results are correct.
The expression loc_town REGEXP '[A-ZŁa-zł]'
will return true for any value of loc_town
which contains the specified character in any position within it.
The expression loc_town REGEXP '([A-ZŁ][a-zł])\w+'
will return true for any value of loc_town
which contains the specified characters followed by one or more words.
The appropriate REGEXP pattern to match (return true) for strings composed exclusively of [A-ZŁa-zł] characters and spaces (and is at least 1 character long) would be '^[A-ZŁa-zł ]+$'.
^ means start of field and $ means end of field
(9) By Keith Medcalf (kmedcalf) on 2021-02-23 20:38:22 in reply to 8.1 [link] [source]
If you want to use generic whitespace rather than ASCII spaces, you would use '^([A-ZŁa-zł]|\s)+$'
(3) By Dan Kennedy (dan) on 2021-01-11 13:32:04 in reply to 1.1 [link] [source]
The problem is in the unicode definitions that we use to construct the tokenizer. The UnicodeData.txt entries for the upper and lower case version of that character are:
0141;LATIN CAPITAL LETTER L WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER L SLASH;;;0142;
0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER L SLASH;;0141;;0141
For other such characters used by European languages, unicode includes a mapping to the codepoints for the base character and diacritic. e.g.
013F;LATIN CAPITAL LETTER L WITH MIDDLE DOT;Lu;0;L;<compat> 004C 00B7;;;;N;;;;0140;
0140;LATIN SMALL LETTER L WITH MIDDLE DOT;Ll;0;L;<compat> 006C 00B7;;;;N;;;013F;;013F
(the base characters for these two are 004C and 006C).
You could create your own tokenizer:
https://sqlite.org/fts5.html#custom_tokenizers
Dan.
(4) By kierownik on 2021-02-20 01:35:11 in reply to 3 [link] [source]
Thank you, Dan
This must be some horrrrrrible error, a monster bug in Unicode L Ł definition. Is there any workaround for Polish letter Ł (any case) on shared servers without "own tokenizer" support?
(7) By anonymous on 2021-02-23 16:47:26 in reply to 4 [link] [source]
If you think this is a bug in the Unicode standard you can report it at https://corp.unicode.org/reporting.html