SQLite Forum

Bug: FTS5 Unicode61 Tokenizer doesn't recognize "ł" and "Ł" characters (Polish language)
I ran another test, this time against an ordinary SQLite table that holds a large number of town names.

Here is the query: select * from LOCATION where loc_town REGEXP '[A-ZŁa-zł]';

Here is a fragment of the 6,000 rows in the result set:
52.534353      13.329043      80354118  8089118   BBEU       Berlin Beusselstraße                    berlin beußelstrasse                    80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
47.811484      7.562759       80144154  8089119   RNBG       Neuenburg Baden                         neuenburg baden                         80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
50.880221      6.085867       80151936  8089120   KXH        Herzogenrath (grenze)                   herzogenrath grenze                     80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
51.736372      14.661882      80098558  8089122   BXFO       Forst (grenze)                          forst grenze                            80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
53.415809      14.3734        80098509  8089123   WXG        Grambow (grenze)                        grambow grenze                          80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.321174      14.576404      80098533  8089124   BXF        Frankfurt Oder (grenze)                 frankfurt oder grenze                   80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
50.860058      14.22213       80098731  8089126   DXS        Schöna (grenze)                         schöna grenze                           80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
53.325755      14.414598      80098517  8089127   WXT        Tantow (grenze)                         tantow grenze                           80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
50.892428      14.829094      80098707  8089129   DXZ        Zittau (grenze)                         zittau grenze                           80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.543253      13.368173      80319988  8089131   BWED       Berlin Wedding                          berlin wedding                          80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.473101      13.455853      80354134  8089327   BSO        Berlin Sonnenallee                      berlin sonnenallee                      80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.498738      13.269867      80354142  8089328   BMS        Berlin Messe Süd Eichkamp               berlin messe süd eichkamp               80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.508123      13.259376      80030619  8089329   BHST       Berlin Heerstraße                       berlin heerstraße                       80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.511026      13.241578      80030676  8089330   BOLS       Berlin Olympiastadion                   berlin olympiastadion                   80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.510388      13.227132      80354159  8089331   BPIC       Berlin Pichelsberg                      berlin pichelsberg                      80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.410761      13.308628      80354183  8089472   BLIS       Berlin Lichterfelde Süd                 berlin lichterfelde süd                 80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.418842      13.314291      80198994  8089473   BOSS       Berlin Osdorfer Straße                  berlin osdorfer straße                  80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.479375      13.352064      80030148  8089474   BSGR       Berlin Schöneberg                       berlin schöneberg                       80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
52.486189      13.360927      80887562  8089537   BJLB       Berlin Julius Leber Brücke              berlin julius leber brücke              80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
48.8659        8.509827       80117192  8090001   FFDF       Ittersbach Rathaus                      ittersbach rathaus                      80               0                    2020-02-18 14:42:30  2020-02-18 14:42:30
48.608575      9.345716       80512285  8090021   TNU R      Nürtingen Roßdorf    

Many of the rows in this result set should not be there. So let's try a different approach, with a query that looks like this:
select * from location where loc_town regexp '([A-ZŁ][a-zł])\w+';
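For anyone who wants to reproduce this, here is a minimal sketch of how the two patterns above behave when REGEXP is implemented with search semantics, meaning the pattern only has to match somewhere inside the value. That is an assumption: SQLite has no built-in REGEXP, and I have not said which implementation is loaded here. The sketch registers a hypothetical `regexp` function backed by Python's `re.search`, and the three town names (including "Łódź") are illustrative sample data, not rows from the real table.

```python
import re
import sqlite3


def regexp(pattern, value):
    # SQLite evaluates "value REGEXP pattern" by calling regexp(pattern, value).
    # re.search succeeds as soon as the pattern matches anywhere in the value.
    if value is None:
        return 0
    return int(re.search(pattern, value) is not None)


conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)
conn.execute("CREATE TABLE location (loc_town TEXT)")
conn.executemany(
    "INSERT INTO location VALUES (?)",
    [("Berlin Wedding",), ("Schöna (grenze)",), ("Łódź",)],  # sample data only
)


def towns(pattern):
    # Return the matching town names in insertion order.
    rows = conn.execute(
        "SELECT loc_town FROM location WHERE loc_town REGEXP ? ORDER BY rowid",
        (pattern,),
    )
    return [row[0] for row in rows]


# First query: one letter from the class anywhere in the name is enough.
unanchored = towns("[A-ZŁa-zł]")

# Second query: still unanchored, so a matching substring is enough.
second = towns(r"([A-ZŁ][a-zł])\w+")

# Anchored variant: every character must come from the class (plus space).
anchored = towns("^[A-ZŁa-zł ]+$")

print(unanchored)
print(second)
print(anchored)
```

Under these assumptions, a name such as "Schöna (grenze)" passes both unanchored patterns because "S" and "Sch" already satisfy them. Only the anchored variant rejects names containing characters outside the class. Whether the REGEXP in use here behaves the same way depends on the extension that provides it.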

Alas, the result set still contains rows with characters that are not in the character classes of the regular expression. To rule out any discussion about Unicode handling in SQLite, the database was created with these settings:
-- v0.21.01 New
PRAGMA temp_store = MEMORY;
PRAGMA synchronous = false;

-- character encoding
PRAGMA encoding = 'UTF-16le';
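A quick way to confirm that the encoding setting actually took effect is to read the pragma back. The sketch below is an illustration against a fresh in-memory database, not the real one. It assumes the pragma is issued before any table is created, since SQLite fixes the text encoding once the database has been created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Must run before the first table is created, otherwise it is ignored.
conn.execute("PRAGMA encoding = 'UTF-16le'")
conn.execute("CREATE TABLE location (loc_town TEXT)")

# Read the setting back from the database.
encoding = conn.execute("PRAGMA encoding").fetchone()[0]
print(encoding)
```

Running the same `PRAGMA encoding;` query against the real database file shows which encoding it was created with.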