SQLite Forum

Bug: FTS5 Unicode61 Tokenizer doesn't recognize "ł" and "Ł" characters (Polish language)
The problem is in the Unicode definitions that we use to construct the tokenizer. The UnicodeData.txt entries for the upper and lower case versions of that character are:

        0141;LATIN CAPITAL LETTER L WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER L SLASH;;;0142;
        0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER L SLASH;;0141;;0141

For other such characters used by European languages, Unicode includes a mapping to the codepoints for the base character and diacritic, e.g.:

        013F;LATIN CAPITAL LETTER L WITH MIDDLE DOT;Lu;0;L;<compat> 004C 00B7;;;;N;;;;0140;
        0140;LATIN SMALL LETTER L WITH MIDDLE DOT;Ll;0;L;<compat> 006C 00B7;;;;N;;;013F;;013F

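(The base characters for these two are 004C and 006C.) The entries for 0141 and 0142 leave that decomposition field empty, so there is no mapping from "Ł" or "ł" to a base "L" or "l" for the tokenizer to use.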
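To see the difference for yourself, here is a small sketch using Python's standard `unicodedata` module, which exposes the same decomposition field from the Unicode character database. The helper name `decomposition_fields` is just for illustration; the exact data depends on the Unicode version bundled with your Python build.

```python
import unicodedata

def decomposition_fields(ch):
    """Return the decomposition mapping of ch as a list of fields.

    An empty list means the Unicode data has no decomposition for ch.
    """
    return unicodedata.decomposition(ch).split()

if __name__ == "__main__":
    # L with stroke (Polish) versus L with middle dot
    for ch in ("\u0141", "\u0142", "\u013F", "\u0140"):
        print(f"{ord(ch):04X} {unicodedata.name(ch)}: {decomposition_fields(ch)}")
```

The two "WITH STROKE" characters should come back with an empty list, while the two "WITH MIDDLE DOT" characters report a `<compat>` mapping to their base letters.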

You could create your own tokenizer:

https://sqlite.org/fts5.html#custom_tokenizers
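As a rough idea of the character folding such a tokenizer might apply, here is a sketch in Python. It is only the folding logic, not a registered FTS5 tokenizer (that needs the C API described at the link above). The `EXTRA_FOLDS` table and the `fold` function are hypothetical names, and mapping "ł" to "l" is an assumption about what you want for Polish text.

```python
import unicodedata

# Hypothetical hand-written mappings for letters that have no
# decomposition in the Unicode data (assumption: fold to plain "l").
EXTRA_FOLDS = {"\u0141": "l", "\u0142": "l"}

def fold(text):
    """Lower-case text, strip combining marks, and apply EXTRA_FOLDS."""
    out = []
    for ch in text:
        if ch in EXTRA_FOLDS:
            out.append(EXTRA_FOLDS[ch])
            continue
        # Canonical decomposition splits e.g. "ó" into "o" + combining acute
        for part in unicodedata.normalize("NFD", ch):
            if not unicodedata.combining(part):
                out.append(part.lower())
    return "".join(out)

if __name__ == "__main__":
    print(fold("Łódź"))
```

If you cannot write a tokenizer in C, one option is to apply a function like this to both the indexed text and the query terms before they reach FTS5, at the cost of storing folded text.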

Dan.