regexp for Unicode ranges
(1) By Andrew Ziem (andrewz) on 2020-07-06 16:22:37 [link]
Is there a way with SQLite to match Unicode ranges such as Arabic, Greek, or Devanagari letters? The following code returns zero rows ``` .load /usr/lib/sqlite3/pcre.so select * from ( select 'ascii' as name union select 'γλώσσα' as name ) where name regexp '\p{Greek}'; ``` The regex `[\u0621-\u064A0-9 ]` returns the error `PCRE does not support \L, \l, \N{name}, \U, or \u (offset 3)` I am using SQLite 3.31.1 on Ubuntu 20.04 with the package sqlite3-pcre version 0~git20070120091816+4229ecc-1.
(2) By anonymous on 2020-07-06 18:11:16 in reply to 1 [link]
Which PCRRE version are using. I've no problem with unicode like straße, München, Österreich and so on.
(3.1) By Warren Young (wyoung) on 2020-07-06 18:32:48 edited from 3.0 in reply to 2
Doesn’t the “git20070120091816” bit from the initial post suggest he’s trying to use code from January 2007?
(5) By Andrew Ziem (andrewz) on 2020-07-06 19:14:33 in reply to 3.1 [link]
Related package versions are sqlite3: 3.31.1-4ubuntu0.1 sqlite3-pcre: 0~git20070120091816+4229ecc-1 libpcre3: git200701200918163.31.1-4ubuntu0.1 Yes, sqlite3-pcre has not been updated in 13 years, but I think it seems to be just a simple interface between sqlite and libpcre with libpcre doing the hard work.
(6) By Warren Young (wyoung) on 2020-07-06 20:17:14 in reply to 5 [link]
According to [the docs][1], the availability of that feature of PCRE is gated by two Booleans: > If PCRE is compiled with Unicode property support, and the PCRE_UCP option is set... [1]: https://pcre.org/original/doc/html/pcrepattern.html#SEC5
(4) By Andrew Ziem (andrewz) on 2020-07-06 19:11:58 in reply to 2 [link]
Anonymous, I have PCRE 8.39 based on this. The command `ldd /usr/lib/sqlite3/pcre.so` shows SQLite PCRE is linked against `/lib/x86_64-linux-gnu/libpcre.so.3` which comes from the package libpcre3 version 2:8.39-12build1. You wrote you don't have a problem with straße, but I think we aren't talking about the same thing. I'm trying to match Unicode ranges. I can match the character ß by writing it, but I can't match all Greek characters using `\p{Greek}` Here is another example ``` .load /usr/lib/sqlite3/pcre.so .mode csv select *, name regexp '\p{Greek}', name regexp '\p{Latin}', name regexp 'ß' from ( select 'ascii' as name union select 'γλώσσα' union select 'straße' ); ``` I get these results ``` ascii,0,1,0 "straße",0,1,1 "γλώσσα",0,1,0 ``` But I am expecting ``` ascii,0,1,0 "straße",0,1,1 "γλώσσα",1,1,0 ``` Do you get the latter?
(7) By anonymous on 2020-07-07 04:29:18 in reply to 4 [link]
Okay, I've tried it and I get the same result. sqlite> .load /usr/lib/sqlite3/pcre sqlite> .mode csv sqlite> select *, name regexp '\p{Greek}', name regexp '\p{Latin}', name regexp 'ß' ...> from ( ...> select 'ascii' as name ...> union ...> select 'γλώσσα' ...> union ...> select 'straße' ...> ); ascii,0,1,0 "straße",0,1,1 "γλώσσα",0,1,0 P.S. I use the latest version of SQLite (3.33) and I've compiled PCRE on my machine. I think that PCRE is the first suspect to analyze. So why is the unicode property causing an issue? I found this: If PCRE is built with Unicode character property support (which implies UTF support), the escape sequences \p{..}, \P{..}, and \X can be used. The available properties that can be tested are limited to the general category properties such as Lu for an upper case letter or Nd for a decimal number, the Unicode script names such as Arabic or Han, and the derived properties Any and L&. Full lists is given in the pcrepattern and pcresyntax documentation. Only the short names for properties are supported. For example, \p{L} matches a letter. Its Perl synonym, \p{Letter}, is not supported. Furthermore, in Perl, many properties may optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE does not support this.
(8) By anonymous on 2020-07-07 09:00:56 in reply to 4 [link]
I pondered for a while… Is μ or Ω really a Greek letter? Wait a second before you object. If I try the following: sqlite> select 'Ω' regexp '\p{Latin}'; 1 sqlite> So the PCRE library considers this to be a letter not belonging to the Greek alphabet. Or is something else going on? The syntax is \p{unicode property or \p{unicode script} The property option I normally use for, for example, searching for a symbol like \p{S} or \p{Sm} if you only want to test for mathematical symbols. So let's try this: sqlite> select 'Ω' regexp '\p{S}'; 1 sqlite> I was lucky. Because all other Greek letters are probably not a symbol. sqlite> select '≤' regexp '\p{S}'; 1 sqlite> Which is indeed a symbol. I've never used or seen a colleague using /p{Greek} or /p{Latin}. What I've seen is using /p to do some freaky complex searching like, for example, \p{lu} search for an uppercase letter that has a lower case variant.