SQLite Forum

regexp for Unicode ranges
Login

regexp for Unicode ranges

(1) By Andrew Ziem (andrewz) on 2020-07-06 16:22:37 [link] [source]

Is there a way with SQLite to match Unicode ranges such as Arabic, Greek, or Devanagari letters?

The following code returns zero rows

.load /usr/lib/sqlite3/pcre.so

select *
from (
select 'ascii' as name
union
select 'γλώσσα' as name
)
where name regexp '\p{Greek}';

The regex [\u0621-\u064A0-9 ] returns the error PCRE does not support \L, \l, \N{name}, \U, or \u (offset 3)

I am using SQLite 3.31.1 on Ubuntu 20.04 with the package sqlite3-pcre version 0~git20070120091816+4229ecc-1.

(2) By anonymous on 2020-07-06 18:11:16 in reply to 1 [link] [source]

Which PCRRE version are using. I've no problem with unicode like straße, München, Österreich and so on.

(3.1) By Warren Young (wyoung) on 2020-07-06 18:32:48 edited from 3.0 in reply to 2 [link] [source]

Doesn’t the “git20070120091816” bit from the initial post suggest he’s trying to use code from January 2007?

(4) By Andrew Ziem (andrewz) on 2020-07-06 19:11:58 in reply to 2 [link] [source]

Anonymous,

I have PCRE 8.39 based on this. The command ldd /usr/lib/sqlite3/pcre.so shows SQLite PCRE is linked against /lib/x86_64-linux-gnu/libpcre.so.3 which comes from the package libpcre3 version 2:8.39-12build1.

You wrote you don't have a problem with straße, but I think we aren't talking about the same thing. I'm trying to match Unicode ranges. I can match the character ß by writing it, but I can't match all Greek characters using \p{Greek}

Here is another example

.load /usr/lib/sqlite3/pcre.so

.mode csv
select *, name regexp '\p{Greek}', name regexp '\p{Latin}', name regexp 'ß'
from (
select 'ascii' as name
union
select 'γλώσσα'
union
select 'straße'
);

I get these results

ascii,0,1,0
"straße",0,1,1
"γλώσσα",0,1,0

But I am expecting

ascii,0,1,0
"straße",0,1,1
"γλώσσα",1,1,0

Do you get the latter?

(5) By Andrew Ziem (andrewz) on 2020-07-06 19:14:33 in reply to 3.1 [source]

Related package versions are

sqlite3: 3.31.1-4ubuntu0.1 sqlite3-pcre: 0~git20070120091816+4229ecc-1 libpcre3: git200701200918163.31.1-4ubuntu0.1

Yes, sqlite3-pcre has not been updated in 13 years, but I think it seems to be just a simple interface between sqlite and libpcre with libpcre doing the hard work.

(6) By Warren Young (wyoung) on 2020-07-06 20:17:14 in reply to 5 [link] [source]

According to the docs, the availability of that feature of PCRE is gated by two Booleans:

If PCRE is compiled with Unicode property support, and the PCRE_UCP option is set...

(7) By anonymous on 2020-07-07 04:29:18 in reply to 4 [link] [source]

Okay, I've tried it and I get the same result.

sqlite> .load /usr/lib/sqlite3/pcre
sqlite> .mode csv
sqlite> select *, name regexp '\p{Greek}', name regexp '\p{Latin}', name regexp 'ß'
   ...> from (
   ...> select 'ascii' as name
   ...> union
   ...> select 'γλώσσα'
   ...> union
   ...> select 'straße'
   ...> );
ascii,0,1,0
"straße",0,1,1
"γλώσσα",0,1,0

P.S. I use the latest version of SQLite (3.33) and I've compiled PCRE on my machine.
I think that PCRE is the first suspect to analyze.

So why is the unicode property causing an issue? I found this:
If PCRE is built with Unicode character property support (which implies UTF support), the escape sequences \p{..}, \P{..}, and \X can be used. The available properties that can be tested are limited to the general category properties such as Lu for an upper case letter or Nd for a decimal number, the Unicode script names such as Arabic or Han, and the derived properties Any and L&. Full lists is given in the pcrepattern and pcresyntax documentation. Only the short names for properties are supported. For example, \p{L} matches a letter. Its Perl synonym, \p{Letter}, is not supported. Furthermore, in Perl, many properties may optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE does not support this.

(8) By anonymous on 2020-07-07 09:00:56 in reply to 4 [link] [source]

I pondered for a while… Is μ or Ω really a Greek letter? Wait a second before you object.
If I try the following:
sqlite> select 'Ω' regexp '\p{Latin}';
1
sqlite> 

So the PCRE library considers this to be a letter not belonging to the Greek alphabet.
Or is something else going on?
The syntax is \p{unicode property or \p{unicode script}

The property option I normally use for, for example, searching for a symbol like
\p{S} or \p{Sm} if you only want to test for mathematical symbols.

So let's try this:
 sqlite> select 'Ω' regexp '\p{S}';
1
sqlite> 
I was lucky. Because all other Greek letters are probably not a symbol.
sqlite> select '≤' regexp '\p{S}';
1
sqlite> 
Which is indeed a symbol.
I've never used or seen a colleague using /p{Greek} or /p{Latin}. What I've seen is using /p to do some freaky complex searching like, for example, \p{lu} search for an uppercase letter that has a lower case variant.