SQLite Forum

regexp for Unicode ranges
Login

regexp for Unicode ranges

(1) By Andrew Ziem (andrewz) on 2020-07-06 16:22:37 [link]

Is there a way with SQLite to match Unicode ranges such as Arabic, Greek, or Devanagari letters?


The following code returns zero rows

```
.load /usr/lib/sqlite3/pcre.so

select *
from (
select 'ascii' as name
union
select 'γλώσσα' as name
)
where name regexp '\p{Greek}';
```

The regex `[\u0621-\u064A0-9 ]` returns the error
`PCRE does not support \L, \l, \N{name}, \U, or \u (offset 3)`



I am using SQLite 3.31.1 on Ubuntu 20.04 with the package sqlite3-pcre version 0~git20070120091816+4229ecc-1.

(2) By anonymous on 2020-07-06 18:11:16 in reply to 1 [link]

Which PCRRE version are using. I've no problem with unicode like straße, München, Österreich and so on.

(3.1) By Warren Young (wyoung) on 2020-07-06 18:32:48 edited from 3.0 in reply to 2 [link]

Doesn’t the “git20070120091816” bit from the initial post suggest he’s trying to use code from January 2007?

(4) By Andrew Ziem (andrewz) on 2020-07-06 19:11:58 in reply to 2 [link]

Anonymous,

I have PCRE 8.39 based on this. The command `ldd /usr/lib/sqlite3/pcre.so` shows SQLite PCRE is linked against `/lib/x86_64-linux-gnu/libpcre.so.3` which comes from the package libpcre3 version 2:8.39-12build1.

You wrote you don't have a problem with straße, but I think we aren't talking about the same thing. I'm trying to match Unicode ranges. I can match the character ß by writing it, but I can't match all Greek characters using `\p{Greek}`

Here is another example

```
.load /usr/lib/sqlite3/pcre.so

.mode csv
select *, name regexp '\p{Greek}', name regexp '\p{Latin}', name regexp 'ß'
from (
select 'ascii' as name
union
select 'γλώσσα'
union
select 'straße'
);
```

I get these results

```
ascii,0,1,0
"straße",0,1,1
"γλώσσα",0,1,0
```

But I am expecting


```
ascii,0,1,0
"straße",0,1,1
"γλώσσα",1,1,0
```

Do you get the latter?

(5) By Andrew Ziem (andrewz) on 2020-07-06 19:14:33 in reply to 3.1 [link]

Related package versions are

sqlite3: 3.31.1-4ubuntu0.1
sqlite3-pcre: 0~git20070120091816+4229ecc-1
libpcre3: git200701200918163.31.1-4ubuntu0.1

Yes, sqlite3-pcre has not been updated in 13 years, but I think it seems to be just a simple interface between sqlite and libpcre with libpcre doing the hard work.

(6) By Warren Young (wyoung) on 2020-07-06 20:17:14 in reply to 5 [link]

According to [the docs][1], the availability of that feature of PCRE is gated by two Booleans:

> If PCRE is compiled with Unicode property support, and the PCRE_UCP option is set...


[1]: https://pcre.org/original/doc/html/pcrepattern.html#SEC5

(7) By anonymous on 2020-07-07 04:29:18 in reply to 4

Okay, I've tried it and I get the same result.

sqlite> .load /usr/lib/sqlite3/pcre
sqlite> .mode csv
sqlite> select *, name regexp '\p{Greek}', name regexp '\p{Latin}', name regexp 'ß'
   ...> from (
   ...> select 'ascii' as name
   ...> union
   ...> select 'γλώσσα'
   ...> union
   ...> select 'straße'
   ...> );
ascii,0,1,0
"straße",0,1,1
"γλώσσα",0,1,0

P.S. I use the latest version of SQLite (3.33) and I've compiled PCRE on my machine.
I think that PCRE is the first suspect to analyze.

So why is the unicode property causing an issue? I found this:
If PCRE is built with Unicode character property support (which implies UTF support), the escape sequences \p{..}, \P{..}, and \X can be used. The available properties that can be tested are limited to the general category properties such as Lu for an upper case letter or Nd for a decimal number, the Unicode script names such as Arabic or Han, and the derived properties Any and L&. Full lists is given in the pcrepattern and pcresyntax documentation. Only the short names for properties are supported. For example, \p{L} matches a letter. Its Perl synonym, \p{Letter}, is not supported. Furthermore, in Perl, many properties may optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE does not support this.

(8) By anonymous on 2020-07-07 09:00:56 in reply to 4 [link]

I pondered for a while… Is μ or Ω really a Greek letter? Wait a second before you object.
If I try the following:
sqlite> select 'Ω' regexp '\p{Latin}';
1
sqlite> 

So the PCRE library considers this to be a letter not belonging to the Greek alphabet.
Or is something else going on?
The syntax is \p{unicode property or \p{unicode script}

The property option I normally use for, for example, searching for a symbol like
\p{S} or \p{Sm} if you only want to test for mathematical symbols.

So let's try this:
 sqlite> select 'Ω' regexp '\p{S}';
1
sqlite> 
I was lucky. Because all other Greek letters are probably not a symbol.
sqlite> select '≤' regexp '\p{S}';
1
sqlite> 
Which is indeed a symbol.
I've never used or seen a colleague using /p{Greek} or /p{Latin}. What I've seen is using /p to do some freaky complex searching like, for example, \p{lu} search for an uppercase letter that has a lower case variant.