REGEXP() Issues: matching with Japanese characters/ matching "End of String" with $

(1.1) By turtlewilly on 2022-07-03 14:19:37 edited from 1.0 [source]

Hi there,

we're investigating a few strange usability issues in one of our apps. After long research we finally tracked it down to sqlite and how it somewhat seems to have unexpected results (or rather none) when Japanese characters (CJK?) are involved. Testing the same situations with non-Japanese strings (ASCII, Greek, Cyrillic, German, etc.) works normally, but with Japanese it fails in rather strange ways and returning no results.

A somewhat unrelated issue involves matching with $ for "end of string". This typically breaks entirely.

Maybe someone could help me to understand the reasons? I appreciate any help and insight into this issue in advance. I hope the examples are self-explanatory, but not too complex/ messy either. Thank you very much!

Note: we tested this with up-to-date builds on Linux and macOS, also with an older Linux-based build (used by the app) and some matching external regexp module, all build from official sources. Also switching the character set to UTF16 didn't change a thing (but not 100% sure switching with PRAGMA encoding="UTF-16le"; actually did anything, so there's some question mark there.)

Build

$ gcc -DHAVE_READLINE -DSQLITE_ENABLE_EXPLAIN_COMMENTS -DHAVE_USLEEP \
   -DSQLITE_ENABLE_RTREE -DSQLITE_ENABLE_JSON1 -DSQLITE_TEMP_STORE=3 \
   shell.c sqlite3.c -o sqlite3regex /tmp/lib/libreadline.a \
   /tmp/lib/libhistory.a -ltermcap

Issue #1: Japanese Characters

Prepare

$ sqlite3
SQLite version 3.39.0 2022-06-25 14:57:57
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE TESTTABLE(ID INTEGER PRIMARY KEY AUTOINCREMENT, TESTSTRING TEXT NOT NULL);
INSERT INTO TESTTABLE VALUES(1,'foobar');
INSERT INTO TESTTABLE VALUES(2,'example');
INSERT INTO TESTTABLE VALUES(3,'Ελλάδα');
INSERT INTO TESTTABLE VALUES(4,'монгол хэл');
INSERT INTO TESTTABLE VALUES(5,'Übergrößengeschäft');
INSERT INTO TESTTABLE VALUES(6,'日本語');
COMMIT;

Example #1

Straight full 1:1 match, fails for the Japanese example string

> SELECT * FROM TESTTABLE WHERE regexp('foobar', TESTSTRING);
1|foobar
> SELECT * FROM TESTTABLE WHERE regexp('Ελλάδα', TESTSTRING);
3|Ελλάδα
> SELECT * FROM TESTTABLE WHERE regexp('монгол хэл', TESTSTRING);
4|монгол хэл
> SELECT * FROM TESTTABLE WHERE regexp('Übergrößengeschäft', TESTSTRING);
5|Übergrößengeschäft
> SELECT * FROM TESTTABLE WHERE regexp('日本語', TESTSTRING);
<no results>

Example #2

Matching start of string makes it work for Japanese too:

> SELECT * FROM TESTTABLE WHERE regexp('^foobar', TESTSTRING);
1|foobar
> SELECT * FROM TESTTABLE WHERE regexp('^Ελλάδα', TESTSTRING);
3|Ελλάδα
> SELECT * FROM TESTTABLE WHERE regexp('^монгол хэл', TESTSTRING);
4|монгол хэл
> SELECT * FROM TESTTABLE WHERE regexp('^Übergrößengeschäft', TESTSTRING);
5|Übergrößengeschäft
> SELECT * FROM TESTTABLE WHERE regexp('^日本語', TESTSTRING);
6|日本語

Example #3

Adding a "OR case" makes it work too

> SELECT * FROM TESTTABLE WHERE regexp('日本語|dummy', TESTSTRING);
6|日本語
> SELECT * FROM TESTTABLE WHERE regexp('日本語|日本語', TESTSTRING);
6|日本語

Example #4

But having to match more doesn't work again

> SELECT * FROM TESTTABLE WHERE regexp('^.*foobar', TESTSTRING);
1|foobar
> SELECT * FROM TESTTABLE WHERE regexp('^.*Ελλάδα', TESTSTRING);
3|Ελλάδα
> SELECT * FROM TESTTABLE WHERE regexp('^.*монгол хэл', TESTSTRING);
4|монгол хэл
> SELECT * FROM TESTTABLE WHERE regexp('^.*Übergrößengeschäft', TESTSTRING);
5|Übergrößengeschäft
> SELECT * FROM TESTTABLE WHERE regexp('^.*日本語', TESTSTRING);
<no results>

Example #5

Just matching part of the Japanese string, fails too

> SELECT * FROM TESTTABLE WHERE regexp('^.*本語', TESTSTRING);
<no results>
> SELECT * FROM TESTTABLE WHERE regexp('^.*語', TESTSTRING);
<no results>

Example #6

But like this it works again

> SELECT * FROM TESTTABLE WHERE regexp('.*日本語', TESTSTRING);
6|日本語
> SELECT * FROM TESTTABLE WHERE regexp('.*語', TESTSTRING);
6|日本語

Issue #2: Matching "End of String" with $ not at the end of the regular expression

Another, somewhat unrelated issue, matching "end of string" in complex situations seems to be broken too, e.g. "starts with 'foo' or ends with 'ample'":

> SELECT * FROM TESTTABLE WHERE regexp('ample$|^foo', TESTSTRING);
Runtime error: unrecognized character

Prepare

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE TESTTABLE(ID INTEGER PRIMARY KEY AUTOINCREMENT, TESTSTRING TEXT NOT NULL);
INSERT INTO TESTTABLE VALUES(1,'foobar #1');
INSERT INTO TESTTABLE VALUES(2,'foobar #2');
INSERT INTO TESTTABLE VALUES(3,'foobar #3');
INSERT INTO TESTTABLE VALUES(4,'foobar #10');
INSERT INTO TESTTABLE VALUES(5,'foobar #11');
INSERT INTO TESTTABLE VALUES(6,'foobar #200');
INSERT INTO TESTTABLE VALUES(7,'foobar #1000');
COMMIT;

Example

Goal to match "1", and "200", but not "10" or "1000", this won't do obviously:

> SELECT * FROM TESTTABLE WHERE regexp('#(1|200)', TESTSTRING);
1|foobar #1
4|foobar #10
5|foobar #11
6|foobar #200
7|foobar #1000

But limiting "1" to the end of the string should do, but it breaks:

> SELECT * FROM TESTTABLE WHERE regexp('#(1$|200)', TESTSTRING);
Runtime error: unmatched '('

(2) By Richard Hipp (drh) on 2022-07-03 14:27:42 in reply to 1.0 [link] [source]

Thanks for the bug report.

Please try again using the one-character fix to "regexp.c" shown at check-in c94595a6e15490b4 and let us know if that fixes your problem.

(3) By turtlewilly on 2022-07-03 14:49:06 in reply to 2 [link] [source]

Thank you very much for your reply. Yes, that change seems to fix the issue #1 nicely from what I can tell so far.

(4) By Richard Hipp (drh) on 2022-07-03 15:04:47 in reply to 3 [link] [source]

I didn't even notice issue #2. Why don't you start a separate forum thread for that.

That's a good policy, in general, for this forum and for any other. One topic per thread.

(5) By turtlewilly on 2022-07-03 16:45:21 in reply to 4 [link] [source]

Sorry about that. For us this started out as a singular issue, it actually just got separated a bit during formatting of the initial post. I was rather unsure if the issues are related or not, as e.g. adding ^ (the opposite function of $) could change the results too. When I tested $ with Japanese strings in that context it just started error'ing out completely. Related? Unrelated? One issue, or two? I have no idea about the inner workings of sqlite, before investigating this I never even used sqlite, in fact.

Now I'm also a bit unsure, if your question was just rhetorical or do you actually want me to open a new thread and copy the information over?

But back to issue #1 actually. Since you explicitly mentioned 3 byte characters it actually got me thinking in the meantime about 4 byte characters. UTF8 can go up to 6, I think? But usage typically sticks with max. 4 these days, I read earlier. This probably has little practical impact in our case, even I think some CJK characters are in that 4 byte range. But, f.ex. emojis are rather common these days and results are clearly still a bit off:

sqlite> INSERT INTO TESTTABLE VALUES(7,'😎😂😁');
sqlite> SELECT * FROM TESTTABLE WHERE regexp('😎', TESTSTRING);
<no results>
sqlite> SELECT * FROM TESTTABLE WHERE regexp('.*😎', TESTSTRING);
7|😎😂😁
sqlite> SELECT * FROM TESTTABLE WHERE regexp('^.*😎', TESTSTRING);
<no results>
sqlite> SELECT * FROM TESTTABLE WHERE regexp('^😎', TESTSTRING);
<no results>

This is with your earlier fix applied.

Thanks again for looking into the issues! Greatly appreciated.

(6) By Stephan Beal (stephan) on 2022-07-03 17:28:27 in reply to 5 [link] [source]

Now I'm also a bit unsure, if your question was just rhetorical or do you actually want me to open a new thread and copy the information over?

It wasn't rhetorical. It's generally easier to process issue reports if there's only one distinct issue per report. In a long report like yours it's easy to miss that the trailing part is a separate issue.

Since you explicitly mentioned 3 byte characters it actually got me thinking in the meantime about 4 byte characters. UTF8 can go up to 6, I think?

It used to, but is currently limited to 4 bytes. According to Wikipedia:

In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.

(7) By Richard Hipp (drh) on 2022-07-03 18:13:20 in reply to 1.1 [link] [source]

An enhancement to implement the functionality in "Issue #2" is now on trunk.

(8) By turtlewilly on 2022-07-04 14:23:42 in reply to 7 [link] [source]

Thank you very much, Richard! Quick work! I build the trunk version and tested it. Your function enhancement seems to work very well for the various test cases we had that triggered an error previously.

Would handling for "^" require the very same enhancement? It doesn't seem to error out like $ previously… just doesn't return results it seems:

BEGIN TRANSACTION;
CREATE TABLE foo(t TEXT NOT NULL);
INSERT INTO foo (t) VALUES ('1'), ('2'), ('3'), ('4'), ('5'), ('6'), ('7'), ('8'), ('9');
COMMIT;
SELECT * FROM foo WHERE REGEXP('1|^7|8$|6',t);
1
6
8

("7" is missing in the result set)