SQLite Forum

Hidden noCase bug in ext/misc/regexp.c in pRe->zInit prefix optimization
Login

Hidden noCase bug in ext/misc/regexp.c in pRe->zInit prefix optimization

(1) By andse-t610 on 2021-06-01 23:21:30 updated by 1.1 [link] [source]

Hello! I'm trying to modify `regexp.c` extension for my application. So I added 3-args `regexp(re, text, noCase)` function.

I'm noticed, that queries like  

> select regexp('abc', 'ABC', 1);

> select regexp('ABC', 'ABC', 1)

doesn't match.

But `select regexp('ABC', 'abc', 1)` matches.


I found out that it is because of the prefix optimiation in re_compile in lines [668-691](https://www.sqlite.org/cgi/src/file?ci=trunk&name=ext/misc/regexp.c&ln=668-692). We can see that the prefix stored as is (no case modification).

Then in re_match at line [212](https://www.sqlite.org/cgi/src/file?ci=trunk&name=ext/misc/regexp.c&ln=217) the prefix is compared case-sensitive.

The simplest way to fix it - disable prefix optimization in noCase mode - modify line [676](https://www.sqlite.org/cgi/src/file?ci=trunk&name=ext/misc/regexp.c&ln=676) from

> if( pRe->aOp[0]==RE_OP_ANYSTAR ){

to

> if( !noCase && pRe->aOp[0]==RE_OP_ANYSTAR ){




____

Hidden noCase bug in ext/misc/regexp.c in pRe->zInit prefix optimization

(1.1) By andse-t610 on 2021-06-03 10:27:00 edited from 1.0 updated by 1.2 [link] [source]

Hello! I'm trying to modify `regexp.c` extension for my application. So I added 3-args `regexp(re, text, noCase)` function.

I'm noticed, that queries like  

> select regexp('abc', 'ABC', 1);

> select regexp('ABC', 'ABC', 1)

doesn't match.

But `select regexp('ABC', 'abc', 1)` matches.


I found out that it is because of the prefix optimiation in re_compile in lines [668-691](https://www.sqlite.org/cgi/src/file?ci=trunk&name=ext/misc/regexp.c&ln=668-692). 

Upper, when calling [re_subcompile_re at 651](https://www.sqlite.org/cgi/src/file?ci=trunk&name=ext/misc/regexp.c&ln=651), the text of the regular expression is written in lowercase, so the prefix is stored in lowercase.


Then in re_match at line [212](https://www.sqlite.org/cgi/src/file?ci=trunk&name=ext/misc/regexp.c&ln=217) the prefix is compared with text *case-sensitive*.

The simplest way to fix it - disable prefix optimization in noCase mode - modify line [676](https://www.sqlite.org/cgi/src/file?ci=trunk&name=ext/misc/regexp.c&ln=676) from

> if( pRe->aOp[0]==RE_OP_ANYSTAR ){

to

> if( !noCase && pRe->aOp[0]==RE_OP_ANYSTAR ){




____

(2) By andse-t610 on 2021-06-03 12:15:53 in reply to 1.1 [link] [source]

Another approach - modify lines 216-217 in re_match.

But there are three problems:

  • first char optimization zIn[in.i]!=x is case sensitive always. Simplest- disable this optimization in noCase mode
  • strcncmp is case sensitive - we can use sqlite3_strnicmp in noCase mode
  • (!) how to determine that current regexp compiled with noCase flag?

Hidden noCase bug in ext/misc/regexp.c in pRe->zInit prefix optimization

(1.2) By andse-t610 on 2021-06-03 12:16:20 edited from 1.1 [link] [source]

Hello! I'm trying to modify regexp.c extension for my application. So I added 3-args regexp(re, text, noCase) function.

I'm noticed, that queries like

select regexp('abc', 'ABC', 1);

select regexp('ABC', 'ABC', 1)

doesn't match.

But select regexp('ABC', 'abc', 1) matches.

I found out that it is because of the prefix optimiation in re_compile in lines 668-691.

Upper, when calling re_subcompile_re at 651, the text of the regular expression is written in lowercase, so the prefix is stored in lowercase.

Then in re_match at line 217 the prefix is compared with text case-sensitive.

The simplest way to fix it - disable prefix optimization in noCase mode - modify line 676 from

if( pRe->aOp[0]==RE_OP_ANYSTAR ){

to

if( !noCase && pRe->aOp[0]==RE_OP_ANYSTAR ){


(3) By Richard Hipp (drh) on 2021-06-03 13:58:34 in reply to 1.2 [source]

This bug, plus one other, now fixed on trunk. I also added a new SQL function "regexpi(PATTERN,STRING)" that does case-independent REGEXP matching. And the regexp extension has been added to the CLI.

(4) By andse-t610 on 2021-06-04 10:39:08 in reply to 3 [link] [source]

Thank you for your lightning response! With such developers, sqlite is not in danger)

A question along the way - is there any reason not to use flag SQLITE_DETERMINISTIC when declaring functions regexp and regexpi?

what bad (or good or nothing) things can happen if I use it?

I have found that without this flag i can't use regexp in the partial index (as the docs says).

(5) By andse-t610 on 2021-06-04 11:06:22 in reply to 3 [link] [source]

The function regexpi should be mentioned in the docs in lines 10-11

(6) By Richard Hipp (drh) on 2021-06-04 11:46:51 in reply to 4 [link] [source]

A problem with using the SQLITE_DETERMINISTIC flag on REGEXP and thus permitting its use in the schema, for example in the WHERE clause of a partial index, is that different installations might easily use different definitions of REGEXP. The commonly used ICU extension defines a slightly different REGEXP operator, for example. And I think versions of REGEXP based on Perl regular expressions are in wide circulation. If you create an index using the built-in REGEXP and then some other application tries to use it using a different definition of REGEXP, then the index might not work right.

(7) By andse-t610 on 2021-06-04 12:49:55 in reply to 6 [link] [source]

Clearly. I think we'll get the same problems if we change the collation logic on the existing database (like here)

All clear. Thanks a lot!