Hidden noCase bug in ext/misc/regexp.c in pRe->zInit prefix optimization
(1.2) By andse-t610 on 2021-06-03 12:16:20 edited from 1.1 [link]
Hello! I'm trying to modify `regexp.c` extension for my application. So I added 3-args `regexp(re, text, noCase)` function. I'm noticed, that queries like > select regexp('abc', 'ABC', 1); > select regexp('ABC', 'ABC', 1) doesn't match. But `select regexp('ABC', 'abc', 1)` matches. I found out that it is because of the prefix optimiation in re_compile in lines [668-691](https://www.sqlite.org/cgi/src/file?ci=trunk&name=ext/misc/regexp.c&ln=668-692). Upper, when calling [re_subcompile_re at 651](https://www.sqlite.org/cgi/src/file?ci=trunk&name=ext/misc/regexp.c&ln=651), the text of the regular expression is written in lowercase, so the prefix is stored in lowercase. Then in re_match at line [217](https://www.sqlite.org/cgi/src/file?ci=trunk&name=ext/misc/regexp.c&ln=217) the prefix is compared with text *case-sensitive*. The simplest way to fix it - disable prefix optimization in noCase mode - modify line [676](https://www.sqlite.org/cgi/src/file?ci=trunk&name=ext/misc/regexp.c&ln=676) from > if( pRe->aOp[0]==RE_OP_ANYSTAR ){ to > if( !noCase && pRe->aOp[0]==RE_OP_ANYSTAR ){ ____
(2) By andse-t610 on 2021-06-03 12:15:53 in reply to 1.1
Another approach - modify lines [216-217](https://www.sqlite.org/cgi/src/file?ci=trunk&name=ext/misc/regexp.c&ln=216-217) in `re_match`. But there are three problems: * first char optimization `zIn[in.i]!=x` is case sensitive always. Simplest- disable this optimization in _noCase mode_ * `strcncmp` is case sensitive - we can use `sqlite3_strnicmp` in _noCase mode_ * (!) how to determine that current regexp compiled with noCase flag?
(3) By Richard Hipp (drh) on 2021-06-03 13:58:34 in reply to 1.2 [link]
This bug, plus one other, now fixed on [trunk][1]. I also added a new SQL function "regexpi(PATTERN,STRING)" that does case-independent REGEXP matching. And the regexp extension has been added to the [CLI][2]. [1]: https://www.sqlite.org/src/timeline?c=5d4535bfb603d7c8 [2]: https://www.sqlite.org/cli.html
(4) By andse-t610 on 2021-06-04 10:39:08 in reply to 3 [link]
Thank you for your lightning response! With such developers, sqlite is not in danger) A question along the way - is there any reason not to use flag `SQLITE_DETERMINISTIC` when declaring functions `regexp` and `regexpi`? what bad (or good or nothing) things can happen if I use it? I have found that without this flag i can't use `regexp` in the partial index (as the [docs](https://www.sqlite.org/c3ref/c_deterministic.html#sqlitedeterministic) says).
(6) By Richard Hipp (drh) on 2021-06-04 11:46:51 in reply to 4 [link]
A problem with using the SQLITE_DETERMINISTIC flag on REGEXP and thus permitting its use in the schema, for example in the WHERE clause of a partial index, is that different installations might easily use different definitions of REGEXP. The commonly used ICU extension defines a slightly different REGEXP operator, for example. And I think versions of REGEXP based on Perl regular expressions are in wide circulation. If you create an index using the built-in REGEXP and then some other application tries to use it using a different definition of REGEXP, then the index might not work right.
(7) By andse-t610 on 2021-06-04 12:49:55 in reply to 6 [link]
Clearly. I think we'll get the same problems if we change the collation logic on the existing database (like [here](https://sqlite.org/forum/forumpost/d1db925a1b?t=h)) All clear. Thanks a lot!
(5) By andse-t610 on 2021-06-04 11:06:22 in reply to 3 [link]
The function `regexpi` should be mentioned in the docs in lines [10-11](https://sqlite.org/src/file?name=ext/misc/regexp.c&ln=17-18)