Hidden noCase bug in ext/misc/regexp.c in pRe->zInit prefix optimization
(1.2) By andse-t610 on 2021-06-03 12:16:20 edited from 1.1 [source]
Hello! I'm trying to modify regexp.c
extension for my application. So I added 3-args regexp(re, text, noCase)
function.
I'm noticed, that queries like
select regexp('abc', 'ABC', 1);
select regexp('ABC', 'ABC', 1)
doesn't match.
But select regexp('ABC', 'abc', 1)
matches.
I found out that it is because of the prefix optimiation in re_compile in lines 668-691.
Upper, when calling re_subcompile_re at 651, the text of the regular expression is written in lowercase, so the prefix is stored in lowercase.
Then in re_match at line 217 the prefix is compared with text case-sensitive.
The simplest way to fix it - disable prefix optimization in noCase mode - modify line 676 from
if( pRe->aOp[0]==RE_OP_ANYSTAR ){
to
if( !noCase && pRe->aOp[0]==RE_OP_ANYSTAR ){
(2) By andse-t610 on 2021-06-03 12:15:53 in reply to 1.1 [link] [source]
Another approach - modify lines 216-217 in re_match
.
But there are three problems:
- first char optimization
zIn[in.i]!=x
is case sensitive always. Simplest- disable this optimization in noCase mode strcncmp
is case sensitive - we can usesqlite3_strnicmp
in noCase mode- (!) how to determine that current regexp compiled with noCase flag?
(3) By Richard Hipp (drh) on 2021-06-03 13:58:34 in reply to 1.2 [link] [source]
(4) By andse-t610 on 2021-06-04 10:39:08 in reply to 3 [link] [source]
Thank you for your lightning response! With such developers, sqlite is not in danger)
A question along the way - is there any reason not to use flag SQLITE_DETERMINISTIC
when declaring functions regexp
and regexpi
?
what bad (or good or nothing) things can happen if I use it?
I have found that without this flag i can't use regexp
in the partial index (as the docs says).
(6) By Richard Hipp (drh) on 2021-06-04 11:46:51 in reply to 4 [link] [source]
A problem with using the SQLITE_DETERMINISTIC flag on REGEXP and thus permitting its use in the schema, for example in the WHERE clause of a partial index, is that different installations might easily use different definitions of REGEXP. The commonly used ICU extension defines a slightly different REGEXP operator, for example. And I think versions of REGEXP based on Perl regular expressions are in wide circulation. If you create an index using the built-in REGEXP and then some other application tries to use it using a different definition of REGEXP, then the index might not work right.
(7) By andse-t610 on 2021-06-04 12:49:55 in reply to 6 [link] [source]
Clearly. I think we'll get the same problems if we change the collation logic on the existing database (like here)
All clear. Thanks a lot!
(5) By andse-t610 on 2021-06-04 11:06:22 in reply to 3 [link] [source]
The function regexpi
should be mentioned in the docs in lines 10-11