LIKE optimization - limitation to string literals

(1) By Wolfgang Oertl (w.oertl) on 2021-07-18 22:08:06 [link] [source]

As stated in the documentation, the LIKE operator will only use an index if the right hand side is a string literal or a parameter bound to a string literal (among a few other conditions). Why is an expression yielding a constant string not acceptable? Specifically, I have tried this:

CREATE TABLE t1 (
  id INTEGER NOT NULL PRIMARY KEY,
  name TEXT NOT NULL,
  name_simple TEXT NOT NULL AS (UPPER(name)) STORED
);
CREATE INDEX t1_name_simple ON t1(name_simple);
PRAGMA case_sensitive_like = on;

EXPLAIN QUERY PLAN SELECT *
FROM t1
WHERE name_simple LIKE 'A%';

EXPLAIN QUERY PLAN SELECT *
FROM t1
WHERE name_simple LIKE UPPER('a%');

Output:

QUERY PLAN
`--SEARCH t1 USING INDEX t1_b (name_simple>? AND name_simple<?)
QUERY PLAN
`--SCAN t1

The first SELECT uses the index, while the second doesn't. How about allowing this kind of RHS? When using EXPLAIN, I see that a "Once" opcode skips over the UPPER() invocation on all but the first iteration, so it seems to be known to the query planner that this is a constant.

When ICU is used, the like() operator is overloaded and then the LIKE optimization is also disabled. I plan to implement a function in place of UPPER that converts strings (persons' names) into plain uppercase ASCII, thereby allowing efficient searching. Of course I could first apply that function from C before binding the parameter, but I'd rather keep the logic in the SQL statement. Thanks for your opinions.

(2) By Keith Medcalf (kmedcalf) on 2021-07-18 22:56:23 in reply to 1 [link] [source]

Why is an expression yielding a constant string not acceptable?

I would suspect that this is because there is no way to examine that "constant string" at compile (prepare) time in order to determine whether or not an index will be useful.

Since this cannot be determined at compile (prepare) time, the statement must be prepared so that it can work under all conditions -- meaning that a scan of all rows must be performed.

Changing this would require that the compile (prepare) be able to "run" the code generating the RHS so that the value can be examined to determine if an index will be useful. Prepare (compile) does not run (execute) the code it is generating.

The only thing that the code generator could do would be to generate bytecode handling both paths and dynamically choose which one to use at execution time.

While this would solve the issue you are seeing, you could also solve it by simply passing in the RHS directly and not diddling it in the dark. Although the prepare could be modified to generate output that works most efficiently even when diddling in the dark occurs, this would make the resulting VDBE program less efficient in all cases except in cases where the RHS has been diddled in the dark.

So the question really boils down to whether to have every other user of SQLite3 pay for you wanting to "diddle in the dark" whenever the mood strikes, or whether you should merely "diddle in the light" in which case the issue does not exist for anyone.

(3) By Keith Medcalf (kmedcalf) on 2021-07-18 23:11:46 in reply to 1 [link] [source]

You could just declare an index properly:

CREATE INDEX t1_name ON t1(name collate nocase);

getting rid of the current extraneous name_simple column and t1_name_simple index, not futz with the case_sensitive_like setting and merely do:

select * from t1 where name like 'A%';
-- or --
select * from t1 where name like 'a%';

which will then work correctly and use the index since the RHS is now used directly and the provided value does not start with a wildcard.

If you really want to have a name_simple column defined like that for some reason, then simply fix your index (create index t1_name_simple on t1 (name_simple collate nocase)), get rid of the diddling with the case_sensitive_like setting, and simply:

select * from t1 where name_simple like 'A%';
-- or --
select * from t1 where name_simple like 'a%';

Neither of these have any requirement for diddling in the dark.

(4) By Wolfgang Oertl (w.oertl) on 2021-07-19 06:43:24 in reply to 3 [link] [source]

Thank you for your suggestions, which I have tried before and found that of course this doesn't work outside of ASCII, like German umlauts. After loading the ICU extension, like() is overridden and indices aren't used - regardless of any fiddling with COLLATE - for all LIKE operators anywhere, a possibly hard to detect side effect even for COLLATE NOCASE indices or fields. I'll have to do without this extension, I think, but then I'm back to an artificial ASCII only name column.

Note that the ICU extension's LIKE stops working after PRAGMA case_sensitive_like is set. This is an interesting side effect that is documented in ICU's README referenced above. There seems no way to go back to the ICU like() implementation after setting the PRAGMA.

.load libIcu.so
CREATE INDEX t1_name ON t1(name);
CREATE INDEX t1_name_nc ON t1(name COLLATE NOCASE);
INSERT INTO t1(name) VALUES ('Ägidius');

EXPLAIN QUERY PLAN SELECT * FROM t1 WHERE name LIKE 'ä%';
PRAGMA case_sensitive_like=on;
EXPLAIN QUERY PLAN SELECT * FROM t1 WHERE name LIKE 'ä%';
PRAGMA case_sensitive_like=off;
EXPLAIN QUERY PLAN SELECT * FROM t1 WHERE name LIKE 'ä%';

Output:

QUERY PLAN
`--SCAN t1
QUERY PLAN
`--SEARCH t1 USING COVERING INDEX t1_name (name>? AND name<?)
QUERY PLAN
`--SEARCH t1 USING COVERING INDEX t1_name_nc (name>? AND name<?)

Only the first variant, before any PRAGMA case_sensitive_like, uses the ICU like() function and yields the desired row, but performs a full table scan.

Disregarding all of the above, I think a query like this should use an index (with COLLATE NOCASE):

SELECT * FROM t1 WHERE name LIKE RTRIM(?) || '%';

As I wrote in my first post, the byte-code for this statement contains a "Once" opcode that skips the computation of the right hand side in the second and later iterations, so the constant nature of the RHS is already known to the query planner.

(5) By Keith Medcalf (kmedcalf) on 2021-07-19 18:19:38 in reply to 4 [link] [source]

As a general process, the value of parameters may be used to optimize the generated query plan. However, at the time the prepare is first executed the values of the parameters are completely unknown (they are not available), so code is generated which will work for any and all values of the parameters. This should always result in a "full scan" of all candidates to any LIKE or GLOB constraint because there is no way of knowing whether or not the parameter will begin with a wild-card. The resulting VDBE code must work whether or not the parameter provided at "runtime" starts with a wild-card. There are a few other conditions in which knowledge of the values of parameters may affect the query plan because the value affects the selectivity of an index.

In all such cases, the query planner has to generate a plan that will work no matter what the value of the parameter(s) since the parameter values are unknowable at prepare time.

However, at the time of execution of the first step() the values of the parameters is available. If STAT4 is enabled then the query plan will be revisited at the first step (since parameters are now bound and knowable) and the query plan may be revised based on the newly available parameter data if examination of the parameter data indicates and generates a more efficient plan by, for example, allowing the use of an index which has become more (or less) selective.

However, when generating the query plan I do not think the optimizer looks deeply into parameter values and does not evaluate expressions (they are black boxes since the planner does not execute the query yet) and so, for example, in cases where you used an expression that transmutes the value of a parameter before using it (an arithmetic expression, function, etc) there is no way to evaluate the affect of the result of the expression from the value of the parameter, hence the planner must generate code that will always work.

So, in your example:

SELECT * FROM t1 WHERE name LIKE RTRIM(?) || '%';

there is no way to know that the parameter will allow the optimization of using an index to constrain the table search because it is not used directly -- it is part of an expression. If, however, you did the RTRIM(?) || '%' in the application code and executed:

SELECT * FROM t1 WHERE name LIKE ?;

passing the result of RTRTIM(?) || '%' as a parameter, the planner can now examine the parameter at the first step (after the parameters are bound) and if the first character is not a wild-card then re-generate the query plan to include a partial scan (via an index) if the proper indexes are available.

Overriding the LIKE function and various different collations being available will obviously change how this process works.

I believe this is an accurate description of how the query plan is generated in such circumstances and if not, then I am sure that either Richard or Larry will correct the inaccuracies.

(6) By Wolfgang Oertl (w.oertl) on 2021-07-19 20:29:28 in reply to 5 [source]

Thanks for the clear and insightful explanation, as always!