SELECT optimization for constant expression

(1) By Wolfgang Oertl (w.oertl) on 2021-07-20 23:00:15 [link] [source]

I'm trying to construct an SQL SELECT statement that takes parameters which, if NULL, do not influence the result. Of course performance should be good as well, i. e. indices should be used. The reason for this is that I want to use stored procedures (CG-SQL), so the SQL statements are not dynamically constructed to only contain expressions for non-NULL parameters.

In this simple example with just one column in the table (apart from the rowid), the first select of course yields no rows if the parameter ?1 is NULL, but uses the index. The second works as expected (i. e. would yield all rows if ?1 IS NULL), but performs a full table scan. The third is the same as the second but with literals instead of parameters; even then a table scan is performed.

CREATE TABLE t1 (a TEXT NOT NULL);
CREATE INDEX t1_a ON t1(a);
INSERT INTO t1(a) VALUES ('a1'), ('a2'), ('a3');

.parameter init
.parameter set ?1 'a1'

SELECT * FROM t1 WHERE a=?1;

SELECT * FROM t1 WHERE (?1 IS NULL OR a=?1);

SELECT * FROM t1 WHERE ('a2' IS NULL OR a='a2');

SELECT * FROM t1 WHERE a='a2' OR FALSE;

SELECT * FROM t1 WHERE a='a2' OR TRUE;

The second SELECT could be optimized to an index SEARCH after binding, the third right away and the fourth as well. The last one is being optimized and doesn't check a to be 'a2', in fact it has no condition at all (according to EXPLAIN).

Is there any way to arrive at an index-using query plan for SELECT Nr. 2 to 4 (I guess not), or would this require an improvement to the query planner?

This is an example of an actually useful query to illustrate what I'm trying to arrive at:

SELECT * FROM person
WHERE (?1 IS NULL OR firstname LIKE ?1)
AND (?2 IS NULL OR lastname LIKE ?2)
AND (?3 IS NULL OR birthdate == ?3)
AND (?4 IS NULL OR maidenname LIKE ?4)
ORDER BY lastname, firstname;

Thanks.

(2) By MBL (RoboManni) on 2021-07-21 05:53:18 in reply to 1 [link] [source]

Try reformulating your statement using CASE for the conditional decisions:

SELECT * FROM person
WHERE case when ?1 NOT NULL then firstname  LIKE ?1 else 1 end
  AND case when ?2 NOT NULL then lastname   LIKE ?2 else 1 end
  AND case when ?3 NOT NULL then birthdate   ==  ?3 else 1 end
  AND case when ?4 NOT NULL then maidenname LIKE ?4 else 1 end
ORDER BY lastname, firstname;

(4) By Wolfgang Oertl (w.oertl) on 2021-07-21 08:26:05 in reply to 2 [link] [source]

I haven't thought of this - but this makes the query plan more complicated, and does not use the indices. So, this is not an improvement.

(3.1) By Rico Mariani (rmariani) on 2021-07-21 08:15:06 edited from 3.0 in reply to 1 [link] [source]

SELECT * FROM person
WHERE (?1 IS NULL OR firstname LIKE ?1)
AND (?2 IS NULL OR lastname LIKE ?2)
AND (?3 IS NULL OR birthdate == ?3)
AND (?4 IS NULL OR maidenname LIKE ?4)
ORDER BY lastname, firstname;

This query will be challenging indeed.

You really want the results in order so there's a strong tension between

using the index to get the order and filtering those results so that no transient index needs to be created
trying to use indices for selectivity

But the problem is there is only one query plan allowed for the one query... and you might need 4 several different indices. Plus the QP doesn't know that the arguments will be of the form "x%" so its not likely to be tempted to use indices for that.

IS this http://cgsql.dev that you're using and if so do you have other constraints like maybe exactly one of those filters is specified?

Another strategy you could try depending on the selectivity is something like this:

SELECT * FROM person where firstname like 1?
union
SELECT * FROM person where lastname like 2?
union
SELECT * FROM person where birthdate = ?3
union
SELECT * FROM person where maidenname like ?4
order by lastname, firstname

Which might be better... assuming you don't really want all the rows if you specify none of the patterns.

It would be better still if you knew that only one of those is specified. In CG-SQL (CQL) you can do.

create proc getmystuff(fname TEXT, lname TEXT, bdate LONG, mname TEXT)
begin
  if fname is not null then
    SELECT * FROM person where firstname like fname 
    order by lastname, firstname;
  else if lname is not null then    
    SELECT * FROM person where lastname like lname
    order by lastname, firstname;
  else if bdate is not null then
    SELECT * FROM person where birthdate = bdate
    order by lastname, firstname;
  else if mname is not null then
    SELECT * FROM person where maidenname like mname
    order by lastname, firstname;
  end if;
end;

But that only works if you know that exactly one criteria will be specified.

And I'm not sure you're using the CG-SQL that I'm thinking of :D

(7) By anonymous on 2021-07-21 09:32:26 in reply to 3.1 [link] [source]

The query should only return a few rows, so sorting afterwards won't be expensive and I don't worry about this. I do need to combine multiple criteria and I expect that the query planner will use the most selective index or even more than one(?). I think the only solution right now is to dynamically build the query with only those criteria which are given.

Of course I'm talking about this (your) cg-sql - is there any other? So far I have built it from source and tried a few test cases, not much more.

(12.1) By Rico Mariani (rmariani) on 2021-07-21 15:23:36 edited from 12.0 in reply to 7 [link] [source]

If you decide to use manual SQL to cons up the statement you can still use CQL to create the result set shape for you and give you a typesafe contract. All you have to do is make the C code conform to the contract for procedures which is pretty easy. Then you declare the procedure and implement it yourself. It will only have to prepare a statement, so you avoid writing all the getters.

This is what that looks like:

-- example (no indices included in the example, you'll want some)
create table person (
  firstname text,
  lastname text,
  birthdate long,
  maidenname text
);

-- the shape of the arguments and the result are the same as the table
-- this doesn't have to be the case... but it is convenient if it works out
-- you can have different arguments and still return a person.  You
-- might need to do that if you have different nullability for instance.
declare proc match_various(like person) (like person);

-- if match_various is written by hand it only has to prepare the statement
-- and return it in the out argument
--
-- see the .c file for the contract of match_various
--
-- if you do it differently it will have arguments that match your spec.

-- extern CQL_WARN_UNUSED cql_code match_various(
--    sqlite3 *_Nonnull _db_,
--    sqlite3_stmt *_Nullable *_Nonnull _result_stmt,
--    cql_string_ref _Nullable firstname_,
--    cql_string_ref _Nullable lastname_,
--    cql_nullable_int64 birthdate_,
--    cql_string_ref _Nullable maidenname_);

-- the following calls match_various and creates a result set
-- the .h file will have the various functions you need to read from it
create proc return_various(like match_various arguments)
begin
  call match_various(from arguments);
end;

(5.1) By John Dennis (jdennis) on 2021-07-21 08:40:08 edited from 5.0 in reply to 1 [link] [source]

I have a similar table, named df_players.  Even ignoring all the parameters and the IS NULL, a very simple single column query demonstrates that the LIKE causes the table scan:

explain query plan select * from df_players where surname LIKE 'Dennis';
QUERY PLAN
`--SCAN TABLE df_players

explain query plan select * from df_players where surname = 'Dennis';
QUERY PLAN
`--SEARCH TABLE df_players USING INDEX pix1 (surname=?)

(6) By Wolfgang Oertl (w.oertl) on 2021-07-21 08:50:17 in reply to 5.1 [link] [source]

Create the index with COLLATE NOCASE. Only then will LIKE use the index, because LIKE is by default case insensitive. The CREATE INDEX I would have used for my example (not shown) would be like this.

CREATE INDEX df_players_surname ON df_players(surname COLLATE NOCASE);
EXPLAIN QUERY PLAN SELECT * FROM df_players WHERE surname LIKE 'Dennis';

(8) By John Dennis (jdennis) on 2021-07-21 10:15:28 in reply to 6 [link] [source]

Thanks. Showing my background in another database, where LIKE is case sensitive by default. Sorry for the red herring.

(11.1) By MBL (RoboManni) on 2021-07-21 13:44:27 edited from 11.0 in reply to 6 [source]

there was recently (the last few days) another thread about query plan in conjunction with LIKE.

If a pattern does not start with a constant (or literal?) then SCAN needs to be done instead of being able to use the index.

Maybe a splitting into an ugly looking sequence could help to make the query plan look as expected.

Example of what I mean

...
case substr(?1 COLLATE NOCASE,1,1) 
  when 'A' then name like 'A'||substr(?1,2)
  when 'B' then name like 'B'||substr(?1,2)
  when 'C' then name like 'C'||substr(?1,2)
...
  when 'Y' then name like 'Y'||substr(?1,2)
  when 'Z' then name like 'Z'||substr(?1,2)
else 
  name like ?1  -- uses SCAN query plan
end

EDIT: copy-paste-error correction

(14) By Rico Mariani (rmariani) on 2021-07-21 19:20:07 in reply to 11.1 [link] [source]

Oh that's cute!

(15) By Ryan Smith (cuz) on 2021-07-21 20:09:48 in reply to 11.1 [link] [source]

The name that comes to mind for this trick is: "Brute Forced Optimization". :)

(16) By Wolfgang Oertl (w.oertl) on 2021-07-21 20:43:54 in reply to 11.1 [link] [source]

This other thread about LIKE was also started by me and mostly concerned the ICU extension (for Unicode) in conjunction with LIKE. Anyway, the splitting above is useless. SQLite will make a new query plan after binding of parameters and therefore can detect LIKE patterns that do not start with a wildcard character, and it will automatically use an index; a few more conditions have to be met, however - read the fine material (esp. the last paragraph of section 5).

(9) By Ryan Smith (cuz) on 2021-07-21 11:03:05 in reply to 1 [link] [source]

I might be ignorant, but is not this query:

SELECT * FROM person
WHERE (?1 IS NULL OR firstname LIKE ?1)
AND (?2 IS NULL OR lastname LIKE ?2)
AND (?3 IS NULL OR birthdate == ?3)
AND (?4 IS NULL OR maidenname LIKE ?4)
ORDER BY lastname, firstname;

the same as this more appropriate should-be-index-using query:

SELECT * FROM person
 WHERE firstname LIKE ?1
    OR lastname LIKE ?2
    OR birthdate == ?3
    OR maidenname LIKE ?4
    OR (?1 IS NULL AND ?2 IS NULL AND ?3 IS NULL AND ?4 IS NULL)
ORDER BY lastname, firstname;

You could even leave out the final OR based on whether you would want to show No rows when no filters are set, or All rows in that case.

I have not tested the assumptions here, so I could be wrong about it.

(10) By Ryan Smith (cuz) on 2021-07-21 11:24:29 in reply to 9 [link] [source]

Nevermind, I am ignorant :)

The firs query's conditions are additive. Ignore that, I will look into a different solution.

(13) By Wolfgang Oertl (w.oertl) on 2021-07-21 15:26:40 in reply to 1 [link] [source]

I've just tried this in PostgreSQL for comparison on a table with 1000 rows and three columns (id, a TEXT, b TEXT) and indices on a and b.

wolfgang=> explain select * from t1 where (a = 'a500' or 'a500' is null);
                           QUERY PLAN                           
----------------------------------------------------------------
 Index Scan using t1_a on t1  (cost=0.28..8.29 rows=1 width=12)
   Index Cond: (a = 'a500'::text)
(2 Zeilen)

wolfgang=> explain select * from t1 where (a = 'a500' or 'a500' is null) and (b = null or null is null);
                           QUERY PLAN                           
----------------------------------------------------------------
 Index Scan using t1_a on t1  (cost=0.28..8.29 rows=1 width=12)
   Index Cond: (a = 'a500'::text)
(2 Zeilen)

wolfgang=> explain select * from t1 where (a = 'a500' or 'a500' is null) and (b = 'b500' or 'b500' is null);
                           QUERY PLAN                           
----------------------------------------------------------------
 Index Scan using t1_b on t1  (cost=0.28..8.29 rows=1 width=12)
   Index Cond: (b = 'b500'::text)
   Filter: (a = 'a500'::text)
(3 Zeilen)

So here the behavior is as I expect, assuming that "Index Scan" means an index lookup - one index is used, and the other condition is either ignored (if the parameter is NULL) or applied as a filter. It seems to work exactly the same with parameters, like this:

PREPARE foo1 (text) AS SELECT * FROM t1 WHERE ($1 IS NULL OR a=$1);
EXPLAIN EXECUTE foo1('a500');