INDEX usage on inequality and IN

(1) By anonymous on 2021-10-19 14:47:50 [link] [source]

Hi all, i write you below some line to reproduce a "strange" behavior I can't understand about using or not index. Why do you think parse engine act like that, and how can I rewrite the query to force sqlite use index?

create table table_a(id INTEGER PRIMARY KEY AUTOINCREMENT,type INTEGER); CREATE INDEX IF NOT EXISTS index_a ON table_a(type);

explain query plan select * from table_a WHERE id=1; QUERY PLAN --SEARCH TABLE table_a USING INTEGER PRIMARY KEY (rowid=?) explain query plan select * from table_a WHERE id!=1; QUERY PLAN--SCAN TABLE table_a

explain query plan select * from table_a WHERE id in (1); QUERY PLAN --SEARCH TABLE table_a USING INTEGER PRIMARY KEY (rowid=?) explain query plan select * from table_a WHERE id not in (1); QUERY PLAN--SCAN TABLE table_a

explain query plan select * from table_a WHERE type=1; QUERY PLAN --SEARCH TABLE table_a USING COVERING INDEX index_a (type=?) explain query plan select * from table_a WHERE type!=1; QUERY PLAN--SCAN TABLE table_a

explain query plan select * from table_a WHERE type IN (1); QUERY PLAN `--SEARCH TABLE table_a USING COVERING INDEX index_a (type=?)

explain query plan select * from table_a WHERE type NOT IN (1); QUERY PLAN `--SCAN TABLE table_a

(2.1) By Scott Robison (casaderobison) on 2021-10-19 15:37:48 edited from 2.0 in reply to 1 [link] [source]

Imagine you have a table with 1000000 rows. The id column is guaranteed to be unique. So when you search for a specific id value, SQLite knows that it can perform a search in logarithmic time by using the index to find the one true record.

When you search for all rows that don't match a single id value, SQLite knows that you will match either all rows or all but one row. Thus the advantage to using the index does not exist when your query will return N or N-1 rows.

The same holds true for the type column without the uniqueness guarantee. The query plan might be different based on statistics, but at this point in time that is the best information that SQLite has available.

(3) By anonymous on 2021-10-20 06:16:10 in reply to 2.1 [source]

Ok understand, but suppose you have 1.000.000 records and only 4 possible values for type. So basically you have 25% of type=1 25% type=2 and so on... So if I have an index on type and I do WHERE type!=1 the index can help me to skip 25% of records, is it correct? Otherwise how can I rewrite my query to benefit from index?

(5.1) By Simon Slavin (slavin) on 2021-10-20 12:18:34 edited from 5.0 in reply to 3 [link] [source]

You have 1,000,000 rows, and only 4 possible values. You are searching for all rows other than those with 1 of those 4 values. The column is indexed, so an index is available.

Under those circumstances the index may be useful. SQLite would figure that out automatically, and you don't need to write any special SQL to make it happen. If SQLite doesn't take advantage of the index, it's because it has figured out that reading the whole table and skipping the rows where type=1 is faster than using the index. (Please note that this is more complicated than you think, since SQLite has to retrieve information from the table for the rows you selected, so it has to read the table anyway.)

If you want to be sure SQLite has the best information to choose the fastest methods, put realistic data in the table and use the ANALYZE command:

https://www.sqlite.org/lang_analyze.html

You can use the command just once. The results are stored in the database and, unless the character of your data changes radically, you don't need to ANALYZE again.

(6) By Ryan Smith (cuz) on 2021-10-20 15:01:11 in reply to 3 [link] [source]

That is called "Cardinality", which in math terms is the size of a set, and in RDBMS terms is the size of the set of possible values in a column.

i.e. A table with 1,000,000 rows with only 4 possible values in Column A is said to have a Cardinality of 4 for Column A. By contrast if Column B was declared Unique and indeed contained all values, its cardinality would be 1 million.

Most SQL engines use exactly this cardinality figure to guess what would be best to do, use the index or not. The problem is they cannot magically "know" what the cardinality of data is, they have to really at some point look through each row in a table and for each column (or at least Indexed column) to calculate the cardinality. in SQLite this is done with:
ANALYZE table_name;

Based on the info stored when ANALYZE is run (which, depending on the compile choices, may store much more than just cardinality), the query planner is well equipped to make such decisions as you suggest above, and is more likely to pick a good index.

If you did not ANALYZE the table, but have your own personal ideas/beliefs about how well the content in a table conforms to the query filtering, you could hint this to the query planner using "likelihood()" - which allows you to say whether the result of a comparison is more likely, or less likely, to be TRUE.

Simon is correct though, the ideas around this and the decision by the query planner to use or not to use a specific index, is non-trivial and after many years of hard work, it still can easily guess/assume wrong, so it really pays to provide hints or analyze tables for anything more than a simple lookup.

(7.1) By Scott Robison (casaderobison) on 2021-10-20 18:26:05 edited from 7.0 in reply to 3 [link] [source]

Others have already mentioned analyze so I won't repeat that here.

One other complication though for using an index when using inequality matching is "where is the inequality in the set". Using your example, let's say type has four possible values, 1, 2, 3, and 4.

SELECT * FROM table WHERE type != 1
SELECT * FROM table WHERE type != 2

The first of these example queries will return all rows where type is 2, 3, or 4, which would be a contiguous set of index entries.

The second of the examples will return all rows where type is 1, 3, or 4, which would be two sets of index entries (get all rows that match type 1, skip all rows that match type 2, then get all rows that match type 3 & 4). Certainly this can be done, but it complicates the query planner based on a number of variables, especially if using prepared statements.

One way to rewrite the query:

SELECT * FROM table WHERE type < 2 OR type > 2

That might better indicate to the query planner that it can divide it up into two subqueries. Or perhaps even better:

SELECT * FROM table WHERE type < 2
UNION ALL
SELECT * FROM table WHERE type > 2

There are other possibilities that have been recommended in other responses.

In the end, you can't assume that the query plan is legitimate unless you are running it against the actual data. It is not enough to simply have a schema and explain a query plan, as the SQLite query planner takes a lot more information into account.

(4) By anonymous on 2021-10-20 10:56:27 in reply to 2.1 [link] [source]

Another example is with LIKE operator if you do WHERE LIKE 'TEST%' explainer tells you will use index WHERE NOT LIKE 'TEST%' just scan table

but you could end up in an example like that in the NOT LIKE statement gives you less result than the first