How to determine the best index to create given a query?
(1.1) By 6kEs4Majrd on 2022-03-01 03:42:28 edited from 1.0
https://www.sqlite.org/eqp.html
The above doc seems to be about how, given indexes that have already been created, SQLite makes the most effective use of them for a query.
What if I want to know the best index for a given query?
For the following example, (col1, col2, col3) is the primary key.
SELECT col1, col2, col3, col4 FROM mytab WHERE col2<someval2 AND col3<someval3 ORDER BY col1, col2, col3
Is it best to have an index on (col2, col3)? Or two indexes, one on col2 only and the other on col3 only?
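That is, the alternatives I am comparing are (index names are made up):

    -- option 1: a single composite index
    CREATE INDEX idx_c2c3 ON mytab(col2, col3);

    -- option 2: two single-column indexes
    CREATE INDEX idx_c2 ON mytab(col2);
    CREATE INDEX idx_c3 ON mytab(col3);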
What if the example is changed to
SELECT col1, col2, col3, col4 FROM mytab WHERE col2<someval2 AND col3<someval3 ORDER BY col1, col2
In other words, does the sort order affect what indexes should be created?
Is there an easy-to-follow rule to determine the best index given an arbitrary query?
(2) By Keith Medcalf (kmedcalf) on 2022-03-01 03:57:09 in reply to 1.1
In other words, does the sort order affect what indexes should be created?
Yes.
Is there an easy-to-follow rule to determine the best index given an arbitrary query?
See the CLI .expert command.
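For example, in the sqlite3 shell (the suggested index shown is only illustrative; actual .expert output depends on your schema and data):

    sqlite> .expert
    sqlite> SELECT col1, col2, col3, col4 FROM mytab
       ...>  WHERE col2 < 10 AND col3 < 20
       ...>  ORDER BY col1, col2, col3;
    CREATE INDEX mytab_idx_xxxx ON mytab(col2); -- illustrative; real output varies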
(5) By 6kEs4Majrd on 2022-03-01 14:34:59 in reply to 2 [link] [source]
Can .expert give the best indexes needed for a query? If so, can I just use the result to automatically create the needed indexes whenever I issue a query?
(9) By Simon Slavin (slavin) on 2022-03-02 18:45:01 in reply to 5
.expert makes output intended for humans to read, not for software to process.
1. Create your tables.
2. Put convincing, lifelike data in them. This stage is important: the data is used in the analysis.
3. Use ANALYZE.
4. Use the .expert command.
5. Make the indexes it shows.
Repeat (4) and (5) until no new indexes are suggested, as sketched below.
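A minimal sqlite3-shell sketch of steps (3) to (5), using the second query from the original post (the suggested index is a stand-in; real .expert output depends on the data):

    sqlite> ANALYZE;
    sqlite> .expert
    sqlite> SELECT col1, col2, col3, col4 FROM mytab
       ...>  WHERE col2 < 10 AND col3 < 20 ORDER BY col1, col2;
    CREATE INDEX mytab_idx_xxxx ON mytab(col2); -- stand-in suggestion
    sqlite> CREATE INDEX mytab_idx_xxxx ON mytab(col2);
    -- now rerun .expert with the same query;
    -- stop when no new index is suggested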
(3) By Gunter Hick (gunter_hick) on 2022-03-01 07:14:46 in reply to 1.1
Here is a simple albeit tedious way to determine the optimal set of indices to create:

- Take the set of queries your application intends to run against a certain table.
- Modify the table definition to not have UNIQUE or PRIMARY KEY (except the magical INTEGER PRIMARY KEY) in favor of explicitly created UNIQUE indices.
- Add more indices to cover all of the fields mentioned in any of the queries, in every possible order of fields mentioned in WHERE, ORDER BY, GROUP BY and JOIN ON clauses. Include all fields in the select list.
- Load your tables with representative data, both in structure and in volume.
- Run ANALYZE.
- Prefix each query with EXPLAIN QUERY PLAN and take note of the indices used.
- For a plain JOIN (or comma in the FROM clause), use CROSS JOIN and permute the order of the tables, while timing the actual run time of the query (and noting the query plan). See the sketch below.
- Try to eliminate preload bias (OS disk cache, SQLite page cache) as much as possible.

You may then determine which query plans SQLite's NGQP thinks are best and where it is mistaken, yielding the optimal set of queries and the set of indices mentioned in them. Drop any index not mentioned.

Good judgement may be exercised in pruning the set of queries to run, but remember that good judgement comes from experience, and a lot of that comes from bad judgement. Also, determining the "representative data" may be quite difficult until well after the application has been deployed.
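A minimal sketch of the EXPLAIN QUERY PLAN and CROSS JOIN permutation steps in the sqlite3 shell (table and column names are placeholders):

    sqlite> ANALYZE;
    sqlite> EXPLAIN QUERY PLAN
       ...> SELECT t1.a, t2.b FROM t1 JOIN t2 ON t1.a = t2.a;
    sqlite> .timer on
    sqlite> SELECT count(*) FROM t1 CROSS JOIN t2 ON t1.a = t2.a; -- forces t1 as the outer table
    sqlite> SELECT count(*) FROM t2 CROSS JOIN t1 ON t2.a = t1.a; -- forces t2 as the outer table

CROSS JOIN tells SQLite not to reorder the tables, so timing both orders shows which order the planner should have picked.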
(4) By 6kEs4Majrd on 2022-03-01 14:33:31 in reply to 3
- Add more indices to cover all of the fields mentioned in any of the queries, in every possible order of fields mentioned in WHERE, ORDER BY, GROUP BY and JOIN ON clauses. Include all fields in the select list.
So I have to create indexes to cover all possible cases in order to determine what indexes are necessary?
- Prefix each query with EXPLAIN QUERY PLAN and take note of the indices used.
So there is no way to tell with 100% certainty which indexes are needed from the query alone, without creating the above-mentioned indexes? If so, I don't quite understand why knowing the query is not sufficient to determine what the best indexes should be. Could you please explain the logic behind this?
(6) By Gunter Hick (gunter_hick) on 2022-03-01 15:19:14 in reply to 4
For simple queries, the "best" index is easily determined. Looking at "SELECT <field_list> FROM table ORDER BY a,b,c", it is obvious that an index on table(a,b,c) is useful to avoid sorting the records. Looking at "SELECT <field_list> FROM table WHERE a = 5 AND b < 7", it is obvious that the same index is useful. Looking at "SELECT a,b,c FROM table WHERE a = 5 AND b < 7", it is less obvious that the index can be used to retrieve the value of column c without ever having to retrieve an actual row. This is called a covering index. A sketch of that case follows below.

It starts to get complicated when more tables and constraints are added. Take "SELECT .. FROM t1 JOIN t2 ON (t1.a = t2.a)" for example. If there is an index on t2.a but none on t1.a, then matching a t2 row to a given t1 row becomes an index search, while matching a t1 row to a given t2 row requires a full table scan. So you must add both indices to find out which one the query planner prefers. And you need to time both "t1 CROSS JOIN t2" and "t2 CROSS JOIN t1" in the presence of both indices to find out which is actually faster. These may not be the same, especially if the "shape" of your test data does not match that of the production data, or if you have not run ANALYZE on a database populated with typical production data.

There are analytical rules to derive probably-good indices, which the .expert command applies. And there are measurements to find out the best indices from the set of available indices. You cannot measure the performance of an index that is not present. But you can eliminate an unneeded index from the production schema.
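A minimal sketch of the covering-index distinction (throwaway schema; the EXPLAIN QUERY PLAN wording is typical of recent SQLite versions and may differ in older ones):

    CREATE TABLE t(id INTEGER PRIMARY KEY, a, b, c, d);
    CREATE INDEX t_abc ON t(a, b, c);

    -- Every selected column is present in the index,
    -- so no table row ever needs to be fetched:
    EXPLAIN QUERY PLAN SELECT a, b, c FROM t WHERE a = 5 AND b < 7;
    -- typically: SEARCH t USING COVERING INDEX t_abc (a=? AND b<?)

    -- Column d is not in the index, so each match
    -- also fetches the corresponding table row:
    EXPLAIN QUERY PLAN SELECT a, b, c, d FROM t WHERE a = 5 AND b < 7;
    -- typically: SEARCH t USING INDEX t_abc (a=? AND b<?)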
(7) By Ryan Smith (cuz) on 2022-03-01 18:32:19 in reply to 4
So I have to create indexes to cover all possible cases in order to determine what indexes are necessary?
Wait, you are confusing the people here and yourself. There are two possible things you could be interested in, and from your statements it is hard to know which one you are after.
You could wish to know:
- Which Indexes are well suited to most queries on your tables
- Which Indexes are the very best possible for all queries on your tables
Those two questions call for markedly different processes of figuring out, from the quick and easy solutions offered for 1, to Gunter's very thorough solution for 2, and many in between. It's like asking "How should I exercise to run much faster?" vs. "How should I exercise to be the fastest athlete in the World?". Only one of those has a relatively simple answer.
If so, I don't quite understand why knowing the query is not sufficient to determine what the best indexes should be
That is because the query planner takes a lot of things into account to decide which Indexes (if indeed any) it will use to iterate and/or look up data, which you cannot know by simply staring at the schema/query. This is true of all Database engines, not just SQLite.
Some of the things that are taken into account are the types of joins asked for, the order of joins, the sub-queries or CTE queries processed alongside, whether the output is sorted or not, whether the sub-query data is sorted or not, the shape of the data (if that is known), etc.
The decision matrix is complex (like most forms of AI) and is often unintuitive and quite hard to follow, even for the designers. That's why EXPLAIN QUERY PLAN and its ilk exist.
Further to this, it will often invent a new Index on the fly where it thinks the time the index saves will outweigh the cost of making it. When it does, it's easy to see that adding THAT specific index will help.
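For example (throwaway schema; the plan text is typical of recent SQLite versions but may vary):

    CREATE TABLE t1(a, b);
    CREATE TABLE t2(a, c);
    -- With no index on t2.a, the planner may build one on the fly
    -- (depending on its cost estimates):
    EXPLAIN QUERY PLAN SELECT * FROM t1 JOIN t2 ON t1.a = t2.a;
    -- typical plan:
    --   SCAN t1
    --   SEARCH t2 USING AUTOMATIC COVERING INDEX (a=?)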
The .expert mode enables one to "peek" into its soul to see whether it would have preferred an index, even if perhaps not enough to justify constructing it during the query, so you can decide to add it and indeed help.
That's all good and helpful, with one caveat: you are unlikely to think of ALL the ways a table or set of tables can be queried and joined together at the outset. That's where Gunter's suggestion is gold, but it too requires further exploration, in that the shape of the data (as more and more data is added) may itself either affect the speed of a query that seemed fast at the start, or directly influence the query planner's decisions. That is why it was suggested that you add data resembling the real data that will eventually fill your tables and try the various resolves on those.
I'll end with this thought: this is not an easy problem. It keeps many software/DB engineers up at night today, as it has in the past. The SQLite devs have spent 20+ years refining this query planner and are still very much at it. If you can come up with a way to universally "just look at a query" and figure out what would be the best indexes for it, you'll be rich.
(8) By doug (doug9forester) on 2022-03-01 20:21:26 in reply to 7
Ryan, two words: machine learning. What if we took some 100 million queries and fed them into a machine learning service like Amazon's AWS offers? Obviously, the queries by themselves would not be sufficient. What would the AI learning machine need to be able to answer the question: what set of indices are best for my system? Obviously, it's data dependent, but the form of the data is just one more input to the AI.
Some of the inputs would be: the sql; the set of tables and their columns and attributes; how the query performed including the set of SQLite subfunctions; what indices exist; quantify and characterize the data; ...
The benefits: faster queries, faster SQLite, streamlined user processes, ???
How to proceed: Instrument SQLite to generate stats and collect them (with user permission, of course). Lots of considerations about performance, privacy, etc. Outputs from AI might include suggested indices, data reorg, column affinity changes, SQLite action changes/additions, ...
AWS is giving away a year's free AI machine learning right now - I don't know what the catches are.
(10) By Simon Slavin (slavin) on 2022-03-02 18:49:46 in reply to 8
We do not need to do this. The problem that started this thread does not require AI to solve it. The work needed is already present in the .expert command as described above.
It's worth remembering that the more indexes you have on your tables, the larger the database file is and the slower every INSERT/UPDATE/DELETE command becomes. And the amount these things change is highly dependent on your OS, file system, and hardware: you can keep the same version of SQLite and the same data but get completely different results because you have a better motherboard with a faster bus. There is no point in running tests on an abstract SQLite machine.