SQLite User Forum

SQLite query is slower with appropriate index

(1) By bepis (bbepis) on 2022-09-06 08:33:28

Copy-pasting this from https://stackoverflow.com/questions/73618407/sqlite-query-is-slower-with-appropriate-index


I'm trying to improve the performance of a query with this table:

CREATE TABLE frag(hash primary key, type, ulen, clen, offset, refs);

SELECT MAX(offset + clen) AS 'EndPos', type AS 'Type' FROM frag GROUP BY type;

 

My query plan is as such:

sqlite> EXPLAIN QUERY PLAN SELECT MAX(offset + clen) AS 'Offset', type AS 'Type' FROM frag GROUP BY type;
QUERY PLAN
|--SCAN TABLE frag
`--USE TEMP B-TREE FOR GROUP BY

 

After creating a new index, the query plan changes to this:

CREATE INDEX max_frag ON frag(type, offset+clen DESC);

sqlite> EXPLAIN QUERY PLAN SELECT MAX(offset + clen) AS 'EndPos', type AS 'Type' FROM frag GROUP BY type;
QUERY PLAN
`--SCAN TABLE frag USING INDEX max_frag

 

My issue is that my query has slowed from 90s to 280s with this new index. My database is 5.6 GB, with 45.5 million rows (on an NVMe drive).

Why is a full table scan faster than using this index? .expert is telling me that no new index can improve this query.
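
For reference, the plan flip described above can be reproduced in a few lines. This is a toy sketch with made-up data (the table and index names are from the post; sqlite3 here is Python's built-in binding):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE frag(hash TEXT PRIMARY KEY, type, ulen, clen, offset, refs)")
con.executemany("INSERT INTO frag VALUES (?, ?, 0, ?, ?, 1)",
                [(f"h{i}", i % 4, i % 9, i) for i in range(1000)])

QUERY = "SELECT MAX(offset + clen) AS EndPos, type AS Type FROM frag GROUP BY type"

def plan():
    # each EXPLAIN QUERY PLAN row carries its plan text in the last column
    return " | ".join(r[3] for r in con.execute("EXPLAIN QUERY PLAN " + QUERY))

before = plan()  # scan plus a temp b-tree for the GROUP BY
con.execute("CREATE INDEX max_frag ON frag(type, offset+clen DESC)")
after = plan()   # a single walk of max_frag
print(before)
print(after)
```

The plan text changes exactly as shown in the post; whether it is actually faster is what the rest of this thread is about.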

(2.1) By Chris Locke (chrisjlocke1) on 2022-09-06 09:09:32 edited from 2.0 in reply to 1

Have you run the 'analyze' command at all?

"The statistical data is not automatically updated as the index values change. If the contents or distribution of an index changes significantly, it would be wise to reanalyze the appropriate database or table. Another option would be to simply delete the statistical data, as no data is usually better than incorrect data."

The query optimizer uses this data to decide what to do.
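
For example, ANALYZE populates the sqlite_stat1 table, which the planner consults. A sketch with made-up data (the exact stat strings depend on the data):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE frag(hash TEXT PRIMARY KEY, type, ulen, clen, offset, refs)")
con.execute("CREATE INDEX max_frag ON frag(type, offset+clen DESC)")
con.executemany("INSERT INTO frag VALUES (?, ?, 0, ?, ?, 1)",
                [(f"h{i}", i % 3, i, i * 10) for i in range(1000)])

con.execute("ANALYZE")  # gathers per-index statistics
rows = con.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(rows)  # includes an entry for max_frag: row count and rows-per-distinct-prefix
```

Deleting these rows (or never running ANALYZE) leaves the planner guessing from defaults, which is what the quoted paragraph warns about.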

(5) By bepis (bbepis) on 2022-09-06 10:49:20

Yes, I've run ANALYZE, and it improved the indexed query to 180s. That's still nowhere near as good as a raw table scan, though.

(3) By Chris Locke (chrisjlocke1) on 2022-09-06 09:13:07 in reply to 1

CREATE TABLE frag(hash primary key, type, ulen, clen, offset, refs);

This doesn't create a nice table. What field types are these? Text? Integers? Is 'offset' a float, or just a '0' or '1'?

(6) By bepis (bbepis) on 2022-09-06 10:58:16 in reply to 3

The actual schema is this:

CREATE TABLE frag(hash TEXT primary key, type NUMBER, ulen NUMBER, clen NUMBER, offset NUMBER, refs NUMBER);

All of the numbers are unsigned 32-bit integers; none are floats or simple boolean 0/1 flags.

sqlite> SELECT * FROM frag LIMIT 5;
hash                      type      ulen  clen  offset  refs
------------------------  --------  ----  ----  ------  ------
BLRXW6H6W4PSVHLXUAO5JZZF  72016144  1217  1154  0       1
DE2GOS2O4G5SZT5ONV32PREM  72016144  1663  1568  1196    105982
QQDNYYEGMCA62DUE22CH5SVF  54137909  346   356   0       1
54BN7R5BOGGPBC27NIXHUBE2  11720834  879   814   0       1
4PVPQGCUNEP65MUAKAKQRDQK  11720834  2090  1872  856     1

The types were not declared in the original schema because this .db file was provided to me by a third party; I am only reading from and inserting into it.

(4.2) By Ryan Smith (cuz) on 2022-09-06 09:56:10 edited from 4.1 in reply to 1

I suppose it is because everywhere you look people advise you to use the Index, since it is faster.

Only problem is, it isn't - not always. An index is a specific tool to enable faster look-ups, but it is much slower than a table scan when most rows must be visited anyway.

Others and I have explained this so many times that I feel we should put up a page somewhere to link to for future reference - but since that doesn't exist yet, here goes:

A database table (especially in SQLite) is in itself an index - the fastest possible index, because traversing it requires no additional lookup steps. However, it may not be in the order that best serves a given query type. For those cases, adding an index speeds up the lookup of the specific information (by which that index is organized) so much that, even though you pay for both the index look-up and the subsequent look-up into the table, you still come out ahead: two indexed look-ups are faster than one walk of the entire data.

In a practical example, imagine you had a numbered list of names (the numbers being the Primary Key), ordered by last-name, then first-name. If someone asked you to go to entry number 347, that would be very easy and very fast, since you understand number-order. If they then asked you to list all the people with "Smith" as last-name, that would be easy too, because you understand alphabetical order. BUT, if asked to list all the people with the first-name "John", that would be less easy: you would have to go through the entire list and read every entry to see if it contains a "John".

At this point the smart DB admin creates an index on "first-name". To find the "John" names, you can now go directly to "John" in this indexed list of first-names; once you locate the first entry for John, you see that it can be found at number 72 on your main list, and sure enough, looking up 72 in the main list yields a record like "72, Anderson, John". Then move on to the next "John", and so on.

The thing to note for the above is that the indexed lookup requires TWO look-up operations: first look up "John" in the index to find the key, then look up "72" in the table to find the data. So it takes nearly twice the look-up time (depending on how much smaller the first-name index is than the full table).

For this reason, as soon as the query planner has reason to believe that a query will need to visit more than a certain percentage of the main table's rows (let's call it 75%+ - a simplified example; the actual figure depends on the specific DB engine's heuristics and isn't the only factor considered), it will opt to simply do the table scan, since avoiding the double look-up starts making sense.

It can however be wrong about that, in both directions, because of wildly unbalanced data:

  • imagine if 95%+ of the first-names in the table were "John": for the query mentioned above, the table scan would be much faster than using the index.
  • or imagine the query was to find all rows where "first-name >= 'C'". This seems a great case for a table scan (and normally it would indeed be much faster) - unless all the names in the database start with 'A' or 'B', which the very first index look-up would have revealed instantly.

Conclusion:

  1. The Indexed look-up can actually be slower, no mystery there.
  2. Without knowing the exact schema, size AND data-shape of the DB, it is hard for us to guess why the DB engine prefers the index look-up over the table scan or vice versa.
  3. Running "ANALYZE" on the table in question will help the query planner know the data shape and make better decisions.
  4. Never think that an index is some form of magic whose addition will always make things faster; in fact, indexes slow things down, except in the specific case where their data-ordering benefits a specific look-up.
  5. Indexes make inserting much slower too, so best to avoid any unnecessary Index.
  6. Where an Index IS useful, it should absolutely be added to assist the query planner/engine, especially if the queries which it assists are run often.
  7. An index may also be useful for speeding up the ordering of output by its indexed field(s) in an ORDER BY clause, even if it is not used in the FROM/JOINs directly.
  8. Armed with the above knowledge, you (the DB admin) can better instruct the DB engine on how to run a specific hefty query; in SQLite you can also provide hints by prefixing terms with a unary "+" and by using the "likely()" or "likelihood()" functions (a quick docs search on the SQLite site will reveal more information on those).

Hope that is clear.

PS: I think I should actually make this into a page/post somewhere and maybe expand on the QP assisting features - anyone else here with added thoughts on the matter that should also be mentioned, please kindly add it.
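
The double look-up point above can be seen directly from EXPLAIN QUERY PLAN. A toy sketch (my own table names and made-up data, not from the thread):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE people(id INTEGER PRIMARY KEY, last TEXT, first TEXT)")
con.execute("CREATE INDEX people_first ON people(first)")
con.executemany(
    "INSERT INTO people(last, first) VALUES (?, ?)",
    [("Smith", "John" if i % 100 == 0 else f"Name{i}") for i in range(10_000)])
con.execute("ANALYZE")

def plan(sql):
    # each EXPLAIN QUERY PLAN row carries its plan text in the last column
    return " | ".join(row[3] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Selective filter: the planner does the index look-up, then the table look-up.
p1 = plan("SELECT * FROM people WHERE first = 'John'")
# No useful filter: the planner walks the table once instead.
p2 = plan("SELECT * FROM people")
print(p1)
print(p2)
```

The first plan searches via people_first (two look-ups per matching row); the second is a plain scan, because the index would only add work.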

(7) By bepis (bbepis) on 2022-09-06 11:12:58 in reply to 4.2

I've read this post however I don't think this applies to my specific case for a number of reasons:

  1. I am running an aggregate instead of finding a specific row to return data.
  2. Every column (and expression) I use in my query is in the index. There should be no reason whatsoever for my query to fall back to a table scan, because there is no new information to fetch from the table.

I've posted a typed schema and example dataset in an above reply for this table.

The index should be perfect for this: I'm asking SQLite for the largest value of offset+clen for each specific type, and with a multi-column index on (type, offset+clen DESC), all SQLite should need to do is jump to each type's section in the index and grab the first entry of the second column there (which is the MAX).

In other words, roughly one index seek per type, rather than a read of every row.


In fact I can prove that an index fixes the specific issue I'm trying to solve.

For example, instead of using a GROUP BY type I change it to this:

SELECT MAX(offset+clen) AS 'Offset', type AS 'Type' FROM frag WHERE type = 0;

With the same CREATE INDEX max_frag ON frag(type, offset+clen DESC); index, that query returns instantly; if I drop the index and run it again it takes roughly 30 seconds.

So my code runs a lot faster if I script my backend to issue a separate WHERE query for each specific type.

This smells like an issue with how SQLite handles GROUP BY rather than anything on my end. Even running ANALYZE doesn't fix it.
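
The per-type workaround can be sketched like this (toy data; table and index names from the thread):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE frag(hash TEXT PRIMARY KEY, type, ulen, clen, offset, refs)")
con.execute("CREATE INDEX max_frag ON frag(type, offset+clen DESC)")
con.executemany("INSERT INTO frag VALUES (?, ?, 0, ?, ?, 1)",
                [(f"h{i}", i % 3, i % 7, i) for i in range(100)])

# One indexed MAX per type, instead of a single GROUP BY over the whole table.
results = {}
for (t,) in con.execute("SELECT DISTINCT type FROM frag"):
    (end_pos,) = con.execute(
        "SELECT MAX(offset + clen) FROM frag WHERE type = ?", (t,)).fetchone()
    results[t] = end_pos
print(results)
```

Each WHERE type = ? query lets the planner use the (type, offset+clen DESC) index directly; the results match the GROUP BY query.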

(8) By Chris Locke (chrisjlocke1) on 2022-09-06 11:55:37 in reply to 7

Out of interest, what version of SQLite are you using? The wacky index-choosing issue together with 'group by' sounds like another recent issue, but could be totally unrelated...

(9) By bepis (bbepis) on 2022-09-06 12:25:11 in reply to 8

I was on 3.33.0, however I've updated to 3.39.2 and I'm still encountering the same issue.

(10) By Ryan Smith (cuz) on 2022-09-06 13:54:25 in reply to 7

I've read this post however I don't think this applies to my specific case

  1. I am running an aggregate instead of finding a specific row to return data.

Perhaps you've missed the point I was trying to make. That is exactly the kind of case that would be MUCH faster using a table scan rather than the index look-up, precisely because you are not doing a look-up but aggregating, which typically requires reading all rows.

Now you argue that this should not be the case, since SQLite could pick the first value of the second column within each group in the index - but how does it know that? It must still scan every group to completion to confirm that the first value is indeed the max value, for which a table scan is still faster. Your "demonstrated" query only runs faster because you explicitly told the query planner what you are looking for, by stating the WHERE clause filter in the exact same way you stated the index composition (so in this case it can put two and two together with reasoning that goes only one level deep).

As a human you can see that it is possible to simply look at the first value, but the query planner has many things to consider. What you are asking it to figure out on its own is that the components of an expression match the components of an index (which it can normally do), but here only the second part of the index; AND that the first part of the index is the discriminant for the WHERE clause (which is doable by itself); AND that the MAX() aggregate on those components implies the top item of a DESC ordering; AND that nothing else will influence the truth of it (such as collations on any/all of the components) - easy for a human to see, but several levels of reasoning deep for the query planner.

Perhaps, if a simple rule can be devised, this is an improvement request worth making - but it would have to be a rock-solid rule that won't ever cause the query planner to apply it to another query where, due to the many different considerations, it may return inaccurate results.

(11) By Simon Slavin (slavin) on 2022-09-07 13:43:26 in reply to 1

I think SQLite can't figure out the consequences of the addition - i.e. that the addition in your index is the same as the addition in your SELECT. Out of interest, you might try

CREATE TABLE frag(hash primary key, type, ulen, clen, offset, refs,
lastPosition INT GENERATED ALWAYS AS (offset + clen) VIRTUAL);

SELECT MAX(lastPosition) AS 'EndPos', type AS 'Type' FROM frag GROUP BY type;

then try adding an index on (type, lastPosition) and doing it again.

Does it get better, then better? Worse, then better? Worse and worse?
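
A sketch of this experiment (made-up data; the index name frag_last is my own; generated columns need SQLite 3.31+):

```python
import sqlite3  # the bundled SQLite must be 3.31+ for generated columns

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE frag(hash TEXT PRIMARY KEY, type, ulen, clen, offset, refs,
                      lastPosition INT GENERATED ALWAYS AS (offset + clen) VIRTUAL)
""")
con.execute("CREATE INDEX frag_last ON frag(type, lastPosition)")
con.executemany(
    "INSERT INTO frag(hash, type, ulen, clen, offset, refs) VALUES (?, ?, 0, ?, ?, 1)",
    [(f"h{i}", i % 2, i % 5, i) for i in range(50)])

# Whether the planner now walks the index is the experiment being proposed.
for row in con.execute("EXPLAIN QUERY PLAN "
                       "SELECT MAX(lastPosition) AS EndPos, type FROM frag GROUP BY type"):
    print(row[3])
```

The generated column gives the expression a name the planner can match against the index definition, which is the idea behind the suggestion.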

(12) By Chris Locke (chrisjlocke1) on 2022-09-07 14:47:43 in reply to 11

1 or 2? 1 or 2? 1? 2? 1 or 2?