Usage of suboptimum index

(1) By Hardy on 2020-07-30 15:06:27 [link] [source]

I have the following query:

SELECT
  Relations.IDSecondDatum,Games.Result
FROM
  Games,Relations,Classifications
WHERE
  (Games.BlackID=?1)AND
  (Relations.TypeFirstDatum==2)AND
  (Relations.IDFirstDatum=Games.ID)AND
  (Relations.TypeSecondDatum=3)AND
  (Relations.IDSecondDatum=Classifications.ID)AND
  (Classifications.System=?4)AND
  (Classifications.GameType=?5);

(Games.ID equivalent to Games.rowid, Classifications.ID equivalent to Classifications.rowid)

And these indices on the Relations table:

CREATE INDEX Relations_Index1 ON Relations (IDFirstDatum,IDSecondDatum,TypeFirstDatum,TypeSecondDatum);
CREATE INDEX Relations_Index2 ON Relations (TypeFirstDatum,TypeSecondDatum,IDFirstDatum,IDSecondDatum);
CREATE INDEX Relations_Index3 ON Relations (IDSecondDatum,IDFirstDatum,TypeSecondDatum,TypeFirstDatum);
CREATE INDEX Relations_Index4 ON Relations (TypeSecondDatum,TypeFirstDatum,IDSecondDatum,IDFirstDatum);

The stats for the indices are:

Relations|Relations_Index4|80965 80965 80965 40 1
Relations|Relations_Index3|80965 40 1 1 1
Relations|Relations_Index2|80965 80965 80965 1 1
Relations|Relations_Index1|80965 1 1 1 1

Why does the query planner like to use Relations_Index2?

QUERY PLAN
|--SEARCH TABLE Games USING COVERING INDEX Games_Index2 (BlackID=?)
|--SEARCH TABLE Relations USING COVERING INDEX Relations_Index2 (TypeFirstDatum=? AND TypeSecondDatum=? AND IDFirstDatum=?)
`--SEARCH TABLE Classifications USING INTEGER PRIMARY KEY (rowid=?)

Is Relations_Index1 not the most efficient one?

(2) By Dan Kennedy (dan) on 2020-07-30 16:01:14 in reply to 1 [source]

Is Relations_Index1 not the most efficient one?

It might be. We need the rest of the schema to know.

To use Relations_Index1 efficiently requires a key value for IDSecondDatum. Which means moving table Relations to the innermost join of the QUERY PLAN. Which means we can't use the extremely efficient rowid index for table Classifications. What's the alternative to that? Are there indexes on Classifications.System or Classifications.GameType?

Dan.

(3) By Hardy on 2020-07-30 21:53:24 in reply to 2 [link] [source]

Hi Dan,

I do not think that using the table Relations has to be the innermost join. The order should be as the query plan mentioned but I thought that this query plan is the most efficient one:

SEARCH TABLE Games USING COVERING INDEX Games_Index2 (BlackID)=?) SEARCH TABLE Relations USING COVERING INDEX Relations_Index1 (IDFirstDatum) SEARCH TABLE Classifications USING INTEGER PRIMARY KEY (rowid=?)

because the first search returns the rowIDs of Games, Relations_Index1's statistics indicate that for each Games.rowid there is at most one match. This index has as the second key the IDSecondDatum. Then, also the classifications table can be checked using the specified index using Classification's rowid.

Probably, I just do not understand how SQLite internally works...

Regards, Hardy

(4) By Richard Hipp (drh) on 2020-07-30 22:08:02 in reply to 1 [link] [source]

You can force SQLite to use whatever index you want using the INDEXED BY clause on the query. This is not recommended in production, but is useful for debugging and analysis.

So, have you tried your query using each of the various indexes to see which version runs fastest?

(5) By Dan Kennedy (dan) on 2020-07-31 11:19:30 in reply to 3 [link] [source]

I see. Managed to miss that. You're correct of course, according to your stats:

    Relations|Relations_Index2|80965 80965 80965 1 1
    Relations|Relations_Index1|80965 1 1 1 1

both of these:

    WHERE TypeFirstDatum=? AND TypeSecondDatum=? AND IDFirstDatum=?
    WHERE IDFirstDatum=?

should return on average 1 row. Technically it's "on average", not "at most", so as far as SQLite can tell there might be some values that match more than one key.

So SQLite calculates the cost of using each index as pretty much equal. However, it has a heuristic that says that if one index uses a superset of the WHERE terms used by another, it must be better:

https://www.sqlite.org/src/artifact/2ea911238674e?ln=1999..2010

You could argue that this heuristic is counter-productive - if both indexes will match a single row, why not use the shorter one and save a little CPU doing key-comparisons (also, in other cases but not this one, you might save some IO as shorter indexes take up less space on disk)? I think the answer is that (a) those stats are only averages, not hard guarantees and (b) the stats might be out of date - they were current when ANALYZE was run but things might have changed since. So SQLite figures using the index that covers more fields is a better bet.

Dan.

(6.1) By Keith Medcalf (kmedcalf) on 2020-07-31 17:28:50 edited from 6.0 in reply to 5 [link] [source]

You could argue that this heuristic is counter-productive - if both indexes will match a single row, why not use the shorter one and save a little CPU doing key-comparisons (also, in other cases but not this one, you might save some IO as shorter indexes take up less space on disk)? I think the answer is that (a) those stats are only averages, not hard guarantees and (b) the stats might be out of date - they were current when ANALYZE was run but things might have changed since. So SQLite figures using the index that covers more fields is a better bet.

Well, actually, no, you could not argue that.

Those conditions must still be checked and passed before you can descend into the next table. If the values are not in the index then additional IO needs to be done to seek and fetch them from the row.

Therefore having more where clause conditions consumed in an index lookup will always be lower cost than consuming less where clause conditions in the index because those conditions must still be enforced.

This is why a "covering index" is more efficient than a "non-covering" index. That applies to where conditions, select list outputs, and descent keys as well.

(7) By Dan Kennedy (dan) on 2020-07-31 18:12:23 in reply to 6.1 [link] [source]

Quite so. The heuristic is only used if, as in this case, both indexes are either covering or non-covering. More completely:

http://www.sqlite.org/src/artifact/2ea911238674e?ln=1956-1960

(8.1) By Hardy on 2020-07-31 22:25:06 edited from 8.0 in reply to 4 [link] [source]

Hi,

yes,

Relations_Index1: 0.0177s Relations_Index2: 99.35s

Index2/Index1 = 5613!

Index1 must be the fastest because there is at most one match (the statistics are matching exactly in my case).

Regards, Hartwig

PS: Is there also a possibility to enforce the order of the index usage?

(9) By Hardy on 2020-07-31 22:08:10 in reply to 5 [link] [source]

Hi,

as far as I understand how SQLite works is that both indices are leading to exactly one match but for Index2 SQLite needs in the worst case 80965 (or log(80965)?) iterations to find this out. While for Index1 there is only at most one iteration.

Regards, Hardy

(10) By Richard Hipp (drh) on 2020-07-31 23:05:50 in reply to 8.1 [link] [source]

Is there also a possibility to enforce the order of the index usage?

I don't know what that means? How is the usage of indexed "ordered"?

(11) By Richard Hipp (drh) on 2020-07-31 23:16:24 in reply to 9 [link] [source]

That's not what the stat1 data you showed us means.

The stat1 data for Index2 is "80965 80965 80965 1 1". The first number is the number of entries in that index. The second number is the number of rows that you would expect to select using an equality constraint on just the first term. The third number is the number of rows you would expect to select using equality constraints on the first two terms. The fourth number is the number of rows you would expect to select using equality constraints on the first three terms.

The fourth number is 1, so we expect that the equality constraints on the first three terms of the index will narrow down the search to just 1 row.

Since you have equality constraints on the all four columns of both Index1 and Index2, SQLite expects that a single binary search against either of these indexes to narrow down to just one row.

I don't know why that isn't work for you.

Perhaps we could analyze the situation better if you supplied us with the complete schema for your database together with all of the stat1 data. Basically, we'd like to see the output of this command:

  .fullschema

It would also be helpful to see the complete output from the following four commands:

  EXPLAIN QUERY PLAN
  SELECT
     Relations.IDSecondDatum,Games.Result
  FROM
     Games,Relations,Classifications
   WHERE
     (Games.BlackID=?1)AND
     (Relations.TypeFirstDatum==2)AND
     (Relations.IDFirstDatum=Games.ID)AND
     (Relations.TypeSecondDatum=3)AND
     (Relations.IDSecondDatum=Classifications.ID)AND
     (Classifications.System=?4)AND
     (Classifications.GameType=?5);

  EXPLAIN QUERY PLAN
  SELECT
     Relations.IDSecondDatum,Games.Result
  FROM
     Games,Relations INDEXED BY Relations_Index1,Classifications
   WHERE
     (Games.BlackID=?1)AND
     (Relations.TypeFirstDatum==2)AND
     (Relations.IDFirstDatum=Games.ID)AND
     (Relations.TypeSecondDatum=3)AND
     (Relations.IDSecondDatum=Classifications.ID)AND
     (Classifications.System=?4)AND
     (Classifications.GameType=?5);

  SELECT sqlite_source_id();

  PRAGMA compile_options;

(12) By Richard Hipp (drh) on 2020-07-31 23:18:11 in reply to 9 [link] [source]

Here's a thought: Have you run ANALYZE recently? Are you sure the stat1 data is current? Please run ANALYZE again before you send us the output of the ".fullschema" command.

(13) By Hardy on 2020-08-01 09:25:51 in reply to 8.1 [link] [source]

Hi,

in my example the query plan starts with an index of the Games table, then with one of the Relations table and afterwards of the classification table. Is it possible to enforce this order or doesn't it matter?

Best regards, Hardy

(14) By Hardy on 2020-08-01 09:32:55 in reply to 8.1 [link] [source]

Hi,

sorry, I have to revise my answer. It seems to be that the evaluation order of the indices was changed when I tried to enforce the usage of certain indices. Now, I was able to enforce the best order and the differences are negligible but still Index1 is marginally faster: 0.020 against 0.023.

Regards, Hardy

(15) By Gunter Hick (gunter_hick) on 2020-08-03 05:51:32 in reply to 13 [link] [source]

Maybe what you are looking for is CROSS JOIN, which forces SQLite to keep the order of the tables as presented in the query.

(16) By Gunter Hick (gunter_hick) on 2020-08-03 06:03:00 in reply to 1 [link] [source]

Because IDSecondDatum is the rowid of the Classifications Table, which is the only item necessary to get from Relations to Classifications, and appears last in that index. The other three elements of the index are already known from Games x Relations.