MAX performance

(1) By anonymous on 2021-12-11 12:07:38 [link] [source]

Hi All,

I have a performance problem with a query involving MAX, where despite having a covering index, it gets progressively slower as more rows are inserted. I think I know why, but am just looking for another pair of eyes to confirm!

Schema:

CREATE TABLE events (
  id INTEGER PRIMARY KEY,
  stream TEXT NOT NULL CHECK(STREAM <> ''),
  data BLOB NOT NULL
);

CREATE INDEX events_stream_id ON events (stream, id DESC);

Query:

SELECT stream, MAX(id) AS id FROM events WHERE stream IN (?) GROUP BY stream;

Explain Query Plan:

SEARCH TABLE events USING COVERING INDEX events_stream_id (stream=?)

I think SQLite is scanning every row in the index instead of "simply" selecting the last. The query optimizer overview article: https://www.sqlite.org/optoverview.html#the_min_max_optimization talks about the "left-most" column. I suspect that the query optimizer is unable to utilise this optimization because my index has 2 columns.

My solution will be to create another table containing the max values, but just want to confirm I haven't missed anything.

Thanks in advance,

Ben

(2) By Keith Medcalf (kmedcalf) on 2021-12-11 12:39:43 in reply to 1 [link] [source]

Since you already know the stream, then why do you not simply do the following:

select max(id)
  from events
 where stream == ?
;

which will accept the same binding and return the only item of information that you do not already have.

(3) By Keith Medcalf (kmedcalf) on 2021-12-11 13:50:02 in reply to 1 [link] [source]

Using select stream, max(id) as id from events where stream == ?' will accept exactly the same bindings and return exactly the same result (except if you send a NULL parameter binding).

To answer your original question: You asked the Query Planner to solve a query that has a GROUP BY and an IN operator (even though the list has only one element, strangely enough). The query optimizer does not know that you asked a bad question thus gave you the answer to the question that you asked, even though it took much longer to answer because the question was bad.

(4) By anonymous on 2021-12-11 14:03:30 in reply to 2 [link] [source]

Thanks for the reply.

Ah, I often execute this query with more than one parameter in the IN clause, making the stream returned as part of the select more useful!

Would I be better executing a separate query for each stream? Would that take advantage of the optimisation?

Thanks again!

Ben

(5) By curmudgeon on 2021-12-11 14:09:42 in reply to 1 [source]

To add to what Keith says, assuming you meant ? to be a list of streams then the following would work

select stream, (select max(id) from events where stream=list.stream) from list;

You could form the list as a temporary table or a carray

https://sqlite.org/carray.html#overview

(6) By Keith Medcalf (kmedcalf) on 2021-12-11 14:24:32 in reply to 4 [link] [source]

Yes, it would work better running the simpler select for each stream one at a time -- you only need to prepare the statement once, just bind one stream at a time to get the result. Alternatively you can send the parameters in a carray or in a temp table or something so that the (sub)query finding the max is executed by itself to guarantee that the minmax optimization is applied.

(7) By anonymous on 2021-12-11 14:32:11 in reply to 6 [link] [source]

Thank you both so much.

Didn’t expect to get such good advice so quickly. Very grateful.

Thanks,

Ben

(8) By Keith Medcalf (kmedcalf) on 2021-12-11 14:33:02 in reply to 6 [link] [source]

Example (note you extend the list differently but bind the same):

select column1 as stream,
       (
        select max(id)
          from events
         where stream == column1
       ) as id
  from (
        values (?)
       )
;

You make the list of values longer by appending ,(?) to the values line.