Efficient Min/Max using window functions?

(1) By Mark Lawrence (mark) on 2022-03-23 15:30:31 [link] [source]

A while ago someone was looking for a way of finding both the minimums and maximums of a set of values grouped by a second criteria that might be duplicated. Example source data:

CREATE TABLE T(
    date INTEGER,
    test CHAR (12)
  );

INSERT INTO T
VALUES
    ( 1, 'clim'),
    ( 3, 'clim'),
    ( 7, 'amb'),
    ( 10, 'amb'),
    ( 12, 'xxx'),
    ( 13, 'clim'),
    ( 15, 'clim'),
    ( 20, 'clim'),
    ( 22, 'amb'),
    ( 25, 'amb');

The desired result:

fromdate  enddate  test
--------  -------  ----
1         3        clim
7         10       amb
12        12       xxx
13        20       clim
22        25       amb

And the best suggestion from the thread on how to achieve that:

CREATE INDEX t_x ON t(date);

WITH x AS (
    SELECT
        t.date AS date,
        t.test AS test,
        max(t2.date) AS key2
    FROM t
    LEFT JOIN t t2
    ON t2.date < t.date
        AND t2.test <> t.test
    GROUP BY
        t.test,
        t.date
  )
SELECT
    min(date) AS fromdate,
    max(date) AS enddate,
    test
FROM x
GROUP BY
    x.key2,
    x.test
ORDER BY
    fromdate,
    enddate,
    test;

(It wasn't exactly the above code - I think it used sub-selects instead of the common table expression).

I guess the index makes the self-join efficient, but I was wondering if window functions could be used to eliminate that. I came up with this:

WITH
    a AS (
        SELECT
            date,
            test,
            lag(test) OVER ( ORDER BY date) IS NOT test AS la
        FROM t
      ),
    b AS (
        SELECT
            date,
            test,
            sum(la) OVER ( ORDER BY date) AS s
        FROM a
      )
SELECT
    min(date),
    max(date),
    test
FROM b
GROUP BY s
ORDER BY 1;

The query plan indicates that this only scans the index, but it does require several more coroutines and an extra sort. Is the trade-off likely to be worth it? Perhaps some experienced minds on the forum who might be inspired to comment on suitability or improvements - any takers?

This is purely a theoretical interest-based exercise: I don't have any real data or benchmarks to run.

(2.2) By Rico Mariani (rmariani) on 2022-03-24 04:25:56 edited from 2.1 in reply to 1 [source]

Custom aggregations are often expensive in SQL and cheap elsewise. If you were doing this a lot a UDF for the aggregation would probably be worth your while. Even with window functions I found this cumbersome and you should be able to do this easily in one pass over the data. Whatever you're using to read from SQLite is probably better equipped to do this. This is CG/SQL for instance.

-- printf is some varargs thing, just call it
declare proc printf no check;

create proc thingy ()
begin
  create table t(
    date integer,
    test text
  );

  insert into t
  values
    ( 1, 'clim'),
    ( 3, 'clim'),
    ( 7, 'amb'),
    ( 10, 'amb'),
    ( 12, 'xxx'),
    ( 13, 'clim'),
    ( 15, 'clim'),
    ( 20, 'clim'),
    ( 22, 'amb'),
    ( 25, 'amb');

   let first_row := true;
   declare c cursor for select * from t;
   declare _min cursor like c;
   declare _prev cursor like c;
   loop fetch c
   begin
     if first_row then
       fetch _min from c;
       fetch _prev from c;
       set first_row := false;
       continue;
     end if;

     if c.test = _prev.test then
       fetch _prev from c;
       continue;
     end if;

     -- found a max, print and reset
     call printf("%d %d %s\n", _min.date, _prev.date, _prev.test);
     fetch _min from c;
     fetch _prev from c;
   end;

   -- if anything pending, print and reset
   if not first_row then
     call printf("%d %d %s\n", _min.date, _prev.date, _prev.test);
   end if;
end;

Any kind of approach like this is gonna be way cheaper than indexing or sorting or whatever... especially if it's not a toy amount of data.