Efficient Min/Max using window functions?
(1) By Mark Lawrence (mark) on 2022-03-23 15:30:31 [link] [source]
A while ago someone was looking for a way of finding both the minimums and maximums of a set of values grouped by a second criteria that might be duplicated. Example source data:
CREATE TABLE T(
date INTEGER,
test CHAR (12)
);
INSERT INTO T
VALUES
( 1, 'clim'),
( 3, 'clim'),
( 7, 'amb'),
( 10, 'amb'),
( 12, 'xxx'),
( 13, 'clim'),
( 15, 'clim'),
( 20, 'clim'),
( 22, 'amb'),
( 25, 'amb');
The desired result:
fromdate enddate test
-------- ------- ----
1 3 clim
7 10 amb
12 12 xxx
13 20 clim
22 25 amb
And the best suggestion from the thread on how to achieve that:
CREATE INDEX t_x ON t(date);
WITH x AS (
SELECT
t.date AS date,
t.test AS test,
max(t2.date) AS key2
FROM t
LEFT JOIN t t2
ON t2.date < t.date
AND t2.test <> t.test
GROUP BY
t.test,
t.date
)
SELECT
min(date) AS fromdate,
max(date) AS enddate,
test
FROM x
GROUP BY
x.key2,
x.test
ORDER BY
fromdate,
enddate,
test;
(It wasn't exactly the above code - I think it used sub-selects instead of the common table expression).
I guess the index makes the self-join efficient, but I was wondering if window functions could be used to eliminate that. I came up with this:
WITH
a AS (
SELECT
date,
test,
lag(test) OVER ( ORDER BY date) IS NOT test AS la
FROM t
),
b AS (
SELECT
date,
test,
sum(la) OVER ( ORDER BY date) AS s
FROM a
)
SELECT
min(date),
max(date),
test
FROM b
GROUP BY s
ORDER BY 1;
The query plan indicates that this only scans the index, but it does require several more coroutines and an extra sort. Is the trade-off likely to be worth it? Perhaps some experienced minds on the forum who might be inspired to comment on suitability or improvements - any takers?
This is purely a theoretical interest-based exercise: I don't have any real data or benchmarks to run.
(2.2) By Rico Mariani (rmariani) on 2022-03-24 04:25:56 edited from 2.1 in reply to 1 [source]
Custom aggregations are often expensive in SQL and cheap elsewise. If you were doing this a lot a UDF for the aggregation would probably be worth your while. Even with window functions I found this cumbersome and you should be able to do this easily in one pass over the data. Whatever you're using to read from SQLite is probably better equipped to do this. This is CG/SQL for instance.
-- printf is some varargs thing, just call it
declare proc printf no check;
create proc thingy ()
begin
create table t(
date integer,
test text
);
insert into t
values
( 1, 'clim'),
( 3, 'clim'),
( 7, 'amb'),
( 10, 'amb'),
( 12, 'xxx'),
( 13, 'clim'),
( 15, 'clim'),
( 20, 'clim'),
( 22, 'amb'),
( 25, 'amb');
let first_row := true;
declare c cursor for select * from t;
declare _min cursor like c;
declare _prev cursor like c;
loop fetch c
begin
if first_row then
fetch _min from c;
fetch _prev from c;
set first_row := false;
continue;
end if;
if c.test = _prev.test then
fetch _prev from c;
continue;
end if;
-- found a max, print and reset
call printf("%d %d %s\n", _min.date, _prev.date, _prev.test);
fetch _min from c;
fetch _prev from c;
end;
-- if anything pending, print and reset
if not first_row then
call printf("%d %d %s\n", _min.date, _prev.date, _prev.test);
end if;
end;
Any kind of approach like this is gonna be way cheaper than indexing or sorting or whatever... especially if it's not a toy amount of data.