Aggregate function by multiple columns
(1) By Alexandr Burdiyan (burdiyan) on 2022-06-10 16:12:32 [link] [source]
I'm pretty sure the title of this thread doesn't explain clearly what I'm trying to do, but I couldn't find a better one.
I wonder if there is a way in SQLite to make a GROUP BY query sort the rows within each group and then select the first one in each group? I guess, that's only one possible way to do what I want, so I provide and example and the desired output to see if there's anything else I missed.
Given the following table:
CREATE TABLE vals (
clock INTEGER,
actor TEXT,
key TEXT,
value TEXT,
UNIQUE (key, clock, actor)
);
and the following data:
INSERT INTO vals VALUES (1, "alice", "title", "hello");
INSERT INTO vals VALUES (1, "bob", "title", "Hello World!");
INSERT INTO vals VALUES (3, "alice", "title", "Last Title!");
INSERT INTO vals VALUES (3, "carol", "title", "Carol Title!");
INSERT INTO vals VALUES (3, "carol", "subtitle", "Carol SubTitle!");
INSERT INTO vals VALUES (4, "alice", "title", "The Final Last Title!");
I want to group rows by key, and select the latest value within each group. Only one resulting row per key. To check which value is the latest one it should order the grouped values as clock DESC, actor DESC
.
Intuitively I think of it as a MAX
aggregation but checking multiple columns.
I tried doing it using window function, but it ends up a bit weird, because semantically what I want feels more like an aggregation than windowing.
This is the windowing query I used:
SELECT DISTINCT
key,
first_value(value) over win,
first_value(clock) over win,
first_value(actor) over win
FROM vals
WINDOW win AS (PARTITION BY key ORDER BY clock DESC, actor DESC);
I also tried using HAVING
in the group by like so:
SELECT
key,
value,
clock,
actor
FROM vals
GROUP BY key HAVING clock = MAX(clock) OR actor = MAX(actor);
but it doesn't return the correct result.
I assume this could be implemented using a custom aggregate function, but I wonder if there's a simpler way using plain SQL.
(2) By David Raymond (dvdraymond) on 2022-06-10 16:37:09 in reply to 1 [link] [source]
Something like this? select clock, actor, key, value from ( select *, rank() over (partition by key order by clock desc, actor desc) as filter_on_this from vals ) as foo where filter_on_this = 1; clock|actor|key|value 3|carol|subtitle|Carol SubTitle! 4|alice|title|The Final Last Title!
(4.1) By Alexandr Burdiyan (burdiyan) on 2022-06-10 18:42:56 edited from 4.0 in reply to 2 [link] [source]
Thanks a lot! This does the job and looks more clear than my windowing approach.
(3) By David Raymond (dvdraymond) on 2022-06-10 16:48:38 in reply to 1 [link] [source]
...and a reminder to use single quotes for string literals, as that errors out when double quoted strings are compiled off as recommended.
(5) By Alexandr Burdiyan (burdiyan) on 2022-06-10 18:42:35 in reply to 3 [link] [source]
Thanks! I always forget about it when writing queries manually, and my sqlite3 shell doesn't complain about them so I leave them like that :)
(6) By Alexandr Burdiyan (burdiyan) on 2022-06-12 12:42:18 in reply to 1 [source]
I think I came up with an idea of a simpler query. Assuming value of clock will never be larger than 8 digits (or any other size), I could format (clock, actor) as a sortable string and use it in MAX
aggregate function.
This is what I ended up with:
SELECT *, MAX(printf('%08d-%s', clock, actor))
FROM vals
GROUP BY key;