Basic clustering of data
(1) By anonymous on 2024-11-29 09:04:53 [source]
I would like to automatically cluster dates together. If the dates are further than 5 days apart, I'd like them to form a new group.
create temp table some_data(action DATE);
-- First set
insert into temp.some_data VALUES ('2000-01-01');
insert into temp.some_data VALUES ('2000-01-01');
insert into temp.some_data VALUES ('2000-01-02');
insert into temp.some_data VALUES ('2000-01-04');
-- Second set - note this crosses months
insert into temp.some_data VALUES ('2000-01-31');
insert into temp.some_data VALUES ('2000-02-01');
insert into temp.some_data VALUES ('2000-02-02');
--Third set
insert into temp.some_data VALUES ('2000-03-01');
insert into temp.some_data VALUES ('2000-03-02');
If I try a simple group by the month:
SELECT
MIN(my_date) as start_date,
MAX(my_date) as end_date
FROM
temp.some_data
GROUP BY
DATE(my_date, 'start of month');
Result:
start_date end_date
2000-01-01 2000-01-31
2000-02-01 2000-02-02
2000-03-01 2000-03-02
But the desired outcome is:
start_date end_date
2000-01-01 2000-01-04 -- this stops on the 4th
2000-01-31 2000-02-02 -- this crosses months
2000-03-01 2000-03-02
Conceptually this seems simple, and I've tried playing WITH RECURSIVE and CASE, which I figure should do it, but just ended up with a (often infinite) mess.
Any thoughts?
(2) By Mark Lawrence (mark) on 2024-11-29 09:57:26 in reply to 1 [link] [source]
Hah, interesting - this is almost the same situation as another recent thread1, which I will borrow on heavily to propose the following:
create temp table some_data(action DATE);
-- First set
insert into temp.some_data VALUES ('2000-01-01');
insert into temp.some_data VALUES ('2000-01-01');
insert into temp.some_data VALUES ('2000-01-02');
insert into temp.some_data VALUES ('2000-01-04');
-- Second set - note this crosses months
insert into temp.some_data VALUES ('2000-01-31');
insert into temp.some_data VALUES ('2000-02-01');
insert into temp.some_data VALUES ('2000-02-02');
--Third set
insert into temp.some_data VALUES ('2000-03-01');
insert into temp.some_data VALUES ('2000-03-02');
WITH
edge_detect as (
SELECT action,
julianday(action)-julianday(lag(action)
OVER (ORDER BY action)) <= 5 AS ingrp
FROM some_data
),
build_grouping AS (
SELECT action, COUNT(*) FILTER (WHERE ingrp IS NOT 1)
OVER (ORDER BY action,ingrp NULLS FIRST) AS g
FROM edge_detect
)
SELECT MIN(action) as Start, MAX(action) AS End
FROM build_grouping
GROUP BY g;
This time I don't think it can be done in a single stage.
Fossil Meta Query: link formatting doesn't seem to work in footnote labels?
- ^ https://sqlite.org/forum/forumpost/e670a1c69e
(3) By Mark Lawrence (mark) on 2024-11-29 10:13:04 in reply to 2 [link] [source]
This version is perhaps slightly more efficient by ordering the first SELECT, if my understanding of a new query optimization in the latest release is correct:
WITH
edge_detect as (
SELECT action,
julianday(action)-julianday(lag(action) OVER ()) <= 5 AS ingrp
FROM some_data
ORDER BY action
),
build_grouping AS (
SELECT action, COUNT(*) FILTER (WHERE ingrp IS NOT 1)
OVER (ROWS UNBOUNDED PRECEDING) AS g
FROM edge_detect
)
SELECT MIN(action) as Start, MAX(action) AS End
FROM build_grouping
GROUP BY g;
(4) By anonymous on 2024-12-04 09:51:26 in reply to 3 [link] [source]
Thanks Mark! That works like a charm.
On 3.46.1, against a table with about 20 million records and no index on this field, I can confirm the second (with the ORDER BY
) is faster than the first (150s versus ~210s).
A further optimisation is to change the very first SELECT action
(line 3 on both) into SELECT DATE(action)
, as only the date component is required. This saves 10s from the second version (down to 140s) and a big 40s from the first (down to ~170s).
Thanks again!
(7) By Mark Lawrence (mark) on 2024-12-04 16:00:52 in reply to 4 [link] [source]
Bo is right. The first lag needs to operate on sorted rows:
WITH
edge_detect as (
SELECT action, julianday(action) - julianday(
lag(action) OVER ( ORDER BY action)) <= 5 AS ingrp
FROM some_data
ORDER BY action
),
build_grouping AS (
SELECT action, COUNT(*) FILTER (WHERE ingrp IS NOT 1)
OVER (ROWS UNBOUNDED PRECEDING) AS g
FROM edge_detect
)
SELECT MIN(action) as Start, MAX(action) AS End
FROM build_grouping
GROUP BY g;
Honest question to minds less ignorant than mine: should
OVER (ROWS UNBOUNDED PROCEDING)
also require an explicit ORDER BY
to guarantee correct behaviour? Because if not, I don't understand why
OVER()
in the first doesn't operate on the output row order.
I'm not sure I agree anonymous with your use of date()
. I used
julianday()
to get day values that work with arithmatic. Have you
tested arithmatic on dates?
SELECT date('2024-03-05') - date('2024-03-01') AS Wrong;
SELECT julianday('2024-03-05') - julianday('2024-03-01') AS Right;
-- ┌───────┐
-- │ Wrong │
-- ├───────┤
-- │ 0 │
-- └───────┘
-- ┌───────┐
-- │ Right │
-- ├───────┤
-- │ 4.0 │
-- └───────┘
(8) By Mark Lawrence (mark) on 2024-12-04 16:13:56 in reply to 7 [link] [source]
Please ignore my comments regarding date()
. I obvious mistook where
you were applying it. Your example data only has dates but presumably
your actual data has more resolution.
(5.1) By Bo Lindbergh (_blgl_) on 2024-12-04 14:38:00 edited from 5.0 in reply to 3 [link] [source]
That version does the wrong thing when dates aren't unique. Try it on a database with the same sample data inserted twice and you'll see.
Edited to add: it also fails in a different way with pragma reverse_unordered_selects=1;
(6) By Bo Lindbergh (_blgl_) on 2024-12-04 14:00:05 in reply to 1 [link] [source]
In your example data, all clusters are associated with month breaks. If that holds for the real data as well, things could be done in a single step: just round to the nearest month break rather than the previous one.