Basic clustering of data

(1) By anonymous on 2024-11-29 09:04:53 [source]

I would like to automatically cluster dates together. If the dates are further than 5 days apart, I'd like them to form a new group.

create temp table some_data(action DATE);

-- First set
insert into temp.some_data VALUES ('2000-01-01');
insert into temp.some_data VALUES ('2000-01-01');
insert into temp.some_data VALUES ('2000-01-02');
insert into temp.some_data VALUES ('2000-01-04');

-- Second set - note this crosses months
insert into temp.some_data VALUES ('2000-01-31');
insert into temp.some_data VALUES ('2000-02-01');
insert into temp.some_data VALUES ('2000-02-02');

--Third set
insert into temp.some_data VALUES ('2000-03-01');
insert into temp.some_data VALUES ('2000-03-02');

If I try a simple group by the month:

	SELECT
			MIN(my_date) as start_date,
			MAX(my_date) as end_date
		 FROM
			temp.some_data
		 GROUP BY
			DATE(my_date, 'start of month');

Result:

start_date	end_date
2000-01-01	2000-01-31
2000-02-01	2000-02-02
2000-03-01	2000-03-02

But the desired outcome is:

start_date	end_date
2000-01-01	2000-01-04  -- this stops on the 4th
2000-01-31	2000-02-02  -- this crosses months
2000-03-01	2000-03-02

Conceptually this seems simple, and I've tried playing WITH RECURSIVE and CASE, which I figure should do it, but just ended up with a (often infinite) mess.

Any thoughts?

(2) By Mark Lawrence (mark) on 2024-11-29 09:57:26 in reply to 1 [link] [source]

Hah, interesting - this is almost the same situation as another recent thread¹, which I will borrow on heavily to propose the following:

create temp table some_data(action DATE);

-- First set
insert into temp.some_data VALUES ('2000-01-01');
insert into temp.some_data VALUES ('2000-01-01');
insert into temp.some_data VALUES ('2000-01-02');
insert into temp.some_data VALUES ('2000-01-04');

-- Second set - note this crosses months
insert into temp.some_data VALUES ('2000-01-31');
insert into temp.some_data VALUES ('2000-02-01');
insert into temp.some_data VALUES ('2000-02-02');

--Third set
insert into temp.some_data VALUES ('2000-03-01');
insert into temp.some_data VALUES ('2000-03-02');


WITH
    edge_detect as (
        SELECT action,
            julianday(action)-julianday(lag(action)
                OVER (ORDER BY action)) <= 5 AS ingrp
        FROM some_data
    ),
    build_grouping AS (
        SELECT action, COUNT(*) FILTER (WHERE ingrp IS NOT 1)
                OVER (ORDER BY action,ingrp NULLS FIRST) AS g
        FROM edge_detect
    )
SELECT MIN(action) as Start, MAX(action) AS End
FROM build_grouping
GROUP BY g;

This time I don't think it can be done in a single stage.

Fossil Meta Query: link formatting doesn't seem to work in footnote labels?

^{^} https://sqlite.org/forum/forumpost/e670a1c69e

(3) By Mark Lawrence (mark) on 2024-11-29 10:13:04 in reply to 2 [link] [source]

This version is perhaps slightly more efficient by ordering the first SELECT, if my understanding of a new query optimization in the latest release is correct:

WITH
    edge_detect as (
        SELECT action,
            julianday(action)-julianday(lag(action) OVER ()) <= 5 AS ingrp
        FROM some_data
        ORDER BY action
    ),
    build_grouping AS (
        SELECT action, COUNT(*) FILTER (WHERE ingrp IS NOT 1)
            OVER (ROWS UNBOUNDED PRECEDING) AS g
        FROM edge_detect
    )
SELECT MIN(action) as Start, MAX(action) AS End
FROM build_grouping
GROUP BY g;

(4) By anonymous on 2024-12-04 09:51:26 in reply to 3 [link] [source]

Thanks Mark! That works like a charm.

On 3.46.1, against a table with about 20 million records and no index on this field, I can confirm the second (with the ORDER BY) is faster than the first (150s versus ~210s).

A further optimisation is to change the very first SELECT action (line 3 on both) into SELECT DATE(action), as only the date component is required. This saves 10s from the second version (down to 140s) and a big 40s from the first (down to ~170s).

Thanks again!

(7) By Mark Lawrence (mark) on 2024-12-04 16:00:52 in reply to 4 [link] [source]

Bo is right. The first lag needs to operate on sorted rows:

WITH
    edge_detect as (
        SELECT action, julianday(action) - julianday(
            lag(action) OVER ( ORDER BY action)) <= 5 AS ingrp
        FROM some_data
        ORDER BY action
    ),
    build_grouping AS (
        SELECT action, COUNT(*) FILTER (WHERE ingrp IS NOT 1)
            OVER (ROWS UNBOUNDED PRECEDING) AS g
        FROM edge_detect
    )
SELECT MIN(action) as Start, MAX(action) AS End
FROM build_grouping
GROUP BY g;

Honest question to minds less ignorant than mine: should OVER (ROWS UNBOUNDED PROCEDING) also require an explicit ORDER BY to guarantee correct behaviour? Because if not, I don't understand why OVER() in the first doesn't operate on the output row order.

I'm not sure I agree anonymous with your use of date(). I used julianday() to get day values that work with arithmatic. Have you tested arithmatic on dates?

SELECT date('2024-03-05')           - date('2024-03-01') AS Wrong;
SELECT julianday('2024-03-05') - julianday('2024-03-01') AS Right;
-- ┌───────┐
-- │ Wrong │
-- ├───────┤
-- │ 0     │
-- └───────┘
-- ┌───────┐
-- │ Right │
-- ├───────┤
-- │ 4.0   │
-- └───────┘

(8) By Mark Lawrence (mark) on 2024-12-04 16:13:56 in reply to 7 [link] [source]

Please ignore my comments regarding date(). I obvious mistook where you were applying it. Your example data only has dates but presumably your actual data has more resolution.

(5.1) By Bo Lindbergh (_blgl_) on 2024-12-04 14:38:00 edited from 5.0 in reply to 3 [link] [source]

That version does the wrong thing when dates aren't unique. Try it on a database with the same sample data inserted twice and you'll see.

Edited to add: it also fails in a different way with pragma reverse_unordered_selects=1;

(6) By Bo Lindbergh (_blgl_) on 2024-12-04 14:00:05 in reply to 1 [link] [source]

In your example data, all clusters are associated with month breaks. If that holds for the real data as well, things could be done in a single step: just round to the nearest month break rather than the previous one.