SQLite Forum

detecting clumps
Login
I was looking for a SQL-only solution. I currently do not have enough extension building experience to pick the right solution. I did find a link to a related example from MS but left that alone for now. I have a small working example and it seems to work... with only two test cases:

```
drop table if exists sample;
create table sample(name,val);
insert into sample values ('ten', 10);
insert into sample values ('fifty', 50);
insert into sample values ('fifty', 50);
insert into sample values ('fifty', 50);
insert into sample values ('fifty', 50);
insert into sample values ('fifty', 50);
insert into sample values ('hundred', 100);
insert into sample values ('hundred', 101);
insert into sample values ('hundred', 102);
insert into sample values ('hundred', 103);
insert into sample values ('hundred', 150);
insert into sample values ('2hundred', 200);
insert into sample values ('2hundred', 225);
insert into sample values ('2hundred', 250);
insert into sample values ('3hundred', 300);
select * from sample;
select '';


-- k-mean
-- The algorithm works as follow:
-- Step 1: Choose groups in the feature plan randomly
-- Step 2: Minimize the distance between the cluster center and the different observations (centroid). It results in groups with observations
-- Step 3: Shift the initial centroid to the mean of the coordinates within a group.
-- Step 4: Minimize the distance according to the new centroids. New boundaries are created. Thus, observations will move from one group to another
-- Repeat until no observation changes groups


-- Step 1: Choose groups in the feature plan randomly
drop table if exists centers;
create table centers (k, center);
insert into centers (k,center)
select k, avg(val) as center from (select ntile(4) over r as k, val from sample window r as (order by val)) group by k ;
select * from centers order by k;


-- Step 2: Minimize the distance between the cluster center and the different observations (centroid). It results in groups with observations
-- Step 3: Shift the initial centroid to the mean of the coordinates within a group.
-- Step 4: Minimize the distance according to the new centroids. New boundaries are created. Thus, observations will move from one group to another
update centers as ce set center=n.center 
from (
        select k, avg(val) center 
        from (
                select * 
                from (
                        select c.k, s.val, abs(s.val-c.center) d, row_number() over r as rowno 
                        from sample s, centers c 
                        window r as (partition by s.val order by abs(s.val-c.center) )
                ) as x 
                where rowno=1
        ) as y
        group by k
) as n 
where ce.k=n.k;

select * from centers order by k;

-- print the results
select k, min(val) minval, max(val) maxval 
from (  
        select c.k, s.val, abs(s.val-c.center) d, row_number() over r as rowno
        from sample s, centers c 
        window r as (partition by s.val order by abs(s.val-c.center) )
) as x
where rowno=1
group by k
order by 1;
```

In step 1 above I used `ntile()` to select the "random" centroids. I updated the centers table with:

```
delete from centers;
insert into centers values (1,10),(2,50),(3,75),(4,100);
```

And after repeating steps 2-4 I go to the same set from the first run.

[a] this could be converted to a recursive CTE

[b] `k` was just a guess but maybe it should be slightly more deterministic

[c] and it needs more testing

thanks again