I was looking for a SQL-only solution. I currently do not have enough extension building experience to pick the right solution. I did find a link to a related example from MS but left that alone for now. I have a small working example and it seems to work... with only two test cases:
```
drop table if exists sample;
create table sample(name,val);
insert into sample values ('ten', 10);
insert into sample values ('fifty', 50);
insert into sample values ('fifty', 50);
insert into sample values ('fifty', 50);
insert into sample values ('fifty', 50);
insert into sample values ('fifty', 50);
insert into sample values ('hundred', 100);
insert into sample values ('hundred', 101);
insert into sample values ('hundred', 102);
insert into sample values ('hundred', 103);
insert into sample values ('hundred', 150);
insert into sample values ('2hundred', 200);
insert into sample values ('2hundred', 225);
insert into sample values ('2hundred', 250);
insert into sample values ('3hundred', 300);
select * from sample;
select '';
-- k-mean
-- The algorithm works as follow:
-- Step 1: Choose groups in the feature plan randomly
-- Step 2: Minimize the distance between the cluster center and the different observations (centroid). It results in groups with observations
-- Step 3: Shift the initial centroid to the mean of the coordinates within a group.
-- Step 4: Minimize the distance according to the new centroids. New boundaries are created. Thus, observations will move from one group to another
-- Repeat until no observation changes groups
-- Step 1: Choose groups in the feature plan randomly
drop table if exists centers;
create table centers (k, center);
insert into centers (k,center)
select k, avg(val) as center from (select ntile(4) over r as k, val from sample window r as (order by val)) group by k ;
select * from centers order by k;
-- Step 2: Minimize the distance between the cluster center and the different observations (centroid). It results in groups with observations
-- Step 3: Shift the initial centroid to the mean of the coordinates within a group.
-- Step 4: Minimize the distance according to the new centroids. New boundaries are created. Thus, observations will move from one group to another
update centers as ce set center=n.center
from (
select k, avg(val) center
from (
select *
from (
select c.k, s.val, abs(s.val-c.center) d, row_number() over r as rowno
from sample s, centers c
window r as (partition by s.val order by abs(s.val-c.center) )
) as x
where rowno=1
) as y
group by k
) as n
where ce.k=n.k;
select * from centers order by k;
-- print the results
select k, min(val) minval, max(val) maxval
from (
select c.k, s.val, abs(s.val-c.center) d, row_number() over r as rowno
from sample s, centers c
window r as (partition by s.val order by abs(s.val-c.center) )
) as x
where rowno=1
group by k
order by 1;
```
In step 1 above I used `ntile()` to select the "random" centroids. I updated the centers table with:
```
delete from centers;
insert into centers values (1,10),(2,50),(3,75),(4,100);
```
And after repeating steps 2-4 I go to the same set from the first run.
[a] this could be converted to a recursive CTE
[b] `k` was just a guess but maybe it should be slightly more deterministic
[c] and it needs more testing
thanks again