detecting clumps

(1) By rbucker on 2020-08-30 22:35:22 [link]

Most teachers grade on the traditional grading scale... 90-100 is an A, 80-89 is a B... and so on.

There are variations in grading which "they" call the curve... in one such mechanism it's called "clumping" or "clumping edge detection". Sadly I've only found the one reference and there is nothing about SQL...

It seems to me that there should be something ... In my case I'm doing grades but it would be quite useful in processing logfiles comparing the clumps using dates.

Thanks.

(hope I'm making sense)

(2) By Warren Young (wyoung) on 2020-08-31 06:00:37 in reply to 1 [link]

I couldn’t find any reference to clumped school grading like you describe, but I’d guess these teachers are reinventing [k-means clustering][1], badly. I’d expect to find a certain amount of subjectivity in actual practice, for example, even when a given tutorial implies it’s a rigorous method to assign grades.

This is an NP-hard algorithm, so I wouldn’t expect to find a SQL implementation. There are many implementations in other programming languages better suited to the task linked at the bottom of that article. 

[1]: https://en.wikipedia.org/wiki/K-means_clustering

(3) By rbucker on 2020-08-31 11:41:57 in reply to 2 [link]

That first sentence reads exactly as I imagine it would. Awesome!

The basic definition I had was based on a single interaction 35 or 36 years ago.

A more modern example would be something like an amazon product search by price where the price range(s) are in the sidebar. Something had to decide what the ranges are.

Thanks again.

(4) By David (dcf) on 2020-08-31 13:14:21 in reply to 1 [link]

Would one of the histogram algorithms or packages work for you?  (A Google search should turn up many of these for SQL, generally, and sqlite specifically.  ie, search for 'histogram sql' or 'histogram sqlite'.)  Many of these packages are flexible with regard to histogram 'boundaries'.

(5) By rbucker on 2020-08-31 14:21:36 in reply to 4

I was looking for a SQL-only solution. I currently do not have enough extension building experience to pick the right solution. I did find a link to a related example from MS but left that alone for now. I have a small working example and it seems to work... with only two test cases:

```
drop table if exists sample;
create table sample(name,val);
insert into sample values ('ten', 10);
insert into sample values ('fifty', 50);
insert into sample values ('fifty', 50);
insert into sample values ('fifty', 50);
insert into sample values ('fifty', 50);
insert into sample values ('fifty', 50);
insert into sample values ('hundred', 100);
insert into sample values ('hundred', 101);
insert into sample values ('hundred', 102);
insert into sample values ('hundred', 103);
insert into sample values ('hundred', 150);
insert into sample values ('2hundred', 200);
insert into sample values ('2hundred', 225);
insert into sample values ('2hundred', 250);
insert into sample values ('3hundred', 300);
select * from sample;
select '';


-- k-mean
-- The algorithm works as follow:
-- Step 1: Choose groups in the feature plan randomly
-- Step 2: Minimize the distance between the cluster center and the different observations (centroid). It results in groups with observations
-- Step 3: Shift the initial centroid to the mean of the coordinates within a group.
-- Step 4: Minimize the distance according to the new centroids. New boundaries are created. Thus, observations will move from one group to another
-- Repeat until no observation changes groups


-- Step 1: Choose groups in the feature plan randomly
drop table if exists centers;
create table centers (k, center);
insert into centers (k,center)
select k, avg(val) as center from (select ntile(4) over r as k, val from sample window r as (order by val)) group by k ;
select * from centers order by k;


-- Step 2: Minimize the distance between the cluster center and the different observations (centroid). It results in groups with observations
-- Step 3: Shift the initial centroid to the mean of the coordinates within a group.
-- Step 4: Minimize the distance according to the new centroids. New boundaries are created. Thus, observations will move from one group to another
update centers as ce set center=n.center 
from (
        select k, avg(val) center 
        from (
                select * 
                from (
                        select c.k, s.val, abs(s.val-c.center) d, row_number() over r as rowno 
                        from sample s, centers c 
                        window r as (partition by s.val order by abs(s.val-c.center) )
                ) as x 
                where rowno=1
        ) as y
        group by k
) as n 
where ce.k=n.k;

select * from centers order by k;

-- print the results
select k, min(val) minval, max(val) maxval 
from (  
        select c.k, s.val, abs(s.val-c.center) d, row_number() over r as rowno
        from sample s, centers c 
        window r as (partition by s.val order by abs(s.val-c.center) )
) as x
where rowno=1
group by k
order by 1;
```

In step 1 above I used `ntile()` to select the "random" centroids. I updated the centers table with:

```
delete from centers;
insert into centers values (1,10),(2,50),(3,75),(4,100);
```

And after repeating steps 2-4 I go to the same set from the first run.

[a] this could be converted to a recursive CTE

[b] `k` was just a guess but maybe it should be slightly more deterministic

[c] and it needs more testing

thanks again

(6) By rbucker on 2020-08-31 15:26:40 in reply to 5 [link]

I did some interesting testing...

```
-- TESTING
delete from centers; insert into centers values (1,10),(2,50),(3,75),(4,100);
delete from centers; insert into centers values (1,100),(2,75),(3,50),(4,10);
delete from centers; insert into centers values (1,100),(2,50),(3,75),(4,10);
delete from centers; insert into centers values (1,10),(2,20),(3,30),(4,40);
delete from centers; insert into centers values (1,1),(2,2),(3,3),(4,4);
```

In the last example where the centroids were `1,2,3,4` it only took 6 iterations to settle down.