SQLite Forum

How to group most similar values

(1) By Heribert (derhexer) on 2021-12-02 09:37:34

I have the following simple table: CREATE TABLE Sampled (ID INTEGER PRIMARY KEY AUTOINCREMENT, Value REAL)
This table will be filled with measurement values.
Now I want to know which values are the most similar to each other.

A 'Value' is similar to another if it lies within -5% to +5% of that other value.

e.g.
ID Value 
1  20.1
2  20.4
3  25.1
4  100.2
5  20.2

Result:
MostSimilar Average
3           20.23
1           25.1
1           100.2

Thanks for any help!

(2) By Heribert (derhexer) on 2021-12-02 10:01:25 in reply to 1

Found it...

SELECT Count(*) AS MostSimilar, AVG(A.Value) AS Average
  FROM Sampled A
  JOIN Sampled B ON A.Value >= (B.Value*0.95) AND A.Value < (B.Value*1.05)
 GROUP BY A.ID
 ORDER BY MostSimilar DESC
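
Because this groups by A.ID, AVG(A.Value) is simply each row's own value, so the query returns one row per ID rather than one per group. Against the sample data from post 1 it should produce something like this (tie order may differ); posts 3 and 4 below address both the fix and the deeper problem with the definition of "similar":

  -- MostSimilar |  Average
  -- ------------|---------
  --      3      |     20.1
  --      3      |     20.4
  --      3      |     20.2
  --      1      |     25.1
  --      1      |    100.2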

(3) By Ryan Smith (cuz) on 2021-12-02 10:24:28 in reply to 1

At last, an interesting question :)

This is one way of doing it. I've added some more information about the values, and also added a zero value to make sure the logic never breaks:

  -- SQLite version 3.35.4  [ Release: 2021-04-02 ]  on  SQLitespeed version 2.1.3.11.
================================================================================================

CREATE TABLE Sampled(ID INTEGER PRIMARY KEY AUTOINCREMENT, Value REAL);

INSERT INTO Sampled(ID, Value) VALUES
 (1,  20.1)
,(2,  20.4)
,(3,  25.1)
,(4, 100.2)
,(5,  20.2)
,(6,   0.0)
;

WITH avgs(values_in_group, values_average, values_min, values_max) AS (
	SELECT COUNT(*), AVG(B.Value), MIN(B.Value), MAX(B.Value)
	  FROM Sampled AS A
	  JOIN Sampled AS B ON (ABS(B.Value-A.Value)/NULLIF(A.Value,0)) < 0.05
	 GROUP BY A.ID
 )
 SELECT DISTINCT values_in_group, values_average, values_min, values_max
   FROM avgs
  ORDER BY values_in_group DESC
;


  -- values_in_group |    values_average|values_min|values_max
  -- ----------------|------------------|----------|----------
  --        3        |  20.2333333333333|      20.1|      20.4
  --        1        |              25.1|      25.1|      25.1
  --        1        |             100.2|     100.2|     100.2

Hope that is clear, but feel free to ask if anything isn't.
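
The zero value survives because of the NULLIF guard: for A.Value = 0 the division yields NULL, NULL < 0.05 is not true, and the row simply joins nothing. A minimal standalone check (not part of the post above):

-- NULL propagates through the division, and NULL < 0.05 is not true,
-- so a zero reference value matches nothing (not even itself):
SELECT (ABS(20.1 - 0.0) / NULLIF(0.0, 0)) < 0.05 AS is_similar;  -- NULL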

(5) By Heribert (derhexer) on 2021-12-02 10:31:41 in reply to 3

Thx. Mine didn't work correctly. Now I have to internalize how your perfectly working solution works.

Big Thanks.

(8) By Ryan Smith (cuz) on 2021-12-02 11:06:00 in reply to 5

If you need to internalize this and the CTE is in your way, here it is without the CTE and the extra fields, using a sub-query:

SELECT DISTINCT values_in_group, values_average
   FROM (
   	SELECT COUNT(*) AS values_in_group, AVG(B.Value) AS values_average
	  FROM Sampled AS A
	  JOIN Sampled AS B ON (ABS(B.Value-A.Value)/NULLIF(A.Value,0)) < 0.05
	 GROUP BY A.ID
	) AS avgs
  ORDER BY values_in_group DESC
;


  -- values_in_group |    values_average
  -- ----------------|------------------
  --        3        |  20.2333333333333
  --        1        |              25.1
  --        1        |             100.2

(13) By Heribert (derhexer) on 2021-12-02 13:03:55 in reply to 8

Thanks Ryan. The version with the additional min/max information is very handy - not least because I need those values.

(10.1) By Gunter Hick (gunter_hick) on 2021-12-02 11:46:53 edited from 10.0 in reply to 3

Can you improve on this, Ryan?

insert into sampled(value) values (20.1),(21.106),(22.16),(20.4),(25.1),(100.2),(20.2);

select a.id, a.value, count(*), avg(b.value), min(b.value), max(b.value), group_concat(b.id)
  from sampled a
  join sampled b on (abs(b.value-a.value)/nullif(a.value,0)) < 0.05
 group by a.id
 order by 3 desc;

id          value  count(*)  avg(b.value)  min(b.value)  max(b.value)  group_concat(b.id)
----------  -----  --------  ------------  ------------  ------------  ------------------
2           21.10  5         20.7932       20.1          22.16         1,2,3,4,7
4           20.4   4         20.4515       20.1          21.106        1,2,4,7
7           20.2   4         20.4515       20.1          21.106        1,2,4,7
1           20.1   3         20.233333333  20.1          20.4          1,4,7
3           22.16  2         21.633        21.106        22.16         2,3
5           25.1   1         25.1          25.1          25.1          5
6           100.2  1         100.2         100.2         100.2         6

(11.1) By Ryan Smith (cuz) on 2021-12-02 12:48:24 edited from 11.0 in reply to 10.1

Nope, that works well - other than adding the DISTINCT and not showing the initial ID and Value, which would prevent the rows for id 4 and id 7 (which are essentially duplicates) from both being shown.

(14) By Gunter Hick (gunter_hick) on 2021-12-02 13:16:40 in reply to 11.1

My point is that the statement yields a list of "candidate clusters" which can be combined in different ways - or sometimes cannot - to yield valid (i.e. each original point is in exactly one cluster) answers to the original question.

(1,2,3,4,7),(5),(6) has the cluster with the most members

(1,4,7),(2,3),(5),(6) has the "tightest spread"

(1,2,4,7),(3),(5),(6) has the most common cluster (found twice), but forces id 3 (value 22.16) into its own cluster, which isn't even on the candidate list. Sort of like getting ostracized in ancient Greece - most voters want you out.

(15) By anonymous on 2021-12-02 16:05:37 in reply to 3

Maybe:

SELECT
	s1.value,
	s2.value, 
	MAX( (s1.value/s2.value), (s2.value/s1.value) )*100.0 AS pct
FROM sampled s1
JOIN sampled s2 ON NOT(s1.id=s2.id)
WHERE pct<=105

The percentage is always expressed as a value above 100%; note this uses the non-aggregate (two-argument) version of the MAX function:

MAX( (s1.value/s2.value), (s2.value/s1.value) )*100.0 AS pct

Don't compare an item with itself:

JOIN sampled s2 ON NOT(s1.id=s2.id)

Only return the entries where one is no more than 5% larger than the other:

WHERE pct<=105

Returns:

20.1	20.4	101.492537313433
20.1	20.2	100.497512437811
20.4	20.1	101.492537313433
20.4	20.2	100.990099009901
20.2	20.1	100.497512437811
20.2	20.4	100.990099009901

Note: not dealing with NULLs explicitly.
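
In SQLite a division by zero also yields NULL, so zero values (like the one Ryan added) silently drop out of the pct filter as well. A hedged variant making that explicit - the NULLIF wrapping is an addition, not part of the post above:

SELECT
	s1.value,
	s2.value,
	-- NULLIF makes the divide-by-zero NULL explicit; such rows fail WHERE pct<=105:
	MAX( (s1.value / NULLIF(s2.value, 0)), (s2.value / NULLIF(s1.value, 0)) )*100.0 AS pct
FROM sampled s1
JOIN sampled s2 ON NOT(s1.id=s2.id)
WHERE pct<=105;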

(16) By anonymous on 2021-12-02 19:23:41 in reply to 15

Agreed with the comments here: the problem is open to several interpretations. The following (I think) groups the closest values together.

I added two outlier entries (value=100) - which show up in 'group 1' - with very different values: what they have in common is their 'closeness' to their neighbours.

As also noted on the forum already, items can belong to multiple groups.

Anyway, here's my effort - I'm sure this could be refactored as well.

WITH compare(s1_id, s2_id, s1_value, s2_value, pct) AS (
	SELECT
		s1.id AS s1_id,
		s2.id AS s2_id,
		s1.value AS s1_value,
		s2.value AS s2_value,
		100.0 * (MAX(s1.value, s2.value) / MIN(s1.value, s2.value)) AS pct
	FROM sampled s1
	JOIN sampled s2 ON NOT(s1.id=s2.id)
	WHERE pct<=105
),
remove_dups(mn_id, mx_id, mn_val, mx_val, pct) AS (
	SELECT
		MIN(s1_id, s2_id) AS mn_id,
		MAX(s1_id, s2_id) AS mx_id,
		MIN(s1_value, s2_value) AS mn_val,
		MAX(s1_value, s2_value) AS mx_val,
		pct
	FROM compare
	GROUP BY mn_id, mx_id
)
SELECT *, row_number() OVER w1 AS item_id, dense_rank() OVER w2 AS group_id
FROM remove_dups
WINDOW
	w1 AS (PARTITION BY pct RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
	w2 AS (ORDER BY pct RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
ORDER BY group_id;

Abridged output:

mn_id  mx_id  mn_val  mx_val  pct               item_id  group_id
-----  -----  ------  ------  ----------------  -------  --------
1      8      20.1    20.1    100.0             1        1
2      11     20.4    20.4    100.0             2        1
5      9      20.2    20.2    100.0             3        1
24     25     100.0   100.0   100.0             4        1
4      24     100.0   100.2   100.2             1        2
4      25     100.0   100.2   100.2             2        2
22     23     21.5    21.6    100.46511627907   1        3
21     22     21.4    21.5    100.467289719626  1        4
20     21     21.3    21.4    100.469483568075  1        5
19     20     21.2    21.3    100.471698113208  1        6
18     19     21.1    21.2    100.473933649289  1        7
17     18     21.0    21.1    100.47619047619   1        8
16     17     20.9    21.0    100.478468899522  1        9
15     16     20.8    20.9    100.480769230769  1        10
14     15     20.7    20.8    100.48309178744   1        11

[...]

INSERT INTO "Sampled" ("ID","Value") VALUES (1,20.1),
 (2,20.4),
 (3,25.1),
 (4,100.2),
 (5,20.2),
 (6,0.0),
 (7,20.0),
 (8,20.1),
 (9,20.2),
 (10,20.3),
 (11,20.4),
 (12,20.5),
 (13,20.6),
 (14,20.7),
 (15,20.8),
 (16,20.9),
 (17,21.0),
 (18,21.1),
 (19,21.2),
 (20,21.3),
 (21,21.4),
 (22,21.5),
 (23,21.6),
 (24,100.0),
 (25,100.0);

(17) By anonymous on 2021-12-02 19:46:14 in reply to 16

Nope: this arbitrarily assigns a specific exact 'closeness' (like 100.2) to pairs, rather than counting up within a range (which is presumably what is needed).

(4) By Gunter Hick (gunter_hick) on 2021-12-02 10:27:59 in reply to 1

The concept of "similar" is ill-defined.

asql> with val(x) as (values (20.1) , (21.106), (22.16)) select x*0.95 lower_limit,x*1.0 middle ,x*1.05 upper_limit from val;
lower_limit  middle  upper_limit
-----------  ------  -----------
19.095       20.1    21.105
20.0507      21.106  22.1613
21.052       22.16   23.268

20.1 thinks it should be alone in one group.
21.106 thinks they all belong together.
22.16 thinks 20.1 should be left out.

Even if changed to mean "the smaller value is at least 95% of the larger value" the relation "similar" is not transitive (a similar b AND b similar c does not imply a similar c).
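
The non-transitivity is easy to verify with the same three values, reading "similar" as "the smaller value is at least 95% of the larger" (a standalone check; row order may vary):

-- 20.1 ~ 21.106 and 21.106 ~ 22.16, yet 20.1 is NOT similar to 22.16:
WITH val(x) AS (VALUES (20.1), (21.106), (22.16))
SELECT a.x, b.x, MIN(a.x, b.x) >= MAX(a.x, b.x)*0.95 AS similar
  FROM val a JOIN val b ON a.x < b.x;
-- 20.1   | 21.106 | 1
-- 20.1   | 22.16  | 0
-- 21.106 | 22.16  | 1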

(6) By Heribert (derhexer) on 2021-12-02 10:33:19 in reply to 4

Yes - you are right. My solution isn't working correctly. Ryan's solution works as desired.

(7.2) By Ryan Smith (cuz) on 2021-12-02 10:52:50 edited from 7.1 in reply to 6

I think Gunter's point was more about questioning the mathematical correctness of your request than your solution. To make his point clear, here is my solution again, but with a lot more "similar" values - is this really what you expect to see?

 -- SQLite version 3.35.4  [ Release: 2021-04-02 ]  on  SQLitespeed version 2.1.3.11.
================================================================================================

CREATE TABLE Sampled(ID INTEGER PRIMARY KEY AUTOINCREMENT, Value REAL);

INSERT INTO Sampled(ID, Value) VALUES
 ( 1,  20.1)
,( 2,  20.4)
,( 3,  25.1)
,( 4, 100.2)
,( 5,  20.2)
,( 6,   0.0)
,( 7,  20.0)
,( 8,  20.1)
,( 9,  20.2)
,(10,  20.3)
,(11,  20.4)
,(12,  20.5)
,(13,  20.6)
,(14,  20.7)
,(15,  20.8)
,(16,  20.9)
,(17,  21.0)
,(18,  21.1)
,(19,  21.2)
,(20,  21.3)
,(21,  21.4)
,(22,  21.5)
,(23,  21.6)
;

WITH avgs(values_in_group, values_average, values_min, values_max) AS (
	SELECT COUNT(*), AVG(B.Value), MIN(B.Value), MAX(B.Value)
	  FROM Sampled AS A
	  JOIN Sampled AS B ON (ABS(B.Value-A.Value)/NULLIF(A.Value,0)) < 0.05
	 GROUP BY A.ID
 )
 SELECT DISTINCT values_in_group, values_average, values_min, values_max
   FROM avgs
  ORDER BY values_in_group DESC
;


  -- values_in_group |    values_average|values_min|values_max
  -- ----------------|------------------|----------|----------
  --        20       |            20.715|      20.0|      21.6
  --        19       |  20.6684210526316|      20.0|      21.5
  --        19       |  20.7526315789474|      20.1|      21.6
  --        18       |  20.6222222222222|      20.0|      21.4
  --        17       |  20.5764705882353|      20.0|      21.3
  --        17       |  20.8294117647059|      20.2|      21.6
  --        16       |          20.53125|      20.0|      21.2
  --        15       |  20.4866666666667|      20.0|      21.1
  --        15       |  20.9133333333333|      20.3|      21.6
  --        14       |  20.9571428571429|      20.4|      21.6
  --        13       |              20.4|      20.0|      20.9
  --        12       |             21.05|      20.5|      21.6
  --        11       |              21.1|      20.6|      21.6
  --        1        |              25.1|      25.1|      25.1
  --        1        |             100.2|    100.2|    100.2

Some values are similar (within 5%) to values above and below them, but not to values more than 5% away, so a lot of small overlapping groupings form, containing values from multiple sets of "close" values.

A better way to do this is to predefine the categories (say 1, 1.5, 2, 2.5, etc.) and then see how many values are "close" to each predefined category. If you want that, we can show how to do that too.
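
A minimal sketch of that binning idea, assuming a fixed bucket width of 0.5 (the width is an illustrative parameter, not something from the thread):

-- Snap each value to the nearest 0.5 and count per bucket;
-- every value lands in exactly one bucket, so no overlapping groups:
SELECT ROUND(Value*2)/2.0 AS bucket,
       COUNT(*)           AS members,
       AVG(Value)         AS average
  FROM Sampled
 GROUP BY bucket
 ORDER BY members DESC;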

(9) By Andreas Kupries (andreas-kupries) on 2021-12-02 11:34:19 in reply to 7.2

Generally this looks to me to be going towards "clustering" of values, of which there are lots of algorithms. At least this problem seems to be "only" one-dimensional.
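
For the one-dimensional case, one simple clustering that puts every point in exactly one cluster is "gap" clustering: sort the values and start a new cluster whenever a value jumps more than 5% above its predecessor. A sketch under those assumptions (window functions need SQLite 3.25+):

WITH flagged AS (
	-- Flag the start of a new cluster: a jump of more than 5% over the previous value
	SELECT ID, Value,
	       CASE WHEN Value > COALESCE(LAG(Value) OVER (ORDER BY Value), Value)*1.05
	            THEN 1 ELSE 0 END AS new_cluster
	  FROM Sampled
),
clustered AS (
	-- A running sum of the flags numbers the clusters
	SELECT ID, Value,
	       SUM(new_cluster) OVER (ORDER BY Value ROWS UNBOUNDED PRECEDING) AS cluster_id
	  FROM flagged
)
SELECT cluster_id, COUNT(*) AS members, AVG(Value) AS average,
       MIN(Value) AS min_value, MAX(Value) AS max_value
  FROM clustered
 GROUP BY cluster_id
 ORDER BY members DESC;

This fixes one particular greedy reading of "similar", which sidesteps the overlapping-groups problem Gunter and Ryan describe above.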

(12) By Heribert (derhexer) on 2021-12-02 12:59:15 in reply to 7.2

I have to check a group of temperature sensors (max. 30) to see whether they show similar values during a precheck in a production line. During this precheck it is impossible to get any reference to check against; in a later phase they are checked against a reference value.
So this solution works for me - but I also recognize the problem this algorithm has when analyzing sampled data.
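
An illustrative variation for that sensor precheck (an assumption, not something proposed in the thread): with no external reference, compare each sensor against the average of all sensors and flag the ones more than 5% away:

-- Flag sensors more than 5% away from the group average:
SELECT s.ID, s.Value,
       CASE WHEN ABS(s.Value - g.avg_value)/NULLIF(g.avg_value, 0) > 0.05
            THEN 'check' ELSE 'ok' END AS status
  FROM Sampled s
 CROSS JOIN (SELECT AVG(Value) AS avg_value FROM Sampled) g;

One badly broken sensor still skews the average, so a median or trimmed mean would be more robust if that matters.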