Query to find maximum number of values sharing a common key

(1) By John Dennis (jdennis) on 2022-02-09 08:12:26 [source]

I'm sorry, the title is probably not perfect which might explain my problem finding sample SQL queries.  

Here is my test data:

CREATE TABLE t1(rnd integer,line integer,pkey TEXT,primary key(rnd,line));
INSERT INTO t1 VALUES
   (1,1,'AAA'),(1,2,'BBB'),(1,3,'CCC'),(1,4,'DDD'),
   (2,1,'AAA'),(2,2,'EEE'),(2,3,'HHH'),(2,4,'DDD'),
   (3,1,'AAA'),(3,2,'FFF'),(3,3,'CCC'),(3,4,'HHH'),
   (4,1,'AAA'),(4,2,'CCC'),(4,3,'GGG'),(4,4,'DDD'),
   (5,1,'AAA'),(5,2,'GGG'),(5,3,'CCC'),(5,4,'DDD');

Where the "rnd" column identifies a set of data, each with four "pkey" values. You can see with a quick look that the maximum number of "pkey" cols which share a common "rnd" is 3 - AAA, CCC and DDD which each appear in rnds 1,3,4,5, so four occurrences. An expected answer might therefore be:

combine_cnt occur_cnt  rnds     pkeys
3           4          1,3,4,5  AAA,CCC,DDD

I don't really want just this top answer. It would be nice to be able to see the top # rows with an order by combine_cnt desc, occur_cnt desc.

In the real world the maximum number of combinations is not known, and neither is the number of times the combination might appear together. 

The real data has another two levels of key to be matched (more significant than rnd), there are 12 "lines" for each "rnd", and currently about 75,000 rows to consider.  But I am sure I can expand on any proffered solution. 

Any help gratefully appreciated.

(2) By Harald Hanche-Olsen (hanche) on 2022-02-09 08:43:43 in reply to 1 [link] [source]

Maybe I am slow, or something, but I can't quite follow your problem description. So let me tell you have I read it, and where it seems wrong:

the maximum number of "pkey" cols which share a common "rnd" is 3 - AAA, CCC and DDD

No, AAA, BBB, CCC, DDD all share a common “rnd”. That is four, not three. Admittedly, only one occurrence (rnd=1), but you did not ask for the largest number of occurrences?

which each appear in rnds 1,3,4,5, so four occurrences.

But DDD does not appear with rnd=3. Was that a typo, or I have misunderstood even more?

(3) By Gunter Hick (gunter_hick) on 2022-02-09 08:56:17 in reply to 1 [link] [source]

Your data matrix does not fit your description. All 5 rnd values have AAA in line 1.

(4) By John Dennis (jdennis) on 2022-02-09 09:06:33 in reply to 2 [link] [source]

No, AAA, BBB, CCC, DDD all share a common “rnd”. That is four, not three. Admittedly, only one occurrence (rnd=1), but you did not ask for the largest number of occurrences?

So they do. So the currect first answer returned should be 4 common keys, occuring 1 time.

But DDD does not appear with rnd=3. Was that a typo, or I have misunderstood even more?

Sorry. I must have gone crosseyed. AAA, CCC and DDD all appear in rnds 1, 4 and 5, so three occurrence.

(5) By John Dennis (jdennis) on 2022-02-09 09:16:21 in reply to 3 [link] [source]

Your data matrix does not fit your description. All 5 rnd values have AAA in line 1.

The first integer is the rnd, so AAA appears as line 1 in rnds 1,2,3,4,5. But I am trying to find the most pkey values which share a common rnd. AAA, BBB, CCC, DDD all share rnd 1 (which I missed stating); AAA, CCC and DDD are all present in rnds 1, 4, and 5. So the first row I wish to have returned should show 4 (for the common keys) and 1 (for the number of times they appears together), followed by 3 and 3 for the three common keys, occurring 3 times.

(6) By anonymous on 2022-02-09 14:17:20 in reply to 5 [link] [source]

rnd 4 has pkeys 'AAA','CCC','GGG','DDD'

rnd 5 has pkeys 'AAA','GGG','CCC','DDD'

So order matters, or you'd want these counted as one group appearing twice, correct?

(7) By John Dennis (jdennis) on 2022-02-10 02:56:26 in reply to 6 [link] [source]

rnd 4 has pkeys 'AAA','CCC','GGG','DDD'

rnd 5 has pkeys 'AAA','GGG','CCC','DDD'

So order matters, or you'd want these counted as one group appearing twice, correct?

Dear me, I had a bad day yesterday! The pkey values can appear in any order in a specific rnd, so these two rows would be counted as 2 occurrences of the combination of 4 pkeys.

(8) By John Dennis (jdennis) on 2022-02-10 03:24:55 in reply to 1 [link] [source]

I can get the answer I want in a multi-step process. For example, this query will return all possible combinations of three pkeys:

SELECT a.rnd,3,printf("%s,%s,%s",a.pkey,b.pkey,c.pkey) AS combo
FROM t1 a, t1 b, t1 c
WHERE a.rnd=b.rnd
  AND a.rnd=c.rnd
AND a.pkey<b.pkey
AND a.pkey<c.pkey
AND b.pkey<c.pkey;

so a series of select statements in a UNION can be used to populate a temporary table with all possible combinations of any length sets (1, 2, 3 or 4). This table can then be queried with a COUNT(*) and GROUP BY on the combo, and restricted to certain combination numbers, or sorted on the count to show the most common combos. 

However this becomes unwieldy. In the real world the union with the 12 rows for each rnd will result in a series of increasing self-joins, with a maximum of 11 in the last select. 

I guess what I am really asking is this: is there a clever way to remove the need for the series of 2, 3, 4, ... 10, 11-way self joins?

(9) By Kevin Martin (kev82) on 2022-02-12 14:25:34 in reply to 8 [link] [source]

I don't fully understand your problem, but it sounds like what you are looking for now is a way to generate all possible lexicographically ordered subsets of your pkeys, so you can then go and join a table of those subsets back to your initial data in some way.

The generation of subsets is relatively simple to do recursively, like so:

Given

drop table if exists t1;
CREATE TABLE t1(rnd integer,line integer,pkey TEXT,primary key(rnd,line));
INSERT INTO t1 VALUES
   (1,1,'AAA'),(1,2,'BBB'),(1,3,'CCC'),(1,4,'DDD'),
   (2,1,'AAA'),(2,2,'EEE'),(2,3,'HHH'),(2,4,'DDD'),
   (3,1,'AAA'),(3,2,'FFF'),(3,3,'CCC'),(3,4,'HHH'),
   (4,1,'AAA'),(4,2,'CCC'),(4,3,'GGG'),(4,4,'DDD'),
   (5,1,'AAA'),(5,2,'GGG'),(5,3,'CCC'),(5,4,'DDD');

Then the following will create for you a table of all subsets of pkeys ordered lexicograpically (I think, anyway - basic test looks about right). I have also included a count of how many elements are in the set.

--I have done this as a table as it has to be queried loads of
--times in the CTE, I've not put any indexes on it as it is very small
--and the CTE seems to run quick enough on your data
drop table if exists ordered_pkeys;
create temporary table ordered_pkeys as SELECT
  pkey,
  row_number() over (order by pkey) as id
from
  (select distinct pkey from t1);

drop view if exists calc_all_subsets;
create temporary view
  calc_all_subsets
as with 
  recurse (curidx, nelems, subset) as (SELECT
    0, 0, ''
  union all SELECT
    r.curidx+1,
    r.nelems + sq.incl,
    case
      when
        length(r.subset) == 0 or length(sq.pkey) == 0
      then
        r.subset || sq.pkey
    else
      printf('%s-%s', r.subset, sq.pkey)
    end
  FROM
    recurse as r
    join (select id, 0 as incl, '' as pkey from ordered_pkeys union all select id, 1, pkey from ordered_pkeys) as sq
      on r.curidx+1 = sq.id
      --avoid nothing+nothing=nothing duplicates                                       
      and r.nelems + sq.incl >= 1
  WHERE
    r.curidx < (select max(id) from ordered_pkeys))
  select nelems, subset from recurse;

drop table if exists pkey_subsets;
create table pkey_subsets as select * from calc_all_subsets;

drop table if exists ordered_pkeys;
drop view if exists calc_all_subsets;

select * from pkey_subsets;

(10) By John Dennis (jdennis) on 2022-02-13 02:52:32 in reply to 9 [link] [source]

I don't fully understand your problem, but it sounds like what you are looking for now is a way to generate all possible lexicographically ordered subsets of your pkeys, so you can then go and join a table of those subsets back to your initial data in some way.

Thanks for that nice solution, but it is just part of the problem. I need these strings of pkeys to be grouped by rnd, so the result will be all combinations of 1,2,3 and 4 pkeys for each rnd.

My solution, using a series of inner joins, looks ugly but works fine, however takes quite a bit of time on the production tables with 11 pkeys for each rnd. From the 34,000 candidate rows it generates 5.7 million combinations, needing 330Mb on disk (the whole database is a little over 30Mb). But a search for a combination for one single pkey, by restricting the generation of combinations to those for that pkey, the query runs quickly enough. Testing with the pkey which occurs most frequently the number of combinations generated is just 70,000, fast and manageable, and I will use that technique on the web site.

Requests for the arbitary "most common combination of 6" and the like will be done as special request, offline.

(11) By Kevin Martin (kev82) on 2022-02-13 08:37:47 in reply to 10 [link] [source]

Thanks for that nice solution, but it is just part of the problem. I need these strings of pkeys to be grouped by rnd,

I think this is easy enough to do by joining pkey_subsets to t1 with the condition the pkey is in the subset (instr(pkey_subsets, pkey)>0), grouping by the subset and rnd, and then having all the elements of the subset in the rnd (having count(*)=pkey_subsets.nelems)

To be honest it was after this join where I was unable to follow your requirement on exactly what should be counted.

so the result will be all combinations of 1,2,3 and 4 pkeys for each rnd.

The above approach returns all subsets within each rnd, but simply adding a where clause on nelems prior to the join can limit to subsets of max length 4.

however takes quite a bit of time on the production tables with 11 pkeys for each rnd.

Yes, the number of subsets is exponential, you won't have to go too high in n before the problem becomes intractable regardless of approach.

Glad you found a solution anyway.