Nearest-match join

(1) By Phil Endecott (endecotp) on 2021-05-15 16:51:08 [link] [source]

Dear Experts,

I have two tables, both of which have a timestamp column.

I'd like to join them on the timestamp, but where there is no exact match I'd like to match with the nearest timestamp from the other table.

Is there any clever way to achieve this efficiently?

Thanks, Phil.

(2) By anonymous on 2021-05-15 20:59:05 in reply to 1 [link] [source]

What if the 'nearest' is equidistant from two rows of the other table? Also, what is the worst possible difference that is still near enough?

(3) By Harald Hanche-Olsen (hanche) on 2021-05-16 09:48:52 in reply to 2 [link] [source]

If a solution can be found that includes one row for each “nearest” timestamp, then those questions can easily be resolved by applying more filtering to the output, I think. In other words, those are questions that must be asked for any practical application, but they don't really influence the search for a solution much.

(4) By Keith Medcalf (kmedcalf) on 2021-05-16 17:26:09 in reply to 1 [link] [source]

Next before or equal and next after or equal for a single value is an easy correlated subquery. Assuming that the tables have a single column primary key, simply generate a keyset using correlated subqueries, then use that keyset to retrieve your data. It is going to be slow.

(5.2) By Keith Medcalf (kmedcalf) on 2021-05-17 18:01:48 edited from 5.1 in reply to 4 [link] [source]

Example:

create table t1 (id integer primary key, ts real not null, a, b, c);
create table t2 (id integer primary key, ts real not null, d, e, f);

with nboe_id(id1, id2) -- find the Next Before or Equal keyset
  as (
      select t1.id as id1,
             (
                 select t2.id
                   from t2
                  where t2.ts <= t1.ts
               order by t2.ts desc
                  limit 1
             )
        from t1
     ),
     naoe_id(id1, id2) -- find the Next After or Equal keyset
  as (
      select t1.id as id1,
             (
                select t2.id
                  from t2
                 where t2.ts >= t1.ts
              order by t2.ts
                 limit 1
             )
        from t1
     ),
     keyset_ts(id1, id2) -- combine the next before and next after keysets
  as (
          select id1,
                 id2
            from nboe_id
           where id2 is not null
       union all
          select id1,
                 id2
            from naoe_id
           where id2 is not null
        order by id1
     ),
     keyset(id1, id2, mintsd) -- generate the keyset with the minimum distance
  as (
        select id1,
               id2,
               min(abs(t1.ts - t2.ts))
          from keyset_ts, t1, t2
         where id1 == t1.id
           and id2 == t2.id
      group by id1
     )
select t1.*,
       t2.*
  from keyset, t1, t2
 where id1 == t1.id
   and id2 == t2.id
;

NB: Not tested and I think I got the next before or equal and next after or equal order by's right.

NBB: Added slight optimizations

(6) By anonymous on 2021-05-17 01:55:31 in reply to 4 [link] [source]

That should be fast as long as the second table has an index on the timestamp column.

Personally, I think it's clearer without multiple CTEs:

    select id as id1,
        (select id2 from
            (select id2, delta from
                (select t2.id as id2, t2.ts-t1.ts as delta from t2
                 where t2.ts>=t1.ts
                 order by t2.ts
                 limit 1)
            union all select id2, delta from
                (select t2.id as id2, t1.ts-t2.ts as delta from t2
                 where t2.ts<=t1.ts
                 order by t2.ts desc
                 limit 1)
            order by delta
            limit 1)) as id2
    from t1;

(8.5) By Keith Medcalf (kmedcalf) on 2021-05-17 19:13:16 edited from 8.4 in reply to 6 [source]

It would require testing, but that may execute "faster" than the CTE because it only scans the table t1 once. Here it is "more pretty":

select id as id1,
       (
        select id2
          from (
                   select id2,
                          delta
                     from (
                             select t2.id as id2,
                                    t2.ts-t1.ts as delta
                               from t2
                              where t2.ts >= t1.ts
                           order by t2.ts
                              limit 1
                          )
                union all
                   select id2,
                          delta
                     from (
                             select t2.id as id2,
                                    t1.ts-t2.ts as delta
                               from t2
                              where t2.ts <= t1.ts
                           order by t2.ts desc
                              limit 1
                          )
                 order by delta
                    limit 1
               )
       ) as id2
  from t1
;

Note that it might be even more efficient to replace the union all with a regular union and remove the order by since "union" will use a b-tree for duplicate detection resulting in an ordered result (though this is perhaps implementation detail). Though I think then you have to make sure to exclude NULLs from the union.

select id as id1,
       (
        select id2
          from (
                   select delta,
                          id2
                     from (
                             select t2.id as id2,
                                    t2.ts-t1.ts as delta
                               from t2
                              where t2.ts >= t1.ts
                           order by t2.ts
                              limit 1
                          )
                union
                   select delta,
                          id2
                     from (
                             select t2.id as id2,
                                    t1.ts-t2.ts as delta
                               from t2
                              where t2.ts <= t1.ts
                           order by t2.ts desc
                              limit 1
                          )
               )
       ) as id2
  from t1
;

NB: Edited to change column order in union.

Also note that in the case of an equidistant before/after timestamps, the t2 row with the lower id will be chosen.

Removed the NULL check because no null rows will be generated.

(10) By Keith Medcalf (kmedcalf) on 2021-05-17 19:57:03 in reply to 8.5 [link] [source]

To select all the matching data:

create table t1 (id integer primary key, ts not null unique, a, b, c);
create table t2 (id integer primary key, ts not null unique, d, e, f);

select *
  from t1, t2
 where t2.id == (
                 select id2
                   from (
                            select delta,
                                   id2
                              from (
                                      select id as id2,
                                             ts - t1.ts as delta
                                        from t2
                                       where ts >= t1.ts
                                    order by ts
                                       limit 1
                                   )
                         union
                            select delta,
                                   id2
                              from (
                                      select id as id2,
                                             t1.ts - ts as delta
                                        from t2
                                       where ts <= t1.ts
                                    order by ts desc
                                       limit 1
                                   )
                        )
                )
;

(7) By MBL (RobMan) on 2021-05-17 09:53:02 in reply to 1 [link] [source]

I would use the WINDOW functionality to get the previous timestamp to the same row at which I am looking for the nearest match and then use both times (from and to) to make an left outer join to the 2nd table and get all the timestamp which fell into the time frame.

My approach does not only look for one record, which is "the" closest but tries to compare all records in a "diff" way. It can be that you get none, one or several occurancies for each frame on the left - no record will get lost but shown - however, you have to expect NULL columns with my method.

(11) By MBL (RobMan) on 2021-05-18 16:11:48 in reply to 7 [link] [source]

create table T1 (
  id integer primary key,
  ts real    not null,
   a char    not null
);
insert into T1(id, ts, a)
values (1, 1.0, 'A'), (2, 1.2, 'B'), (3, 1.8, 'C'), (4, 2.5, 'D')
     , (5, 3.6, 'E'), (6, 4.5, 'F'), (7, 4.6, 'G'), (8, 9.8, 'H');

create table T2 (
  id integer primary key,
  ts real    not null,
   b integer not null
);
insert into T2(id, ts, b)
values (10, 2.0, 50), (20, 2.5, 100), (30, 9.5, 200), (40, 4.5, 300);

drop view if exists NearestMatch;
create view if not exists NearestMatch as
with timeFrames as (
select *, lag(T1.ts) over win as tsLag
 from T1
WINDOW win AS (order by T1.ts)
)
select TF.id,TF.a,TF.tsLag,TF.ts, T2.*
  from timeFrames TF
  left outer join T2 on T2.ts > TF.tsLag and T2.ts <= TF.ts;

select * from NearestMatch;

sqlite> .mode box
sqlite> select * from NearestMatch;
┌────┬───┬───────┬─────┬──────┬──────┬─────┐
│ id │ a │ tsLag │ ts  │ id:1 │ ts:1 │  b  │
├────┼───┼───────┼─────┼──────┼──────┼─────┤
│ 1  │ A │       │ 1.0 │      │      │     │
│ 2  │ B │ 1.0   │ 1.2 │      │      │     │
│ 3  │ C │ 1.2   │ 1.8 │      │      │     │
│ 4  │ D │ 1.8   │ 2.5 │ 10   │ 2.0  │ 50  │
│ 4  │ D │ 1.8   │ 2.5 │ 20   │ 2.5  │ 100 │
│ 5  │ E │ 2.5   │ 3.6 │      │      │     │
│ 6  │ F │ 3.6   │ 4.5 │ 40   │ 4.5  │ 300 │
│ 7  │ G │ 4.5   │ 4.6 │      │      │     │
│ 8  │ H │ 4.6   │ 9.8 │ 30   │ 9.5  │ 200 │
└────┴───┴───────┴─────┴──────┴──────┴─────┘
sqlite>

As you can see the left table may show repetitions but has no "holes" while the right side may show "holes" but matches every time frame from the left table.

I did not try with huge data size but I guess this view is not the slowest.

(9) By Lifepillar (lifepillar) on 2021-05-17 19:30:07 in reply to 1 [link] [source]

I assume that you want to find, for each record in a table T1, the nearest record(s) in table T2 (but not vice versa). As already noted by others, there may be more than one record in T2 matching a record in T1.

Sample data:

create table T1 (
  id integer primary key,
  ts real    not null,
   a char    not null
);

create table T2 (
  id integer primary key,
  ts real    not null,
   b integer not null
);

insert into T1(id, ts, a)
values (1, 1.0, 'A'), (2, 2.0, 'B'), (3, 3.5, 'C');

insert into T2(id, ts, b)
values (10, 2.0, 50), (20, 2.5, 100), (30, 9.5, 200), (40, 4.5, 300);

Solution 1:

select X.id, X.ts, X.a, Y.id, Y.ts, Y.b, abs(Y.ts - X.ts) as diff
  from T1 X, T2 Y
 where not exists (select 1
                     from T2 Z
                    where abs(Z.ts - X.ts) < abs(Y.ts - X.ts)
                  );

Solution 2:

with ranking_table as (
  select rank() over (partition by T1.id order by abs(T2.ts - T1.ts)) as ranking,
         T1.id as t1id, 
         T1.ts as t1ts,
         T1.a as t1a,
         T2.id as t2id,
         T2.ts as t2ts,
         T2.b as t2b,
         abs(T2.ts - T1.ts) as diff
    from T1, T2
)
select t1id, t1ts, t1a, t2id, t2ts, t2b, diff
  from ranking_table
 where ranking = 1;

The latter should perform better, but you have to test it. Solution 2 has also the advantage that you may easily find the N timestamps in T2 nearest to each timestamp in T1, for N>1, simply by using where ranking <= N.

In both cases, the queries can be made to perform much better if you can set an upper bound to the difference between the timestamps.

This is an instance of the k-nearest neighbour problem, if you want to explore it further.

(12) By Phil Endecott (endecotp) on 2021-05-20 16:55:20 in reply to 1 [link] [source]

Thanks everyone for your solutions and sorry for not following-up sooner.

I spend more time writing C++ than SQL and in my mind I have an "obvious" algorithm for this problem - I would concurrently scan both tables sequentially using by-timestamp indexes. If the first table has N rows and the second table has M rows, then that is O(N+M).

If I do an exact join in SQL, it seems to do:

QUERY PLAN
|--SCAN TABLE table1
`--SEARCH TABLE table2 USING INDEX table2_by_datetime (<expr>=?)

That is O(N log M), which is worse than O(N+M) unless M is much larger than N. In this test, table1 has 8000 rows and table 2 has 62000 rows.

There is also an index table1_by_datetime, but it's not using that. Perhaps the query planner has looked at the sizes of the tables and decided that this is better. Does sqlite have a join implementation that scans two indexes?

It may also be significant that my indexes and join use the datetime() function:

create index table1_by_datetime on table1( datetime(timestamp) );
create index table2_by_datetime on table2( datetime(timestamp) );
select * from table1 join table2 on( datetime(table1.timestamp) == datetime(table2.timestamp) );

Anyway... if that query plan for the exact join is the best, then I can believe that some of the solutions you've posted above may be have the same cost. I will investigate.

I envisage eventually having hundreds of thousands of rows in the first table and tens of millions of rows in the second table.

(13) By MBL (RobMan) on 2021-05-20 17:41:33 in reply to 12 [link] [source]

Try to use my solution but use T1 for the table with more rows and T2 with the table which has less rows - and index on the timestamps (or use them as primary keys) so that the > and <= comparisons can go straight to the records instead of having to scan.

Also the Window function may be able to optimize a bit more as you always know that you reuse the timestamp rom just one record before.

SQLite\Examples>sqlite3 search.sqlite
SQLite version 3.35.5 2021-04-19 18:32:05
Enter ".help" for usage hints.
sqlite> .mode box
sqlite> select * from NearestMatch where tsPrev notNull;
┌───┬────────┬─────┬──────┬─────┐
│ a │ tsPrev │ ts  │ ts:1 │  b  │
├───┼────────┼─────┼──────┼─────┤
│ B │ 1.0    │ 1.2 │      │     │
│ C │ 1.2    │ 1.8 │      │     │
│ D │ 1.8    │ 2.5 │ 2.0  │ 50  │
│ D │ 1.8    │ 2.5 │ 2.5  │ 100 │
│ E │ 2.5    │ 3.6 │      │     │
│ F │ 3.6    │ 4.5 │ 4.5  │ 200 │
│ G │ 4.5    │ 4.6 │      │     │
│ H │ 4.6    │ 9.8 │ 8.5  │ 300 │
└───┴────────┴─────┴──────┴─────┘

created by following example SQL statements:

drop table if exists T1;
create table if not exists T1 (
  ts real    not null  primary key,
   a char    not null
);
drop table if exists T2;
create table if not exists T2 (
  ts real    not null  primary key,
   b integer not null
);

replace into T1(ts, a)
values (1.0, 'A'), (1.2, 'B'), (1.8, 'C'), (2.5, 'D')
     , (3.6, 'E'), (4.5, 'F'), (4.6, 'G'), (9.8, 'H');
replace into T2(ts, b)
values (2.0, 50), (2.5, 100), (4.5, 200), (8.5, 300);

drop view if exists NearestMatch;
create view NearastMatch as
with timeFrames as (
select first_value(T1.ts) over win as tsPrev, *
 from T1
WINDOW win AS (ORDER BY T1.ts ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
)
select TF.a,TF.tsPrev,TF.ts, T2.*
  from timeFrames TF
  left outer join T2 on T2.ts > TF.tsPrev and T2.ts <= TF.ts;

select * from NearestMatch;

Here the windowing function last_value() or first_value() may be faster than lag() or lead() and the range is limited just to the one and only previous record.

(15) By Phil Endecott (endecotp) on 2021-05-21 12:37:12 in reply to 13 [link] [source]

Using window functions, it seems to materialize the view. Apart from that it's efficient.

(14) By Phil Endecott (endecotp) on 2021-05-21 11:29:28 in reply to 12 [link] [source]

As an aside, it's a bit unfortunate that I can't write

select a, b, (select c, d from y where ......... limit 1) from t;
Error: sub-select returns 2 columns - expected 1

Instead, I seem to have to get just one (unique) column in the sub-select and then join to get the others:

select a, b, (select c from y where ...... limit 1), d from t join y on (c == y.c);

That's more verbose to write and also seems to have a more complex query plan. Am I missing any tricks?