Join tables by nearest older date

(1) By TripeHound on 2020-06-18 19:53:38 [link] [source]

Given (simplified) tables:

create table WinEvents (
    WinID      integer primary key not null,
    WinDate    text   -- format "yyyy-mm-dd"
)

create table GameEvents (
    GameID     integer primary key not null,
    GameDate   text   -- format "yyyy-mm-dd"
)

I'd like to select all records in WinEvents joined with the most recent record from GameEvents where GameDate <= WinDate.

If I knew that the dates always matched, it would be a simple join:

select *
    from WinEvents
    join GameEvents
    on WinEvents.WinDate = GameEvents.GameDate

but unfortunately many of the records in WinEvents hold the date that win-events were recorded, which can be a few days after the date of the game-event to which they (should) refer. (I didn't choose this structure: I want the above query to help migrate existing data to a better schema).

I've tried a few things of my own devising without luck, and the closest examples I've found on the web all seem to be about finding the most recent record from the several that match through a more typical JOIN based on ID fields.

Thanks in advance for any assistance.

(2.2) By Keith Medcalf (kmedcalf) on 2020-06-18 20:25:19 edited from 2.1 in reply to 1 [source]

One would presume that you have "other fields" in WinEvents and GameEvents that you want to be able to access. So here is (one possible) solution.

Firstly, you need to generate the list of matches, and then retrieve the data. So you can find the matching keys as follows:

select WinID,
       (
          select GameID
            from GameEvents
           where GameDate <= WinDate
        order by GameDate desc
           limit 1
       ) as GameID
  from WinEvents
;

which will give you all the WinID's and their corresponding GameID. Now you want to be able to access all "joined" rows. There are multiple ways to do that:

with jt(WinID, GameID)
  as (
      select WinID,
             (
                select GameID
                  from GameEvents
                 where GameDate <= WinDate
              order by GameDate desc
                 limit 1
             ) as GameID
        from WinEvents
     )
select WinEvents.*, GameEvents.*
  from jt
  join WinEvents
    on jt.WinID == WinEvents.WinID
  join GameEvents
    on jt.GameID == GameEvents.GameID
;

or as same thing but without using the CTE syntactic sugar,

select WinEvents.*, GameEvents.*
  from (
        select WinID,
               (
                  select GameID
                    from GameEvents
                   where GameDate <= WinDate
                order by GameDate desc
                   limit 1
               ) as GameID
          from WinEvents
         ) as jt
  join WinEvents
    on jt.WinID == WinEvents.WinID
  join GameEvents
    on jt.GameID == GameEvents.GameID
;

Note that you will likely want an index on GameDate ...

(3) By TripeHound on 2020-06-18 21:03:51 in reply to 2.2 [link] [source]

Thanks, Keith. You're correct that there are other fields in both tables. I'll give it a try when I'm back at my keyboard.

(4) By Keith Medcalf (kmedcalf) on 2020-06-18 21:33:33 in reply to 2.2 [link] [source]

You could also do something like this:

select *
  from WinEvents
  join GameEvents
    on GameID = (
                   select GameID
                     from GameEvents
                    where GameDate <= WinDate
                 order by GameDate desc
                    limit 1
                )
;

Note that they are all really pretty much the same thing. They all use a correlated subquery to find the appropriate GameID which basically means that subquery is run for each row of the result.

(5) By TripeHound on 2020-06-18 22:30:23 in reply to 4 [link] [source]

Thanks again. What's really annoying is that I originally had something close to your last alternative, but couldn't quite get it to work... mostly it kept complaining about "no such column" inside the correlated query (even though I'm certain the column name was correct).

At the time, I put it down to some restriction on what could be used from the outer query within the correlated query, and that's when I asked here.

Having tried your version, I now suspect my original problem was something – that to me appears odd – to do with table aliases. I tend to use these a lot, so I would have had something like

select *
  from WinEvents as WE
  join GameEvents as GE
    on GE.GameID ...

I can't remember exactly the version I tried, but the "seems odd to me" behaviour can be shown by the following much simpler queries:

sqlite> select * from Members as M order by M.FirstName ;
<...produces results as expected...>
sqlite> select * from Members as M order by Members.FirstName ;
Error: no such column: Members.FirstName

Am I right to feel surprised by this? I thought "...as M" allowed you to use the (usually) shorter alias, but I didn't think it insisted on using the alias... I feel (fairly) sure I've seen queries that mix-and-match the real and aliased name. (I get the same result using the perhaps more canonical version that omits the "AS" keyword).

(6) By Keith Medcalf (kmedcalf) on 2020-06-19 02:16:58 in reply to 5 [link] [source]

I have no idea if this should be surprising or not. To tell you the truth, I have never assigned a table-alias to something and then tried to use the unaliased name (which would make one wonder why bother going to all the trouble of assigning an alias then -- it is just extra typing for naught).

Then again, you can use the same table multiple times in a single query if both are in scope then you must use an alias to refer to one of them.

(7) By L Carl (lcarlp) on 2020-06-19 14:57:07 in reply to 5 [link] [source]

This is a tricky aspect of SQL and I know that different SQL engines vary on this point, but in general it is better - once you have defined a table alias to use the alias and not the table name. There are cases (not this simple one) where ambiguities result if you don’t follow that advice.

Imagine, for example, if you were joining a table to itself. In that case, you would need to be consistent about your references and any reference to the table name would be potentially ambiguous.

(8) By L Carl (lcarlp) on 2020-06-23 20:47:49 in reply to 4 [link] [source]

Mr. Medcalf’s solution is very transparent and straight-forward and it works. However, I learned to do that sort of thing more like this:

select *
  from WinEvents
  join GameEvents as Game1
    on GameDate <= WinDate
   and not exists(
        select null 
          from GameEvents Game2
         where Game2.GameDate <= WinDate
           and Game2.GameDate > Game1.GameDate )
;

Among other reasons, the use of ORDER BY and LIMIT in a subquery is not supported by all SQL engines. Otoh, I admit that my method is harder to understand.

I think you will have a problem if two games occur on the same date, but maybe you know that won’t ever happen.