Join tables by nearest older date
(1) By TripeHound on 2020-06-18 19:53:38 [link] [source]
Given (simplified) tables:
create table WinEvents ( WinID integer primary key not null, WinDate text -- format "yyyy-mm-dd" ) create table GameEvents ( GameID integer primary key not null, GameDate text -- format "yyyy-mm-dd" )
I'd like to select all records in
WinEvents joined with the most recent record from
GameDate <= WinDate.
If I knew that the dates always matched, it would be a simple join:
select * from WinEvents join GameEvents on WinEvents.WinDate = GameEvents.GameDate
but unfortunately many of the records in
WinEvents hold the date that win-events were recorded, which can be a few days after the date of the game-event to which they (should) refer. (I didn't choose this structure: I want the above query to help migrate existing data to a better schema).
I've tried a few things of my own devising without luck, and the closest examples I've found on the web all seem to be about finding the most recent record from the several that match through a more typical JOIN based on ID fields.
Thanks in advance for any assistance.
One would presume that you have "other fields" in WinEvents and GameEvents that you want to be able to access. So here is (one possible) solution.
Firstly, you need to generate the list of matches, and then retrieve the data. So you can find the matching keys as follows:
select WinID, ( select GameID from GameEvents where GameDate <= WinDate order by GameDate desc limit 1 ) as GameID from WinEvents ;
which will give you all the WinID's and their corresponding GameID. Now you want to be able to access all "joined" rows. There are multiple ways to do that:
with jt(WinID, GameID) as ( select WinID, ( select GameID from GameEvents where GameDate <= WinDate order by GameDate desc limit 1 ) as GameID from WinEvents ) select WinEvents.*, GameEvents.* from jt join WinEvents on jt.WinID == WinEvents.WinID join GameEvents on jt.GameID == GameEvents.GameID ;
or as same thing but without using the CTE syntactic sugar,
select WinEvents.*, GameEvents.* from ( select WinID, ( select GameID from GameEvents where GameDate <= WinDate order by GameDate desc limit 1 ) as GameID from WinEvents ) as jt join WinEvents on jt.WinID == WinEvents.WinID join GameEvents on jt.GameID == GameEvents.GameID ;
Note that you will likely want an index on GameDate ...
(3) By TripeHound on 2020-06-18 21:03:51 in reply to 2.2 [link] [source]
Thanks, Keith. You're correct that there are other fields in both tables. I'll give it a try when I'm back at my keyboard.
You could also do something like this:
select * from WinEvents join GameEvents on GameID = ( select GameID from GameEvents where GameDate <= WinDate order by GameDate desc limit 1 ) ;
Note that they are all really pretty much the same thing. They all use a correlated subquery to find the appropriate GameID which basically means that subquery is run for each row of the result.
(5) By TripeHound on 2020-06-18 22:30:23 in reply to 4 [link] [source]
Thanks again. What's really annoying is that I originally had something close to your last alternative, but couldn't quite get it to work... mostly it kept complaining about "no such column" inside the correlated query (even though I'm certain the column name was correct).
At the time, I put it down to some restriction on what could be used from the outer query within the correlated query, and that's when I asked here.
Having tried your version, I now suspect my original problem was something – that to me appears odd – to do with table aliases. I tend to use these a lot, so I would have had something like
select * from WinEvents as WE join GameEvents as GE on GE.GameID ...
I can't remember exactly the version I tried, but the "seems odd to me" behaviour can be shown by the following much simpler queries:
sqlite> select * from Members as M order by M.FirstName ; <...produces results as expected...> sqlite> select * from Members as M order by Members.FirstName ; Error: no such column: Members.FirstName
Am I right to feel surprised by this? I thought "
...as M" allowed you to use the (usually) shorter alias, but I didn't think it insisted on using the alias... I feel (fairly) sure I've seen queries that mix-and-match the real and aliased name. (I get the same result using the perhaps more canonical version that omits the "
I have no idea if this should be surprising or not. To tell you the truth, I have never assigned a table-alias to something and then tried to use the unaliased name (which would make one wonder why bother going to all the trouble of assigning an alias then -- it is just extra typing for naught).
Then again, you can use the same table multiple times in a single query if both are in scope then you must use an alias to refer to one of them.
This is a tricky aspect of SQL and I know that different SQL engines vary on this point, but in general it is better - once you have defined a table alias to use the alias and not the table name. There are cases (not this simple one) where ambiguities result if you don’t follow that advice.
Imagine, for example, if you were joining a table to itself. In that case, you would need to be consistent about your references and any reference to the table name would be potentially ambiguous.
Mr. Medcalf’s solution is very transparent and straight-forward and it works. However, I learned to do that sort of thing more like this:
select * from WinEvents join GameEvents as Game1 on GameDate <= WinDate and not exists( select null from GameEvents Game2 where Game2.GameDate <= WinDate and Game2.GameDate > Game1.GameDate ) ;
Among other reasons, the use of ORDER BY and LIMIT in a subquery is not supported by all SQL engines. Otoh, I admit that my method is harder to understand.
I think you will have a problem if two games occur on the same date, but maybe you know that won’t ever happen.