SQLite Forum

Subqueries and "no such column" error on references to other tables
Login

Subqueries and "no such column" error on references to other tables

(1) By Gwendal Roué (groue) on 2020-04-19 12:41:17 [link] [source]

Hello,

I'm working on a SQL generator that wants to "flatten" a to-many relation backed by a foreign key into a to-one relation by focusing on a single row in the associated rows.

Sometimes examples make things more clear. Let's say we want to fetch "all chats along with their latest post", given the following schema:

CREATE TABLE chat (
  id INTEGER PRIMARY KEY
);
CREATE TABLE post (
  id INTEGER PRIMARY KEY,
   chatId INTEGER REFERENCES chat(id),
   date DATETIME
);

In my experiments looking for a SQL pattern that provides correct results, while remaining composable (I'm not looking for the most efficient SQL, and generally speaking I trust the SQLite query planner), I stumbled upon a surprising behavior with SQLite version 3.28.0.

This query works:

-- Success
SELECT chat.*, post.* 
FROM chat 
JOIN post ON post.id = (SELECT id FROM post WHERE post.id = chat.id ORDER BY date DESC LIMIT 1);

This one fails with a "no such column" error:

-- Error: no such column: chat.id
SELECT chat.*, post.* 
FROM chat 
JOIN (SELECT * FROM post WHERE post.id = chat.id ORDER BY date DESC LIMIT 1) post;

I have two questions, if someone can help. The first is the most important to me. The second is just a request for clarification:

  1. Can I rely on the fact that the first query will never give such an error in the future?
  2. Why can some subqueries refer to outer tables, and some others can not?

(2) By Neal (_Neal_) on 2020-04-19 14:25:17 in reply to 1 [link] [source]

Columns from both of the join operands can be referenced in ON clause - thus your first query works. But your second query is trying to refer to first join operand’s column in second join operand - that is not allowed.

(3) By Ryan Smith (cuz) on 2020-04-19 16:04:35 in reply to 1 [link] [source]

  1. Yes the first query will always work, but
  2. That is not the problem with query 2 - The SELECT is very happy, it's the JOIN that is the problem because you are joining to an as-yet non existing table/view (which the join creates) but then trying to refer to chat.id in that second join-created-view, however "chat" never appears in THAT join's SELECT .. FROM, and you cannot magically grab it from some other JOIN you did earlier. The JOINS are not the same and won't be in the same For-Loop at the same row/position at the same moment - that only works for correlated sub-queries, which your JOIN is not.

Perhaps read up on Correlated sub-queries if what I say did not make immediate sense, once you understand the limitations of those, it should become more clear why your JOIN did not behave as expected, and how you could make that sort of thing work.

Good luck!

(4) By Gwendal Roué (groue) on 2020-04-19 16:33:28 in reply to 3 [link] [source]

Perhaps read up on Correlated sub-queries if what I say did not make immediate sense

I'll surely do! Thank you Ryan for hinting me at a better understanding of what I'm trying to achieve. That's very well appreciated, and exactly what I was hoping for, even if I was too shy to ask for it :-)

(5.1) By Gwendal Roué (groue) on 2020-04-19 17:01:01 edited from 5.0 in reply to 3 [link] [source]

Perhaps read up on Correlated sub-queries if what I say did not make immediate sense

So I did. A correlated sub query executes as if it runs once for each row in the outer query.

But "executes as if" and "actually executes" are not always the same. The database is allowed to perform whatever optimization it wants, as long as the observed behavior matches the expected one.

So does the working query below exhibits a quadratic behavior?

CREATE TABLE chat (
  id INTEGER PRIMARY KEY
);
CREATE TABLE post (
  id INTEGER PRIMARY KEY,
  chatId INTEGER REFERENCES chat(id),
  date DATETIME
);
CREATE INDEX post_chatId ON post(chatId);

EXPLAIN QUERY PLAN
SELECT chat.*, post.* 
FROM chat 
JOIN post ON post.id = (SELECT id FROM post WHERE post.id = chat.id ORDER BY date DESC LIMIT 1);

I'm not quite used to query planning interpretation yet:

QUERY PLAN
|--SCAN TABLE chat
|--SEARCH TABLE post USING INTEGER PRIMARY KEY (rowid=?)
`--CORRELATED SCALAR SUBQUERY 1
   `--SEARCH TABLE post USING INTEGER PRIMARY KEY (rowid=?)

(6) By Ryan Smith (cuz) on 2020-04-20 01:03:49 in reply to 5.1 [link] [source]

Please forgive me, I may not be 100% following your question, and it's very late, but I'll give it a try.

Firstly, I don't know what "exhibits quadratic behavior" means in this case, but I will assume you are asking "will the QP do a Cartesian join where it steps ALL the "FROM rows" (chat) times ALL the JOINed rows (post). [aka O(n^2)]

That answer is no, even though it seems like it would have to.

The Query you propose (for which the Query plan is given) seemingly will start by walking (iterating through) the entire chat table (because you offer no filter in a WHERE clause or such).

Now note that the QP is free to choose to start with the joined table if it pleases, but won't in this case as reported by the query plan.

It then will simply look up the correlated value using the PK (rowid) of the "post" table, causing only O(log n) look-ups. Note that your ORDER BY and DESC LIMIT 1 statements has no effect and is completely ignored here because the link is to a rowid (or stated otherwise, the unique primary key) so row number N can only ever be row number N, no matter how it is "ordered" or limited, in any lookup.

i.e. - this should have the same query plan:

EXPLAIN QUERY PLAN
SELECT chat.*, post.* 
FROM chat 
JOIN post ON post.id = (SELECT id FROM post WHERE post.id = chat.id);

This may well NOT be the case for other types of correlated sub-query which perhaps do not use the primary key. In that case, the order-by would help to make the QP construct a temporary table with temporary index to do the look-ups. It is hard to guarantee though, and best practice when doing correlated sub-queries is to ensure you have Indexes on the looked-up fields.

I find it best to think of the correlated sub-query as a form of Table/View, and then asking myself if I can see a clear relationship between the tables and how I can phrase the query so the Query-planner cannot miss that same relationship. (And then test it with EXPLAIN of course).

Another way of doing the above would be with a Common Table Expression (CTE - more good reading material) which is like creating a view inside your query which may look something like this:

WITH PD AS (
   SELECT id FROM post
)
SELECT chat.*, PD.* 
FROM chat 
JOIN PD ON PD.id = chat.id;

which is roughly equivalent to:

SELECT chat.*, PD.* 
FROM chat 
JOIN (SELECT id FROM post) AS PD ON PD.id = chat.id;

but with the advantage that multiple CTEs can be added and that they can refer each other in their respective SELECT clauses.

This is all tricky, because the Query Planner can decide, based on many whims, how it wants to traverse query tables. It may be that your Query suddenly switches to another plan after adding some data and doing ANALYZE. The best is to test with some real-World data.

Hope that helps and best of luck!

(7) By Keith Medcalf (kmedcalf) on 2020-04-20 02:24:44 in reply to 5.1 [link] [source]

The performance will be, at best case, O(N*M^2) where N is the number of rows in chat and M is the number of rows in post. It will probably degrade a lot faster than that. We recently discussed that in another thread.

You appear to be trying to find the "earliest post" for each "thread". Note the following comments:

  • the chat table is useless when querying data as it does nothing except waste time and you can tell this because
    (a) there is nothing from that table in the select list; and,
    (b) removing all reference to it does not affect the query result in any way.

  • (chatid, date) must be a candidate key for the post table (that is, it must be unique). If it is not unique then the concept of top and bottom of a chatid cannot use that field and some other tiebreaker must be used. If some other tiebreaker must be used, just use that directly and forget about the added complication of adding something useless.

So, you want the following:

create table chat
(
    id      integer primary key
);
create table post
(
    id      integer primary key,
    chatid  integer not null references chat(id),
    date    text,
    unique (chatid, date)
);

with mins (chatid, mindate)
       as (
             select chatid,
                    min(date)
               from post
           group by chatid
          )
select post.*
  from mins, post
 where mins.chatid == post.chatid
   and mins.mindate == post.date;

(8.1) By Gwendal Roué (groue) on 2020-04-20 07:37:48 edited from 8.0 in reply to 7 [link] [source]

The performance will be, at best case, O(N*M^2) where N is the number of rows in chat and M is the number of rows in post. It will probably degrade a lot faster than that.

All right. Thank you very much for pointing this out! I can't use this technique, then, because I won't ship a library which generates SQL with such a bad complexity.

We recently discussed that in another thread.

I'd be happy to check this conversation.

You appear to be trying to find the "earliest post" for each "thread".

Actually, the example wants to load both thread ("chat") and latest post. Imagine you want to feed the main screen of a chat app, with all conversations and a preview of the last message.

the chat table is useless when querying data as it does nothing except waste time and you can tell this because (a) there is nothing from that table in the select list; and, (b) removing all reference to it does not affect the query result in any way.

So, the point is actually to get chat information along: its a SELECT chat.*, post.*

(chatid, date) must be a candidate key for the post table (that is, it must be unique).

I really appreciate your help. This is not an acceptable constraint, in the general case, unfortunately:

In the chat/post example, we can not require unique post dates in a thread. This would exclude a whole set of reasonable and real use cases. If the user wants to disambiguate the "last post", the user would would have to sort per date and another disambiguation column (such as the primary key). If the user does not perform such disambiguation, then any post can be chosen (meaning I can add the primary key as an extra disambiguation ordering clause in the SQL I generate).

The chat/post example is just one example, for illustration purpose. My general case is the following:

Input:

  • one "parent" table
  • one "child" table
  • a foreign key from child to parent
  • a set of ordering clauses on the child table, which identifies one child per parent. The ordering clauses may not identify a single child per parent. The set of ordering clauses may actually be empty.

Ouput:

  • An SQL query
    • which selects (parent.*, child.*) pairs,
    • where the child is the first child of each parent according by the set of ordering clauses (any child is ok if the ordering clauses are ambiguous)
    • where the child may actually be missing (if a parent has no child), in a similar fashion to LEFT JOIN
    • of acceptable complexity
  • Advice for indexes in order to get the desired complexity.

Of course, if this general case is too relaxed, and does not satisfy the requirements for such a desired output, then I'll have to adapt, and maybe distinguish more cases. For example, unique indexes, if acceptable by the user, can be used. But I have to deal with the fact that users may comes with a "weak" setup. Yet I do not want to punish users who have a "strong" setup with an SQL query which has a bad O(...) complexity.

It's a difficult exercice, but I hope I can find out a solution :-)

Edit: if there is no general solution to the general problem as I describe it, then I may have to provide an advice such as "in such situation, setup an extra foreign key to the desired child, and consider using triggers so that it get automatically updated whenever the child table is modified". That would be an acceptable solution also, even if much less easy for the users to setup. I consider the "all chats with their latest post" and all similar queries to be an important use case to address, and I want to help users who do not want to write their SQL themselves (because they don't want to, or because they can't).

(9) By Gwendal Roué (groue) on 2020-04-20 16:13:36 in reply to 7 [source]

Keith, you have provided excellent hints, which drove me to a solution that sounds quite general and satisfying.

It uses window functions, so it requires SQLite 3.28+.

It looks like it has a good complexity, as I see by running the request on datasets of various sizes. For a constant number of posts per chats, it is linear with the number of chats. For a constant number of chats, it is linear with the number of posts per chats. This sounds quite acceptable to me, as it fits well a kind of database that is quite common, in the wild.

It is general enough, because it accepts the input I have:

  • one parent table (here, "chat")
  • one child table (here, "post")
  • a foreign key from child to parent
  • a set of ordering clauses on the child table, which identifies one child per parent. The ordering clauses may not identify a single child per parent. The set of ordering clauses may actually be empty.

So, the SQL:

WITH latestPost AS (
    SELECT chatId, id AS postId FROM (
        SELECT chatId, id, RANK() OVER (
            PARTITION BY chatId
            ORDER BY
                -- Any number of user-provided ordering clauses
                date DESC,
                -- When user-provided ordering clauses are not unique,
                -- append a primary key ordering for disambiguation
                -- and guarantee that there is a single row with rank 1.
                id)
                AS _rank
        FROM post)
    WHERE _rank = 1
)
SELECT chat.*, post.*
FROM chat
LEFT JOIN latestPost ON latestPost.chatId = chat.id
LEFT JOIN post ON post.id = latestPost.postId;

The query plan:

QUERY PLAN
|--MATERIALIZE 1
|  |--CO-ROUTINE 4
|  |  |--SCAN TABLE post USING COVERING INDEX post_index
|  |  `--USE TEMP B-TREE FOR RIGHT PART OF ORDER BY
|  `--SCAN SUBQUERY 4
|--SCAN TABLE chat
|--SEARCH SUBQUERY 1 USING AUTOMATIC COVERING INDEX (_rank=? AND chatId=?)
`--SEARCH TABLE post USING INTEGER PRIMARY KEY (rowid=?)