using selects ambiguous column in successive joins

(1) By anacrolix on 2022-11-10 00:59:57 [link] [source]

When joining more than 2 tables, with a shared column name, the "using" join constraint does not apply to the tables strictly to the left and right of the join operator. The documentation is in https://www.sqlite.org/lang_select.html:

When more than two tables are joined together as part of a FROM clause, the join operations are processed in order from left to right. In other words, the FROM clause (A join-op-1 B join-op-2 C) is computed as ((A join-op-1 B) join-op-2 C).

If there is a USING clause then each of the column names specified must exist in the datasets to both the left and right of the join-operator. For each pair of named columns, the expression "lhs.X = rhs.X" is evaluated for each row of the cartesian product as a boolean expression.

If there is a USING clause then each of the column names specified must exist in the datasets to both the left and right of the join-operator.

In the best case, a join b join c using (a) should use a from b if it exists there. In the worst case, if tables a and b both have an 'a', then the using (a) should find a to be ambiguous (after a join b, the data set contains a.a and b.a).

Here is a reproduction (sqlite3 < repro.sql):

create table a(a);
create table b(a);
create table c(a);
insert into a values (1), (2);
insert into b values (2), (3);
insert into c values (3), (4);
.mode table
.echo on
-- this should join b.a and c.a giving us
-- 1 3
-- 2 3
select * from a join b join c using (a);
-- if we do it manually
select * from a join b join c on b.a=c.a;
-- (without b.a and c.a being combined into one column, which is expected).

The output from the last statement:

+---+---+---+
| a | a | a |
+---+---+---+
| 1 | 3 | 3 |
| 2 | 3 | 3 |
+---+---+---+

Please also see my previous issues:

(2) By ddevienne on 2022-11-10 09:33:57 in reply to 1 [link] [source]

a join b join c using (a) should use a from b if it exists there.
In the worst case, if tables a and b both have an a, then the using (a) should find a to be ambiguous

But if the a and b tables were joined using the a column too, as in your example, then "a"."a" and "b"."a" are by definition equivalent, so the ambiguity could be lifted.

I'm not saying that's what's happening, just thinking logically through your example (if I understood it right).

(3) By Gunter Hick (gunter_hick) on 2022-11-10 10:57:25 in reply to 1 [link] [source]

If you were to materialize the first join by

create table x as select * from a join b;

then SQLite will have generate column names to resolve the issue of both columns being named a. In SQLite 3.24 this yields columns named "a" and "a:1". What would you expect

select * from x join c using(a);

to do?

alternatively you can

WITH x(a,b) AS SELECT * from a join b
SELECT x join c using (a);

which does the same as

WITH x(a,a) AS SELECT * from a join b
SELECT x join c using (a);

while you seem to expect

WITH x(b,a) AS SELECT * from a join b
SELECT x join c using (a);

What is the reason you are expecting the last interpretation?

(4) By anacrolix on 2022-11-14 02:55:38 in reply to 2 [link] [source]

Yes I think you're correct here. But I'm not joining table a and b on a column.

(5) By anacrolix on 2022-11-14 02:58:06 in reply to 3 [link] [source]

I suspect this is what's happening under the hood, I guess it's not clear that b's column a is being renamed. This does explain the behaviour, thanks.

Per the quoted sections in my OP, it's not clear that b's column a would not be favored over a's. It's quite unexpected. I'm not sure that much can come of this, your explanation above makes a lot of sense, perhaps this can be highlighted in the documentation though.

(6) By Simon Slavin (slavin) on 2022-11-14 04:23:12 in reply to 5 [source]

Anything documented is fixed, and even if more logical behaviour is identified (as you have done in this thread), can't be changed because someone may have read the documentation and programmed accordingly.

I think this will remain undocumented because SQLite programmers are better-off not depending on it. It may change in a future version of SQLite, either because the developers intentionally change to a more logical behaviour, or as an unintended consequence of some other change to how SQLite works.

You've recognised an ambiguity by your thread title. A production program (one you're going to write and ship) shouldn't depend on on anything ambiguous. A temporary hack developed for your own use, on the other hand, just has to work long enough to do whatever you need to do.

(7) By Gunter Hick (gunter_hick) on 2022-11-14 06:56:01 in reply to 5 [link] [source]

The real issue is that you are expecting specific column names without setting them, which is an issue that regularly comes up on the forum. In your variant, it is column names of the intermediate result set from the first join that are not explicitly set.