For a given SQL query, mechanisms for figuring out which physical columns would be returned?

(1.1) By Simon Willison (simonw) on 2021-04-04 23:40:35 edited from 1.0 [link] [source]

Given any SQL query - such as:

select a.*, b.* from a join b on a.b_id = b.id

I would like to know the columns that would be returned by the query. Not just their names, but also which table those columns live in.

For the above example this might be:

a.id
a.name
a.b_id
b.id
b.name

But you could imagine this getting significantly more complex for joins across many different tables.

I want this so my software, Datasette, can display columns from specific tables in different ways, even when those columns are returned as part of an arbitrary query.

My biggest constraint is that I am using the sqlite3 module from the Python standard library. It looks like this family of C functions: https://www.sqlite.org/c3ref/column_database_name.html can solve my problem, but those are not exposed in the Python library I am using.

I wrote a bit more about my research into this here: https://github.com/simonw/datasette/issues/1293

So three questions:

Is there a mechanism for this that I might be missing?
Would it be feasible to achieve this using a custom SQLite extension? One that exposes the column_database_name() C API functions in a way that I can use them to answer my question? I was thinking something like this:
```
select database, table, column from analyze_query(
    'select foo, bar from baz'
)
```
Feature request: could SQLite itself add some kind of SQL-level mechanism for accessing this information, such that even if the Python sqlite3 module doesn't add support this information would become available via a PRAGMA or similar?

(2) By Simon Willison (simonw) on 2021-04-04 23:37:29 in reply to 1.0 [link] [source]

Here's a shorter, better way to describe the problem I want to solve.

Given an arbitrary SQL query, I would like to know:

The name of the columns that will be returned if I execute the query
For each of those returned columns, did it come from a specific column in a specific table? If so, which one?
If a column was generated by an expression (a group_concat() for example) I'm OK not knowing which columns from which table were involved in that expression

Ideally I'd like to do this without executing the query (or while executing the query with a limit 0) but I'd be OK with a solution that did execute the query too.

(3) By Kees Nuyt (knu) on 2021-04-05 00:09:02 in reply to 2 [link] [source]

In my projects, I write queries with an eye on the schema, so I always know where the results come from.

Also, I never use SELECT x.* , because any added column will ruin my day.

Anyway, you could try to interpret the output of EXPLAIN SELECT, perhaps in combination with PRAGMA table_info(), but it isn't easy, especially with aliased rowid.

Example:

CREATE TABLE a (
	id    INTEGER PRIMARY KEY NOT NULL
,	name  TEXT
,	b_id  INTEGER REFERENCES b(id)
);
CREATE TABLE b (
	id    INTEGER PRIMARY KEY NOT NULL
,	name  TEXT
);
INSERT INTO b VALUES (2,'two');
INSERT INTO b VALUES (3,'three');
INSERT INTO a VALUES (1,'one',2);
.mode column
.headers on
SELECT * FROM pragma_table_info('a')
UNION ALL
SELECT * FROM pragma_table_info('b');
EXPLAIN SELECT a.*, b.* FROM a JOIN b ON a.b_id = b.id;
SELECT a.*, b.* FROM a JOIN b ON a.b_id = b.id;

Result:

cid  name  type     notnull  dflt_value  pk
---  ----  -------  -------  ----------  --
0    id    INTEGER  1                    1
1    name  TEXT     0                    0
2    b_id  INTEGER  0                    0
0    id    INTEGER  1                    1
1    name  TEXT     0                    0
addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     16    0                    0   Start at 16
1     OpenRead       0     2     0     3              0   root=2 iDb=0; a
2     OpenRead       1     3     0     2              0   root=3 iDb=0; b
3     Explain        3     0     0     SCAN a         0
4     Rewind         0     15    0                    0
5       Explain        5     0     0     SEARCH b USING INTEGER PRIMARY KEY (rowid=?)  0
6       Column         0     2     1                    0   r[1]=a.b_id
7       SeekRowid      1     14    1                    0   intkey=r[1]
8       Rowid          0     2     0                    0   r[2]=rowid
9       Column         0     1     3                    0   r[3]=a.name
10      Column         0     2     4                    0   r[4]=a.b_id
11      Rowid          1     5     0                    0   r[5]=rowid
12      Column         1     1     6                    0   r[6]=b.name
13      ResultRow      2     5     0                    0   output=r[2..6]
14    Next           0     5     0                    1
15    Halt           0     0     0                    0
16    Transaction    0     0     2     0              1   usesStmtJournal=0
17    Goto           0     1     0                    0

id  name  b_id  id  name
--  ----  ----  --  ----
1   one   2     2   two

This probably doesn't solve your problem, but it might give some inspiration.

(4) By Simon Slavin (slavin) on 2021-04-05 01:16:49 in reply to 1.1 [source]

Can't solve your problem, but how would you expect whatever solution results to respond to

SELECT age-1 FROM members

SELECT 365*20 FROM members

SELECT invoice.purch_amount * items.price FROM invoice JOIN items ON item.id = invoice.item

(5) By Simon Willison (simonw) on 2021-04-05 01:17:47 in reply to 3 [link] [source]

That's really useful, thanks! It looks like it might be possible for me to reconstruct where each column came from using the explain select output.

Here's a complex example: https://calands.datasettes.com/calands?sql=explain+select%0D%0A++AsGeoJSON%28geometry%29%2C+*%0D%0Afrom%0D%0A++CPAD_2020a_SuperUnits%0D%0Awhere%0D%0A++PARK_NAME+like+%27%25mini%25%27+and%0D%0A++Intersects%28GeomFromGeoJSON%28%3Afreedraw%29%2C+geometry%29+%3D+1%0D%0A++and+CPAD_2020a_SuperUnits.rowid+in+%28%0D%0A++++select%0D%0A++++++rowid%0D%0A++++from%0D%0A++++++SpatialIndex%0D%0A++++where%0D%0A++++++f_table_name+%3D+%27CPAD_2020a_SuperUnits%27%0D%0A++++++and+search_frame+%3D+GeomFromGeoJSON%28%3Afreedraw%29%0D%0A++%29&freedraw=%7B%22type%22%3A%22MultiPolygon%22%2C%22coordinates%22%3A%5B%5B%5B%5B-122.42202758789064%2C37.82280243352759%5D%2C%5B-122.39868164062501%2C37.823887203271454%5D%2C%5B-122.38220214843751%2C37.81846319511331%5D%2C%5B-122.35061645507814%2C37.77071473849611%5D%2C%5B-122.34924316406251%2C37.74465712069939%5D%2C%5B-122.37258911132814%2C37.703380457832374%5D%2C%5B-122.39044189453125%2C37.690340943717715%5D%2C%5B-122.41241455078126%2C37.680559803205135%5D%2C%5B-122.44262695312501%2C37.67295135774715%5D%2C%5B-122.47283935546876%2C37.67295135774715%5D%2C%5B-122.52502441406251%2C37.68382032669382%5D%2C%5B-122.53463745117189%2C37.6892542140253%5D%2C%5B-122.54699707031251%2C37.690340943717715%5D%2C%5B-122.55798339843751%2C37.72945260537781%5D%2C%5B-122.54287719726564%2C37.77831314799672%5D%2C%5B-122.49893188476564%2C37.81303878836991%5D%2C%5B-122.46185302734376%2C37.82822612280363%5D%2C%5B-122.42889404296876%2C37.82822612280363%5D%2C%5B-122.42202758789064%2C37.82280243352759%5D%5D%5D%5D%7D

It looks like the opcodes I need to inspect are OpenRead, Column and ResultRow.

OpenRead tells me which tables are being opened - the p2 value (in this case 51) corresponds to the rootpage column in sqlite_master here: https://calands.datasettes.com/calands?sql=select+*+from+sqlite_master - it gets assigned to the register in p1.

The Column opcodes tell me which columns are being read - p1 is that table reference, and p2 is the cid of the column within that table.

The ResultRow opcode then tells me which columns are used in the results. 15 16 means start at the 15th and then read the next 16 columns.

I think this might work!

(6) By Kees Nuyt (knu) on 2021-04-05 02:34:36 in reply to 5 [link] [source]

I think this might work!

Ok, great. Just one remark I should have made before: The output format of EXPLAIN is not stable, it may change over versions.

Also, it lists the "byte code" for the virtual machine that executes the query, so if the optimizer get smarter, the code may change.

It might be stable enough for your purpose though.

(7) By Simon Willison (simonw) on 2021-04-05 03:38:07 in reply to 4 [link] [source]

For those cases I'm fine with it not being able to identify the column that is being used in the expression. I only want to be able to detect columns that are used as-is, without being part of a formula.

(8.1) By Simon Willison (simonw) on 2021-04-05 03:39:25 edited from 8.0 in reply to 6 [link] [source]

I have enough of a proof of concept implementation in https://github.com/simonw/datasette/issues/1293#issuecomment-813162622 that I'm pretty confident I can get this to work. I'm ready to take my chances with regards to stability - this feature is a nice-to-have, so if future changes break it I can work around that.

Thanks for the help!