For a given SQL query, mechanisms for figuring out which physical columns would be returned?
Given any SQL query - such as:
select a.*, b.* from a join b on a.b_id = b.id
I would like to know the columns that would be returned by the query. Not just their names, but also which table those columns live in.
For the above example this might be:
a.id a.name a.b_id b.id b.name
But you could imagine this getting significantly more complex for joins across many different tables.
I want this so my software, Datasette, can display columns from specific tables in different ways, even when those columns are returned as part of an arbitrary query.
My biggest constraint is that I am using the sqlite3 module from the Python standard library. It looks like this family of C functions: https://www.sqlite.org/c3ref/column_database_name.html can solve my problem, but those are not exposed in the Python library I am using.
I wrote a bit more about my research into this here: https://github.com/simonw/datasette/issues/1293
So three questions:
Is there a mechanism for this that I might be missing?
Would it be feasible to achieve this using a custom SQLite extension? One that exposes the column_database_name() C API functions in a way that I can use them to answer my question? I was thinking something like this:
select database, table, column from analyze_query( 'select foo, bar from baz' )
Feature request: could SQLite itself add some kind of SQL-level mechanism for accessing this information, such that even if the Python sqlite3 module doesn't add support this information would become available via a PRAGMA or similar?
Here's a shorter, better way to describe the problem I want to solve.
Given an arbitrary SQL query, I would like to know:
- The name of the columns that will be returned if I execute the query
- For each of those returned columns, did it come from a specific column in a specific table? If so, which one?
- If a column was generated by an expression (a group_concat() for example) I'm OK not knowing which columns from which table were involved in that expression
Ideally I'd like to do this without executing the query (or while executing the query with a limit 0) but I'd be OK with a solution that did execute the query too.
In my projects, I write queries with an eye on the schema, so I always know where the results come from.
Also, I never use SELECT x.* , because any added column will ruin my day.
Anyway, you could try to interpret the output of
EXPLAIN SELECT, perhaps in combination with
PRAGMA table_info(), but it isn't easy, especially with aliased
CREATE TABLE a ( id INTEGER PRIMARY KEY NOT NULL , name TEXT , b_id INTEGER REFERENCES b(id) ); CREATE TABLE b ( id INTEGER PRIMARY KEY NOT NULL , name TEXT ); INSERT INTO b VALUES (2,'two'); INSERT INTO b VALUES (3,'three'); INSERT INTO a VALUES (1,'one',2); .mode column .headers on SELECT * FROM pragma_table_info('a') UNION ALL SELECT * FROM pragma_table_info('b'); EXPLAIN SELECT a.*, b.* FROM a JOIN b ON a.b_id = b.id; SELECT a.*, b.* FROM a JOIN b ON a.b_id = b.id;
cid name type notnull dflt_value pk --- ---- ------- ------- ---------- -- 0 id INTEGER 1 1 1 name TEXT 0 0 2 b_id INTEGER 0 0 0 id INTEGER 1 1 1 name TEXT 0 0 addr opcode p1 p2 p3 p4 p5 comment ---- ------------- ---- ---- ---- ------------- -- ------------- 0 Init 0 16 0 0 Start at 16 1 OpenRead 0 2 0 3 0 root=2 iDb=0; a 2 OpenRead 1 3 0 2 0 root=3 iDb=0; b 3 Explain 3 0 0 SCAN a 0 4 Rewind 0 15 0 0 5 Explain 5 0 0 SEARCH b USING INTEGER PRIMARY KEY (rowid=?) 0 6 Column 0 2 1 0 r=a.b_id 7 SeekRowid 1 14 1 0 intkey=r 8 Rowid 0 2 0 0 r=rowid 9 Column 0 1 3 0 r=a.name 10 Column 0 2 4 0 r=a.b_id 11 Rowid 1 5 0 0 r=rowid 12 Column 1 1 6 0 r=b.name 13 ResultRow 2 5 0 0 output=r[2..6] 14 Next 0 5 0 1 15 Halt 0 0 0 0 16 Transaction 0 0 2 0 1 usesStmtJournal=0 17 Goto 0 1 0 0 id name b_id id name -- ---- ---- -- ---- 1 one 2 2 two
This probably doesn't solve your problem, but it might give some inspiration.
Can't solve your problem, but how would you expect whatever solution results to respond to
SELECT age-1 FROM members
SELECT 365*20 FROM members
SELECT invoice.purch_amount * items.price
FROM invoice JOIN items ON item.id = invoice.item
That's really useful, thanks! It looks like it might be possible for me to reconstruct where each column came from using the
explain select output.
It looks like the opcodes I need to inspect are
OpenRead tells me which tables are being opened - the
p2 value (in this case 51) corresponds to the
rootpage column in
sqlite_master here: https://calands.datasettes.com/calands?sql=select+*+from+sqlite_master - it gets assigned to the register in
Column opcodes tell me which columns are being read -
p1 is that table reference, and
p2 is the
cid of the column within that table.
ResultRow opcode then tells me which columns are used in the results.
15 16 means start at the 15th and then read the next 16 columns.
I think this might work!
I think this might work!
Ok, great. Just one remark I should have made before: The output format of
EXPLAIN is not stable, it may change over versions.
Also, it lists the "byte code" for the virtual machine that executes the query, so if the optimizer get smarter, the code may change.
It might be stable enough for your purpose though.
For those cases I'm fine with it not being able to identify the column that is being used in the expression. I only want to be able to detect columns that are used as-is, without being part of a formula.
(8.1) By Simon Willison (simonw) on 2021-04-05 03:39:25 edited from 8.0 in reply to 6 [link] [source]
I have enough of a proof of concept implementation in https://github.com/simonw/datasette/issues/1293#issuecomment-813162622 that I'm pretty confident I can get this to work. I'm ready to take my chances with regards to stability - this feature is a nice-to-have, so if future changes break it I can work around that.
Thanks for the help!