For a given SQL query, mechanisms for figuring out which physical columns would be returned?
(1) By Simon Willison (simonw) on 2021-04-04 23:19:13 updated by 1.1 [link]
Given any SQL query - such as: select a.*, b.* from a join b on a.b_id = b.id I would like to know the columns that would be returned by the query. Not just their names, but also which table those columns live in. For the above example this might be: a.id a.name a.b_id b.id b.name But you could imagine this getting significantly more complex for joins across many different tables. I want this so my software, Datasette, can display columns from specific tables in different ways, even when those columns are returned as part of an arbitrary query. My biggest constraint is that I am using the sqlite3 module from the Python standard library. It looks like this family of C functions: <https://www.sqlite.org/c3ref/column_database_name.html> can solve my problem, but those are not exposed in the Python library I am using. I wrote a bit more about my research into this here: <https://github.com/simonw/datasette/issues/1293> So three questions: 1. Is there a mechanism for this that I might be missing? 2. Would it be feasible to achieve this using a custom SQLite extension? One that exposes the column_database_name() C API functions in a way that I can use them to answer my question? I was thinking something like this: select database, table, column from analyze_query(:sql_query) 3. Feature request: could SQLite itself add some kind of SQL-level mechanism for accessing this information, such that even if the Python sqlite3 module doesn't add support this information would become available via a PRAGMA or similar?
For a given SQL query, mechanisms for figuring out which physical columns would be returned?
(1.1) By Simon Willison (simonw) on 2021-04-04 23:40:35 edited from 1.0 [link]
Given any SQL query - such as: select a.*, b.* from a join b on a.b_id = b.id I would like to know the columns that would be returned by the query. Not just their names, but also which table those columns live in. For the above example this might be: a.id a.name a.b_id b.id b.name But you could imagine this getting significantly more complex for joins across many different tables. I want this so my software, Datasette, can display columns from specific tables in different ways, even when those columns are returned as part of an arbitrary query. My biggest constraint is that I am using the sqlite3 module from the Python standard library. It looks like this family of C functions: <https://www.sqlite.org/c3ref/column_database_name.html> can solve my problem, but those are not exposed in the Python library I am using. I wrote a bit more about my research into this here: <https://github.com/simonw/datasette/issues/1293> So three questions: 1. Is there a mechanism for this that I might be missing? 2. Would it be feasible to achieve this using a custom SQLite extension? One that exposes the column_database_name() C API functions in a way that I can use them to answer my question? I was thinking something like this: select database, table, column from analyze_query( 'select foo, bar from baz' ) 3. Feature request: could SQLite itself add some kind of SQL-level mechanism for accessing this information, such that even if the Python sqlite3 module doesn't add support this information would become available via a PRAGMA or similar?
(2) By Simon Willison (simonw) on 2021-04-04 23:37:29 in reply to 1.0 [link]
Here's a shorter, better way to describe the problem I want to solve. Given an arbitrary SQL query, I would like to know: 1. The name of the columns that will be returned if I execute the query 2. For each of those returned columns, did it come from a specific column in a specific table? If so, which one? 3. If a column was generated by an expression (a group_concat() for example) I'm OK not knowing which columns from which table were involved in that expression Ideally I'd like to do this without executing the query (or while executing the query with a limit 0) but I'd be OK with a solution that did execute the query too.
(3) By Kees Nuyt (knu) on 2021-04-05 00:09:02 in reply to 2 [link]
In my projects, I write queries with an eye on the schema, so I always know where the results come from. Also, I never use SELECT x.* , because any added column will ruin my day. Anyway, you could try to interpret the output of `EXPLAIN SELECT`, perhaps in combination with `PRAGMA table_info()`, but it isn't easy, especially with aliased `rowid`. Example: ```sql CREATE TABLE a ( id INTEGER PRIMARY KEY NOT NULL , name TEXT , b_id INTEGER REFERENCES b(id) ); CREATE TABLE b ( id INTEGER PRIMARY KEY NOT NULL , name TEXT ); INSERT INTO b VALUES (2,'two'); INSERT INTO b VALUES (3,'three'); INSERT INTO a VALUES (1,'one',2); .mode column .headers on SELECT * FROM pragma_table_info('a') UNION ALL SELECT * FROM pragma_table_info('b'); EXPLAIN SELECT a.*, b.* FROM a JOIN b ON a.b_id = b.id; SELECT a.*, b.* FROM a JOIN b ON a.b_id = b.id; ``` Result: ``` cid name type notnull dflt_value pk --- ---- ------- ------- ---------- -- 0 id INTEGER 1 1 1 name TEXT 0 0 2 b_id INTEGER 0 0 0 id INTEGER 1 1 1 name TEXT 0 0 addr opcode p1 p2 p3 p4 p5 comment ---- ------------- ---- ---- ---- ------------- -- ------------- 0 Init 0 16 0 0 Start at 16 1 OpenRead 0 2 0 3 0 root=2 iDb=0; a 2 OpenRead 1 3 0 2 0 root=3 iDb=0; b 3 Explain 3 0 0 SCAN a 0 4 Rewind 0 15 0 0 5 Explain 5 0 0 SEARCH b USING INTEGER PRIMARY KEY (rowid=?) 0 6 Column 0 2 1 0 r[1]=a.b_id 7 SeekRowid 1 14 1 0 intkey=r[1] 8 Rowid 0 2 0 0 r[2]=rowid 9 Column 0 1 3 0 r[3]=a.name 10 Column 0 2 4 0 r[4]=a.b_id 11 Rowid 1 5 0 0 r[5]=rowid 12 Column 1 1 6 0 r[6]=b.name 13 ResultRow 2 5 0 0 output=r[2..6] 14 Next 0 5 0 1 15 Halt 0 0 0 0 16 Transaction 0 0 2 0 1 usesStmtJournal=0 17 Goto 0 1 0 0 id name b_id id name -- ---- ---- -- ---- 1 one 2 2 two ``` This probably doesn't solve your problem, but it might give some inspiration.
(5) By Simon Willison (simonw) on 2021-04-05 01:17:47 in reply to 3
That's really useful, thanks! It looks like it _might_ be possible for me to reconstruct where each column came from using the `explain select` output. Here's a complex example: <https://calands.datasettes.com/calands?sql=explain+select%0D%0A++AsGeoJSON%28geometry%29%2C+*%0D%0Afrom%0D%0A++CPAD_2020a_SuperUnits%0D%0Awhere%0D%0A++PARK_NAME+like+%27%25mini%25%27+and%0D%0A++Intersects%28GeomFromGeoJSON%28%3Afreedraw%29%2C+geometry%29+%3D+1%0D%0A++and+CPAD_2020a_SuperUnits.rowid+in+%28%0D%0A++++select%0D%0A++++++rowid%0D%0A++++from%0D%0A++++++SpatialIndex%0D%0A++++where%0D%0A++++++f_table_name+%3D+%27CPAD_2020a_SuperUnits%27%0D%0A++++++and+search_frame+%3D+GeomFromGeoJSON%28%3Afreedraw%29%0D%0A++%29&freedraw=%7B%22type%22%3A%22MultiPolygon%22%2C%22coordinates%22%3A%5B%5B%5B%5B-122.42202758789064%2C37.82280243352759%5D%2C%5B-122.39868164062501%2C37.823887203271454%5D%2C%5B-122.38220214843751%2C37.81846319511331%5D%2C%5B-122.35061645507814%2C37.77071473849611%5D%2C%5B-122.34924316406251%2C37.74465712069939%5D%2C%5B-122.37258911132814%2C37.703380457832374%5D%2C%5B-122.39044189453125%2C37.690340943717715%5D%2C%5B-122.41241455078126%2C37.680559803205135%5D%2C%5B-122.44262695312501%2C37.67295135774715%5D%2C%5B-122.47283935546876%2C37.67295135774715%5D%2C%5B-122.52502441406251%2C37.68382032669382%5D%2C%5B-122.53463745117189%2C37.6892542140253%5D%2C%5B-122.54699707031251%2C37.690340943717715%5D%2C%5B-122.55798339843751%2C37.72945260537781%5D%2C%5B-122.54287719726564%2C37.77831314799672%5D%2C%5B-122.49893188476564%2C37.81303878836991%5D%2C%5B-122.46185302734376%2C37.82822612280363%5D%2C%5B-122.42889404296876%2C37.82822612280363%5D%2C%5B-122.42202758789064%2C37.82280243352759%5D%5D%5D%5D%7D> It looks like the opcodes I need to inspect are `OpenRead`, `Column` and `ResultRow`. `OpenRead` tells me which tables are being opened - the `p2` value (in this case 51) corresponds to the `rootpage` column in `sqlite_master` here: <https://calands.datasettes.com/calands?sql=select+*+from+sqlite_master> - it gets assigned to the register in `p1`. The `Column` opcodes tell me which columns are being read - `p1` is that table reference, and `p2` is the `cid` of the column within that table. The `ResultRow` opcode then tells me which columns are used in the results. `15 16` means start at the 15th and then read the next 16 columns. I think this might work!
(6) By Kees Nuyt (knu) on 2021-04-05 02:34:36 in reply to 5 [link]
> I think this might work! Ok, great. Just one remark I should have made before: The output format of `EXPLAIN` is not stable, it may change over versions. Also, it lists the "byte code" for the virtual machine that executes the query, so if the optimizer get smarter, the code may change. It might be stable enough for your purpose though.
(8) By Simon Willison (simonw) on 2021-04-05 03:39:11 in reply to 6 updated by 8.1 [link]
I have enough of a proof of concept implementation in https://github.com/simonw/datasette/issues/1293#issuecomment-813162622 that I'm pretty confident I can get this to work. I'm ready to take my chances with regards to stability - this feature is a nice-to-have, so if future changes break it I can work around that. Thanks for the help!
(8.1) By Simon Willison (simonw) on 2021-04-05 03:39:25 edited from 8.0 in reply to 6 [link]
I have enough of a proof of concept implementation in <https://github.com/simonw/datasette/issues/1293#issuecomment-813162622> that I'm pretty confident I can get this to work. I'm ready to take my chances with regards to stability - this feature is a nice-to-have, so if future changes break it I can work around that. Thanks for the help!
(4) By Simon Slavin (slavin) on 2021-04-05 01:16:49 in reply to 1.1 [link]
Can't solve your problem, but how would you expect whatever solution results to respond to <code>SELECT age-1 FROM members</code> or <code>SELECT 365*20 FROM members</code> or <code>SELECT invoice.purch_amount * items.price FROM invoice JOIN items ON item.id = invoice.item</code>
(7) By Simon Willison (simonw) on 2021-04-05 03:38:07 in reply to 4 [link]
For those cases I'm fine with it not being able to identify the column that is being used in the expression. I only want to be able to detect columns that are used as-is, without being part of a formula.