SQLite User Forum

How to unpivot a cross-table with "changing" numbers of columns and column names?

(1) By anonymous on 2022-03-09 00:11:36 [link] [source]

Dear Forum members:

I have a cross-table from a CSV import:

Key     2021-12-31   2022-01-31   2022-02-28
======  ==========   ==========   ==========
K1      val11        val12        val13
K2      val21        val22        val23

Which I would like to "unpivot":

Key     Date         Value
======  ===========  =====
K1      2021-12-31   val11
K1      2022-01-31   val12
K1      2022-02-28   val13
K2      2021-12-31   val21
K2      2022-01-31   val22
K2      2022-02-28   val23

I understand I can do it with UNION ALL.
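
For the three date columns above, such a query would look something like this (a sketch, using the imported table name "crosstable" from below):

SELECT Key, '2021-12-31' AS Date, "2021-12-31" AS Value FROM crosstable
UNION ALL
SELECT Key, '2022-01-31' AS Date, "2022-01-31" AS Value FROM crosstable
UNION ALL
SELECT Key, '2022-02-28' AS Date, "2022-02-28" AS Value FROM crosstable;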

As the imported cross-table varies in terms of number of columns and their names, I am looking for a query that can handle it without manually adjusting to the changes.

SELECT cid, name FROM PRAGMA_table_info('crosstable')

gives me the column names and column IDs of the cross-table:

cid    name
=====  ===========
0      Key
1      2021-12-31
2      2022-01-31
3      2022-02-28

I was hoping to be able to iterate through columns accordingly for UNION ALL (with recursive CTE?)

But I failed: SQLite would then take the query result from the PRAGMA, e.g. "2021-12-31", not as the name of a column in the cross-table but as a value.

How can I solve this? E.g., I have read about @variables and dynamic SQL.

Have a great day!

(2) By Larry Brasfield (larrybr) on 2022-03-09 07:44:08 in reply to 1 [link] [source]

I am tempted to say you are trying to solve the wrong problem. The real problem is that someone who knows little or nothing about database normalization has gotten the concept of choosing column names mixed up with the concept of storing data. Normally, when normalization has been somewhat in a schema designer's mind, the column names are not the data; rather they help humans understand the data so named.

My solution would be to go to the source of that ill-formed data and see where they got it in the hope of getting better data that has not been mangled by spreadsheet futzers.

However, taking your problem at face value, here is an approach:

Use "SELECT ... FROM pragma_table_xinfo('SomeMalFormedTable')" to get those "names" into some result rows, (where they are data once again.) Then, arrange to build queries and/or DDL/DML, using SQL on those results and the messed-up data. The said SQL can be executed using the eval() extension. It's going to be plenty of work to get this "right", probably more work than going to a better data source or persuading the source of that mess to learn about normalization instead of ignorantly doing pointy/clicky at a spreadsheet program.

(5) By Gisbert (gisbert) on 2022-03-09 12:23:22 in reply to 2 [link] [source]

In principle I concur with your analysis, but I would not belittle the creators of such data formats as mere "spreadsheet futzers". Such data formats do have good uses in statistical (both visual and numeric) analysis. E.g., in the "tidy data movement" that can be found in the R (referring to the statistics software) community, there are easy flows between "wide format" ("crosstable" as shown by the OP) and "long format" (as desired by the OP), exactly because sometimes one format is more conducive, and sometimes the other. Neither the creators nor the users of these R tools are, in general, "futzers"; they tend to be professionals who (at least often) know their database theory quite well.

That being said, arguably the "long format" is more natural to storage in databases, and the "wide format" may better be restricted to use on the "last mile" of analysis. Then again, perception of where that "last mile" starts may vary.

(8) By Larry Brasfield (larrybr) on 2022-03-09 19:53:59 in reply to 5 [link] [source]

I accept your criticism. I did not mean to diminish the value of reports or those who generate them for human consumption. If it had not been a diversion, I might have said more on the possibility that the original machine-readable form of that data was made so non-data-like by somebody who does not understand the difference between data fit for downstream processes and reports fit only for human perusal.

Not all spreadsheets result from futzing, but I've seen too many that do.

My point was: Go to the genuine data source, and if that is somebody who assembles it in such poor form for processing, see if a more normal form can be induced.

(3) By ddevienne on 2022-03-09 08:11:38 in reply to 1 [link] [source]

I concur with Larry.

cross-table varies in columns and names, looking for query without manually adjusting

No can do, in SQL. SQL is always static and fixed in the names
(tables, columns) it is using, while you have a variable number of columns.

How to solve it? e.g. I read about @variables and dynamic SQL

Again, regular SQL cannot do that, but dynamic SQL can, yes.

SQLite has no built-in support, but there's the eval() extension,
as already mentioned by Larry, and also the statement-virtual-table one, as well.

Both are very powerful, but no easy feat to get right (again, as Larry already mentioned).

Of course, the other way to do dynamic SQL is in code (C, Python, TCL, etc...).

Up to you how to approach this. --DD

(4) By anonymous on 2022-03-09 10:12:59 in reply to 3 [link] [source]

Thank you both of you for the prompt replies.

Actually, it is a "standard export" from a famous ERP system showing the month-end value over a given period, with a column for each month. Unfortunately, there is no way to change the structure of this table before the export (w/o customizing reports etc., which is not possible atm).

My current working solution is to use PowerQuery to "flatten" this cross-table before further use.

I was hoping to do the same with SQLite with a gain in execution speed.

As the column names and their number follow a certain logic, the manual adjustment is not too much work, but I would prefer a more "elegant" solution without it.

Edit: I understand your point and may have another idea how to get a normalized structure to work with

Thanks again!

(6) By Ryan Smith (cuz) on 2022-03-09 13:59:27 in reply to 4 [link] [source]

To add to the previous warnings...

For reasons unimportant here, I'm rather familiar with most "famous" ERP systems and can promise you that almost all ERP systems, famous or otherwise, have a database backend which you can retrieve any information from by simply querying it.

It would almost certainly be more prudent to get your data from a direct query rather than interpreting the output of a report, which can have all kinds of pitfalls, such as localization, formatting and the like - even in CSV exports - which can break whatever remedy you employ to decode the current export.

If your difficulty is making a query that gets specific data from a specific ERP system, that's quite understandable and I'm willing to assist with that too, but perhaps via direct email since that would be a tad off-topic here.

(7.1) By midijohnny on 2022-03-09 19:14:05 edited from 7.0 in reply to 1 [link] [source]

I agree with the other postings here - better to clean up the structure before handing it over to the DB.

However, if this is a nearly one-off thing to do, the following seems to work (check it please), just using the sqlite CLI itself and temporary files to generate and read dynamic SQL.

It's pretty ugly, although it could probably be cleaned up a bit to make it more acceptable.

# filename: 'gen_dyn.sql'

CREATE TABLE crosstable(key,"2022-12-31","2022-01-31","2022-02-28");

INSERT INTO crosstable VALUES('K1','val11','val12','val13');
INSERT INTO crosstable VALUES('K2','val21','val22','val23');

DROP VIEW IF EXISTS v_dynamic;
-- build one SELECT per data column and glue them together with UNION ALL
CREATE VIEW v_dynamic AS
WITH
  template(t) AS
    (SELECT 'SELECT key, %Q AS date, "%s" AS value FROM crosstable'),
  params(colname) AS
    (SELECT name FROM pragma_table_info('crosstable') WHERE name NOT IN('name','key')),
  dynamic(sql) AS
    (SELECT PRINTF(t,colname,colname) FROM template CROSS JOIN params)
SELECT GROUP_CONCAT(sql,PRINTF('%cUNION ALL%c',char(10),char(10)))||PRINTF(';%c',char(10)) AS sql FROM dynamic;

.mode column
.head off
.shell rm -rf /tmp/dyn.sql
.once /tmp/dyn.sql
SELECT * FROM v_dynamic;
.head on
.read /tmp/dyn.sql

Output:

sqlite> .version
SQLite 3.31.1 2020-01-27 19:55:54 [...]
sqlite> .read gen_dyn.sql
key         date        value     
----------  ----------  ----------
K1          2022-12-31  val11     
K2          2022-12-31  val21     
K1          2022-01-31  val12     
K2          2022-01-31  val22     
K1          2022-02-28  val13     
K2          2022-02-28  val23     

Content of dyn.sql:

SELECT key, '2022-12-31' AS date, "2022-12-31" AS value FROM crosstable
UNION ALL
SELECT key, '2022-01-31' AS date, "2022-01-31" AS value FROM crosstable
UNION ALL
SELECT key, '2022-02-28' AS date, "2022-02-28" AS value FROM crosstable;

(9) By mgr (mgrmgr) on 2022-03-09 20:25:37 in reply to 1 [link] [source]

Another way in the sqlite CLI, thanks to its ability to format its output as JSON, and to the very versatile json_tree virtual table:

CREATE TABLE crosstable(Key, "2021-12-31", "2022-01-31", "2022-02-28");
INSERT INTO crosstable VALUES('K1',1,2,3), ('K2',11,22,33), ('K3',111,222,333);

--write out to tmp.json
.mode json
.once 'tmp.json'
SELECT * FROM crosstable;

--read back and massage it using json_tree
CREATE TABLE longtable AS
WITH jt AS (SELECT * FROM json_tree(readfile('tmp.json')))
, k AS (SELECT value, id, parent FROM jt WHERE key='Key')
SELECT k.value AS k, jt.key AS d, jt.value AS v
FROM k LEFT JOIN jt
ON jt.parent=k.parent AND jt.id<>k.id;

--looks good
.mode qbox
SELECT * FROM longtable;

┌──────┬──────────────┬─────┐
│  k   │      d       │  v  │
├──────┼──────────────┼─────┤
│ 'K1' │ '2021-12-31' │ 1   │
│ 'K1' │ '2022-01-31' │ 2   │
│ 'K1' │ '2022-02-28' │ 3   │
│ 'K2' │ '2021-12-31' │ 11  │
│ 'K2' │ '2022-01-31' │ 22  │
│ 'K2' │ '2022-02-28' │ 33  │
│ 'K3' │ '2021-12-31' │ 111 │
│ 'K3' │ '2022-01-31' │ 222 │
│ 'K3' │ '2022-02-28' │ 333 │
└──────┴──────────────┴─────┘

(10) By midijohnny on 2022-03-09 20:54:55 in reply to 9 [link] [source]

Very nice - and avoids all that messy PRINTF formatting. It also prompted me to download a later version of SQLite, as I'm missing both the json '.mode' and the nice 'qbox' output format!

(11) By midijohnny on 2022-03-09 21:23:34 in reply to 10 [link] [source]

For others: this version worked for '.mode json' and '.mode qbox' (I know the latter is just to prettify the output):

  • SQLite version 3.38.0 2022-02-22 18:58:40

Question... do we even need to write out the rows as JSON to a file in the first place with this?

Or is there a JSON function which could do the equivalent row-data -> JSON conversion?

I'm not sure that would be possible, since we would need something like:

-- not real
SELECT TO_JSON(*) FROM table;
-- or at least:
SELECT TO_JSON(col1,col2) FROM table;

That is: some sort of 'varargs' would be needed here for that to work?

(12) By Larry Brasfield (larrybr) on 2022-03-09 21:55:51 in reply to 11 [link] [source]

There is not presently any execute-time varargs-like feature in SQLite's function invocation syntax. There is something like that which works at prepare-time. See min() and max() for examples.

(14.1) By midijohnny on 2022-03-09 22:53:43 edited from 14.0 in reply to 12 [link] [source]

Thanks for this.

I was aware that MIN/MAX have an overloaded version (non-aggregate) like this:

sqlite> create table x(a,b,c);
sqlite> insert into x values(0,1,2);
sqlite> select min(a,b,c) from x;
0

(Very useful that is too).

It doesn't (as you say) support a wildcard-type format; like:

-- doesn't work
sqlite> select min(*) from x;
Error: wrong number of arguments to function min()

Pretty edge case I guess ... and probably flies in the face of SQL specs (I dunno, I haven't checked) to use * anywhere but in COUNT(*) (I'm not aware of any other occurrences, but maybe I'm wrong).

I have also just realised another issue with my request: .mode json (for instance) completely 'wraps' up the entire result-set, so it includes a leading '[' and trailing ']' around all the results.

That would be an odd thing to have as an equivalent SQL function, I guess: the output would be not only the row-data, but some extra formatting/syntax.

I can imagine (again, a pretty edge case - I'm just thinking aloud) a function specifically designed to produce text output, having such header/footer arguments for convenience (it could be used for instance in the generation of other formats) - a sort of super 'group_concat', but with options to convert to JSON etc.

(13) By mgr (mgrmgr) on 2022-03-09 22:46:39 in reply to 11 [link] [source]

Well, there is json_object(), but you have to list all the fields explicitly. So the equivalent of .mode json and select * from crosstable would be

select json_group_array(json_object('Key',Key,'2021-12-13',"2021-12-31",'2022-01-31',"2022-01-31",'2022-02-28',"2022-02-28")) from crosstable;

onto which you could directly bolt a json_tree(...). But there went the dynamic part of '*' and you have to list all your columns - even twice ...

(15.1) By midijohnny on 2022-03-09 22:52:30 edited from 15.0 in reply to 13 [link] [source]

Great - thanks for this.

I see this takes care of wrapping everything up in a list/array "[...]".

[{"Key":"K1","2021-12-13":"2021-12-31","2022-01-31":"val12","2022-02-28":"val13"},{"Key":"K2","2021-12-13":"2021-12-31","2022-01-31":"val22","2022-02-28":"val23"},{"Key":"K1","2021-12-13":"2021-12-31","2022-01-31":2,"2022-02-28":3},{"Key":"K2","2021-12-13":"2021-12-31","2022-01-31":22,"2022-02-28":33},{"Key":"K3","2021-12-13":"2021-12-31","2022-01-31":222,"2022-02-28":333}]

Maybe a '*' arg to json_object might be of some value in the future - not sure if this is really anything but an edge case I'm currently obsessing over! :-)

EDIT: nope - I didn't appreciate the problem with the duplication in the column names that you pointed out.

Anyways: all good stuff - I need to get onto learning those various JSON functions - they look quite handy.

(16) By mgr (mgrmgr) on 2022-03-09 23:10:15 in reply to 13 [link] [source]

Or you can avoid a named temporary file by .read-ing from a pipeline from a second sqlite3 - rather contrived, though:

-- cobble together a select with json_object(...) and have that command as single column of a view ...
create view cc as
select printf('create view jj as select json_group_array(json_object(%s)) as j from crosstable;', group_concat(printf('%Q,%s',name,json_quote(name)),',')) as c from pragma_table_info('crosstable');

select c from cc;
┌──────────────────────────────────────────────────────────────┐
│                              c                               │
├──────────────────────────────────────────────────────────────┤
│ create view jj as select json_group_array(json_object('Key', │
│ "Key",'2021-12-31',"2021-12-31",'2022-01-31',"2022-01-31",'2 │
│ 022-02-28',"2022-02-28")) as j from crosstable;              │
└──────────────────────────────────────────────────────────────┘

-- .read that - by calling a second sqlite on ourself
.read '|sqlite3 this.db "select c from cc;"'

-- now we have the json in the second view:
select j from jj;
┌──────────────────────────────────────────────────────────────┐
│                              j                               │
├──────────────────────────────────────────────────────────────┤
│ [{"Key":"K1","2021-12-31":1,"2022-01-31":2,"2022-02-28":3},{ │
│ "Key":"K2","2021-12-31":11,"2022-01-31":22,"2022-02-28":33}, │
│ {"Key":"K3","2021-12-31":111,"2022-01-31":222,"2022-02-28":3 │
│ 33}]                                                         │
└──────────────────────────────────────────────────────────────┘

and can bolt on the json_tree as in my first post:

CREATE TABLE longtable AS
WITH jt AS (SELECT * FROM jj,json_tree(j))
, k AS (SELECT value, id, parent FROM jt WHERE key='Key')
SELECT k.value AS k, jt.key AS d, jt.value AS v
FROM k LEFT JOIN jt
ON jt.parent=k.parent AND jt.id<>k.id;

So again without listing the columns explicitly ... but almost incomprehensible code.

(17) By midijohnny on 2022-03-09 23:30:33 in reply to 16 [link] [source]

Pretty hacky (I mean that as a compliment!) - but good to learn about the pipe-facility - something I wasn't aware of!

(18) By anonymous on 2022-03-10 04:49:01 in reply to 17 [link] [source]

While I understand and concur with Larry and ddevienne, I am happy to see where this thread is heading.

Also kudos to midijohnny and mgrmgr! Definitely good learning about the "hacky" side of SQLite.

This report format was created many years ago and the original purpose was to be consumed online within the ERP, so "wide format" makes sense.

Obviously, there are a number of ways to get a "right" report from the ERP directly, as pointed out by Ryan. Due to various constraints, for the time being, I have to use what is available, meaning no chance to create another query or to change any output layout in the ERP to get the CSV with a normalized structure.

For my use case, PowerQuery has been serving me well to clean and normalize multiple files (incl. this report), but it takes its time when it comes to joining various tables. Therefore, I was curious which step(s) of the current workflow could be covered by SQLite for more efficiency or flexibility.

Thanks again!

P.S. As one (more time-consuming) way to avoid the cross-table, I played around with a simple RPA script to download the reports one by one, each with a single month's column, and to append them before importing into SQLite.

(19) By John Dennis (jdennis) on 2022-03-10 09:54:38 in reply to 18 [link] [source]

PowerQuery has been serving me well to clean and normalize multiple files (incl. this report), but it takes its time when it comes to joining various tables

Could you not use PowerQuery to unpivot the file/s you are working with, producing a CSV, and import that/them into SQLite, and do your joining there?

(24) By anonymous on 2022-03-11 17:55:18 in reply to 19 [link] [source]

Yes, this certainly works.

My original thought was whether it is possible to replace the steps in PowerQuery entirely with SQLite. Considering the above discussion, my new approach is to download one month’s values at a time and append to the normalized table in SQLite.

Before: ERP crosstable > RPA download > PowerQuery cleaning & unpivot (more flexible about the time span at the time of the download)

New: ERP table with one month’s data > RPA download and cleaning > (incrementally) import to SQLite (needs additional setup/download for the historical data once; benefit of normalized db)

Regardless of my use case, I am genuinely interested to understand how unpivoting can be accomplished with SQLite. It's great to see the different solutions proposed and being further developed, as well as another approach with some Python code.

Have a nice weekend!

(23.1) By mgr (mgrmgr) on 2022-03-11 12:35:56 edited from 23.0 in reply to 17 [link] [source]

Some more notes/insight/ideas on the one-off/hacky route (using just the sqlite cli & JSON stuff):

  • the outer JSON-array is not necessarily needed, might even be easier to use json_each instead of json_tree
  • the pipe functionality '|...' does work for input (.read / .import) and output (.output / .once)

Let's try ...

part 0

example data table as before

CREATE TABLE crosstable(Key, "2021-12-31", "2022-01-31", "2022-02-28");
INSERT INTO crosstable VALUES('K1',1,2,3), ('K2',11,22,33), ('K3',111,222,333);

part 1

Create a view with the key column(s) directly and all the value columns packed into a JSON object. The code is created dynamically, based on a selection from pragma_table_info(), and executed by piping it into a second sqlite CLI opened on the database itself.

.mode list
.headers off
.once '|sqlite3 myself.db'

select 'create view crosstable_json as select Key, json_object(' || 
  group_concat(printf('%Q,"%w"',name,name)) || 
  ') as j from crosstable;' as code
from pragma_table_info('crosstable')
where name not in ('Key');

looks good:

.mode box
select * from crosstable_json;
┌─────┬──────────────────────────────────────────────────────┐
│ Key │                          j                           │
├─────┼──────────────────────────────────────────────────────┤
│ K1  │ {"2021-12-31":1,"2022-01-31":2,"2022-02-28":3}       │
│ K2  │ {"2021-12-31":11,"2022-01-31":22,"2022-02-28":33}    │
│ K3  │ {"2021-12-31":111,"2022-01-31":222,"2022-02-28":333} │
└─────┴──────────────────────────────────────────────────────┘

part 2

Create the "long" view using json_each for every row.

create view crosstable_long as 
select v.key as k, je.key as d, je.value as v
from crosstable_json v, json_each(j) je;

select * from crosstable_long;
┌────┬────────────┬─────┐
│ k  │     d      │  v  │
├────┼────────────┼─────┤
│ K1 │ 2021-12-31 │ 1   │
│ K1 │ 2022-01-31 │ 2   │
│ K1 │ 2022-02-28 │ 3   │
│ K2 │ 2021-12-31 │ 11  │
│ K2 │ 2022-01-31 │ 22  │
│ K2 │ 2022-02-28 │ 33  │
│ K3 │ 2021-12-31 │ 111 │
│ K3 │ 2022-01-31 │ 222 │
│ K3 │ 2022-02-28 │ 333 │
└────┴────────────┴─────┘

done & more

Should the structure of the "wide" table change, only part 1 needs to be rerun and everything is fine - pathological column names included:

alter table crosstable drop column "2021-12-31";
alter table crosstable add column "2022-03-31" default -1;
alter table crosstable add column [2022"04'30] default 'weird column name';

drop view if exists crosstable_json;

.mode list
.headers off
.once '|sqlite3 myself.db'

select 'create view crosstable_json as select Key, json_object(' || 
  group_concat(printf('%Q,"%w"',name,name)) || 
  ') as j from crosstable;' as code
from pragma_table_info('crosstable')
where name not in ('Key');

.mode box
select * from crosstable_long;
┌────┬────────────┬───────────────────┐
│ k  │     d      │         v         │
├────┼────────────┼───────────────────┤
│ K1 │ 2022-01-31 │ 2                 │
│ K1 │ 2022-02-28 │ 3                 │
│ K1 │ 2022-03-31 │ -1                │
│ K1 │ 2022"04'30 │ weird column name │
│ K2 │ 2022-01-31 │ 22                │
│ K2 │ 2022-02-28 │ 33                │
│ K2 │ 2022-03-31 │ -1                │
│ K2 │ 2022"04'30 │ weird column name │
│ K3 │ 2022-01-31 │ 222               │
│ K3 │ 2022-02-28 │ 333               │
│ K3 │ 2022-03-31 │ -1                │
│ K3 │ 2022"04'30 │ weird column name │
└────┴────────────┴───────────────────┘

(20) By Keith Medcalf (kmedcalf) on 2022-03-10 23:05:25 in reply to 1 [link] [source]

I would write some code to do what I wanted:

import csv
import sqlite3
import sys

if len(sys.argv) != 4:
    print('Exactly three parameters are required.')
    print('Param 1 : Name of Input CSV File')
    print('Param 2 : Name of the Database File')
    print('Param 3 : Name of the Table in the Database to put the data in')
    raise SystemExit

Filename = sys.argv[1]
dbFilename = sys.argv[2]
dbTablename = sys.argv[3]

fi = open(Filename, 'r', newline='', errors='replace')
csvi = csv.reader(fi)

db = sqlite3.connect(dbFilename, isolation_level=None)
db.execute('drop table if exists ' + dbTablename)
db.execute('create table ' + dbTablename + ' ( row text not null, col text not null, value numeric, primary key(row,col), unique(col,row,value)) without rowid')
db.execute('begin immediate')
cols = []
firstrow = True
sql = 'INSERT INTO ' + dbTablename + ' VALUES (?, ?, ?)'
for row in csvi:
    if firstrow:
        cols = row[1:]
        firstrow = False
    else:
        for i, v in enumerate(cols):
            db.execute(sql, (row[0], v, row[i+1]))
db.execute('COMMIT')
db.close()

Which will take source that looks like this:

Key,2021-12-31,2022-01-31,2022-02-28
K1,val11,val12,val13
K2,val21,val22,val23
as the input CSV file. Given the command line:
unpivot.py source.csv x.db test

the table test in x.db then contains:

>sqlite3 x.db
-- Loading resources from C:\Users\KMedcalf/.sqliterc
SQLite version 3.39.0 2022-03-10 01:24:07
Enter ".help" for usage hints.
sqlite> select * from test;
┌─────┬────────────┬───────┐
│ row │    col     │ value │
├─────┼────────────┼───────┤
│ K1  │ 2021-12-31 │ val11 │
│ K2  │ 2021-12-31 │ val21 │
│ K1  │ 2022-01-31 │ val12 │
│ K2  │ 2022-01-31 │ val22 │
│ K1  │ 2022-02-28 │ val13 │
│ K2  │ 2022-02-28 │ val23 │
└─────┴────────────┴───────┘

As long as your source is square, it will work. Missing values (unsquareness) are not handled (except by explosion).

(21) By Keith Medcalf (kmedcalf) on 2022-03-11 03:58:39 in reply to 20 [link] [source]

The pivot_vtab extension https://github.com/jakethaw/pivot_vtab can use that table to generate the original table:

>sqlite3 x.db
-- Loading resources from C:\Users\KMedcalf/.sqliterc
SQLite version 3.39.0 2022-03-11 00:51:53
Enter ".help" for usage hints.
sqlite> create virtual table original using pivot_vtab(
   ...>     (
   ...>        select row as Key
   ...>          from test
   ...>      group by row
   ...>      order by row
   ...>     ),
   ...>     (
   ...>       select col, col
   ...>         from test
   ...>      group by col
   ...>      order by col
   ...>     ),
   ...>     (
   ...>        select value
   ...>          from test
   ...>         where row == ?1
   ...>           and col == ?2
   ...>     )
   ...> );
sqlite> select * from original;
┌─────┬────────────┬────────────┬────────────┐
│ Key │ 2021-12-31 │ 2022-01-31 │ 2022-02-28 │
├─────┼────────────┼────────────┼────────────┤
│ K1  │ val11      │ val12      │ val13      │
│ K2  │ val21      │ val22      │ val23      │
└─────┴────────────┴────────────┴────────────┘

(22.4) By Keith Medcalf (kmedcalf) on 2022-03-11 21:24:08 edited from 22.3 in reply to 20 [link] [source]

Here is python code that will "convert" an input CSV crosstab into a sparse relational representation for manipulation, and defines a virtual table (pivot_vtab) that re-creates the original crosstab.


import csv
import os.path
import sqlite3
import sys

# Check and process parameters

if len(sys.argv) != 3:
    print('unpivot <csv_filename> <sqlite_db_filename>')
    print(' Param 1 : Name of Input CSV File')
    print(' Param 2 : Name of the Database File')
    print()
    print('Read the specified CSV file and assume that it is a cross-tabulation')
    print('Using the base file name of the CSV file to generate the following:')
    print(' <basename>Raw  -- a sparse data table containing the input cell data,')
    print(' <basename>Cols -- a table containing all the column names,')
    print(' <basename>Rows -- a table containing all the row names,')
    print(' <basename>Data -- an updatable view of the above data,')
    print(' <basename>     -- a pivot_vtab virtual table that generates the input table')
    raise SystemExit

Filename = sys.argv[1]
dbFilename = sys.argv[2]
dbTablename = os.path.splitext(os.path.basename(Filename))[0]
dbDatatable = dbTablename + 'Raw'
dbColtable = dbTablename + 'Cols'
dbRowtable = dbTablename + 'Rows'
dbViewname = dbTablename + 'Data'

# Open the files we need to work with

fi = open(Filename, 'r', newline='', errors='replace')
csvi = csv.reader(fi)

db = sqlite3.connect(dbFilename, isolation_level=None)

# Process all this in a transaction so either all changes are made or none are made

db.execute('begin immediate')

# drop any objects that we will be creating

db.execute('drop table if exists ' + dbTablename)
db.execute('drop view  if exists ' + dbViewname)
db.execute('drop table if exists ' + dbDatatable)
db.execute('drop table if exists ' + dbColtable)
db.execute('drop table if exists ' + dbRowtable)

# build the database schema

db.executescript("""
create table """ + dbDatatable + """
(
    rn      integer not null references """ + dbRowtable + """(rn) on delete cascade on update cascade,
    cn      integer not null references """ + dbColtable + """(cn) on delete cascade on update cascade,
    value   numeric not null,
    primary key (rn, cn),
    unique (cn, rn, value)
) without rowid
;
create table """ + dbRowtable + """
(
    rn      integer primary key,
    row     text not null /* collate nocase */ unique
)
;
create table """ + dbColtable + """
(
    cn      integer primary key,
    col     text not null /* collate nocase */ unique
)
;
create view """ + dbViewname + """
as select row,
          col,
          value
     from """ + dbDatatable + """ as Data
left join """ + dbRowtable + """ as Rows on Rows.rn == Data.rn
left join """ + dbColtable + """ as Cols on Cols.cn == Data.cn
;
create trigger """ + dbViewname + """_Insert instead of insert on """ + dbViewname + """
begin
  insert or ignore into """ + dbRowtable + """(row) values (new.row);
  insert or ignore into """ + dbColtable + """(col) values (new.col);
  insert into """ + dbDatatable + """ values ((select rn from """ + dbRowtable + """ where row == new.row),
                                              (select cn from """ + dbColtable + """ where col == new.col),
                                              new.value);
end
;
create trigger """ + dbViewname + """_delete instead of delete on """ + dbViewname + """
begin
  delete from """ + dbDatatable + """
       where rn == (select rn from """ + dbRowtable + """ where row == old.row)
          and cn == (select cn from """ + dbColtable + """ where col == old.col);
  delete from """ + dbRowtable + """
        where row == old.row
          and not exists (
                          select *
                            from """ + dbDatatable + """
                           where rn == """ + dbDatatable + """.rn
                         );
  delete from """ + dbColtable + """
        where col == old.col
          and not exists (
                          select *
                            from """ + dbDatatable + """
                           where cn == """ + dbDatatable + """.cn
                         );
end
;
create trigger """ + dbViewname + """_update instead of update on """ + dbViewname + """
begin
  update """ + dbDatatable + """
     set value == new.value
   where old.value is not new.value
     and rn == (select rn from """ + dbRowtable + """ where row == old.row)
     and cn == (select cn from """ + dbColtable + """ where col == old.col);
  insert or ignore into """ + dbRowtable + """(row) select new.row where new.row is not old.row;
  insert or ignore into """ + dbColtable + """(col) select new.col where new.col is not old.col;
  update """ + dbDatatable + """
     set rn = (select rn from """ + dbRowtable + """ where row == new.row),
         cn = (select cn from """ + dbColtable + """ where col == new.col)
   where (old.row is not new.row or old.col is not new.col)
     and rn == (select rn from """ + dbRowtable + """ where row == old.row)
     and cn == (select cn from """ + dbColtable + """ where col == old.col);
  delete from """ + dbRowtable + """
        where row == old.row
          and not exists (
                          select *
                            from """ + dbDatatable + """
                           where rn == """ + dbDatatable + """.rn
                         );
  delete from """ + dbColtable + """
        where col == old.col
          and not exists (
                          select *
                            from """ + dbDatatable + """
                           where cn == """ + dbDatatable + """.cn
                         );
end
;
""")

# read the CSV data and store it in the database

cols = []
firstrow = True
sql1 = 'insert into ' + dbColtable + ' values (?,?)'
sql2 = 'insert into ' + dbRowtable + '(row) values (?)'
sql3 = 'select rn from ' + dbRowtable + ' where row == ?'
sql4 = 'INSERT INTO ' + dbDatatable + ' VALUES (?, ?, ?)'
for row in csvi:
    if firstrow:
        slug = row[0]
        cols = row[1:]
        firstrow = False
        for i, v in enumerate(cols):
            db.execute(sql1, (i+1, v))
    else:
        db.execute(sql2, (row[0], ))
        rn = db.execute(sql3, (row[0],)).fetchone()[0]
        for i, v in enumerate(cols):
            try:
                if len(row[i+1]) > 0:
                    db.execute(sql4, (rn, i+1, row[i+1]))
            except:
                pass
fi.close()
db.commit()

db.execute("""
create virtual table """ + dbTablename + """ using pivot_vtab(
    (
       select row as '""" + slug + """'
         from """ + dbRowtable + """
     order by rn
    ),
    (
      select col, col
         from """ + dbColtable + """
     order by cn
    ),
    (
       select value
         from """ + dbViewname + """
        where row == ?1
          and col == ?2
    )
)
""")

db.close()

Processing the source.csv file results in the following schema:

sqlite> .schema
CREATE TABLE sourceRaw
(
    rn      integer not null references sourceRows(rn) on delete cascade on update cascade,
    cn      integer not null references sourceCols(cn) on delete cascade on update cascade,
    value   numeric not null,
    primary key (rn, cn),
    unique (cn, rn, value)
) without rowid
;
CREATE TABLE sourceRows
(
    rn      integer primary key,
    row     text not null /* collate nocase */ unique
);
CREATE TABLE sourceCols
(
    cn      integer primary key,
    col     text not null /* collate nocase */ unique
);
CREATE VIEW sourceData
as select row,
          col,
          value
     from sourceRaw as Data
left join sourceRows as Rows on Rows.rn == Data.rn
left join sourceCols as Cols on Cols.cn == Data.cn
/* sourceData("row",col,value) */;
CREATE TRIGGER sourceData_Insert instead of insert on sourceData
begin
  insert or ignore into sourceRows(row) values (new.row);
  insert or ignore into sourceCols(col) values (new.col);
  insert into sourceRaw values ((select rn from sourceRows where row == new.row),
                                              (select cn from sourceCols where col == new.col),
                                              new.value);
end;
CREATE TRIGGER sourceData_delete instead of delete on sourceData
begin
  delete from sourceRaw
       where rn == (select rn from sourceRows where row == old.row)
          and cn == (select cn from sourceCols where col == old.col);
  delete from sourceRows
        where row == old.row
          and not exists (
                          select *
                            from sourceRaw
                           where rn == sourceRaw.rn
                         );
  delete from sourceCols
        where col == old.col
          and not exists (
                          select *
                            from sourceRaw
                           where cn == sourceRaw.cn
                         );
end;
CREATE TRIGGER sourceData_update instead of update on sourceData
begin
  update sourceRaw
     set value == new.value
   where old.value is not new.value
     and rn == (select rn from sourceRows where row == old.row)
     and cn == (select cn from sourceCols where col == old.col);
  insert or ignore into sourceRows(row) select new.row where new.row is not old.row;
  insert or ignore into sourceCols(col) select new.col where new.col is not old.col;
  update sourceRaw
     set rn = (select rn from sourceRows where row == new.row),
         cn = (select cn from sourceCols where col == new.col)
   where (old.row is not new.row or old.col is not new.col)
     and rn == (select rn from sourceRows where row == old.row)
     and cn == (select cn from sourceCols where col == old.col);
  delete from sourceRows
        where row == old.row
          and not exists (
                          select *
                            from sourceRaw
                           where rn == sourceRaw.rn
                         );
  delete from sourceCols
        where col == old.col
          and not exists (
                          select *
                            from sourceRaw
                           where cn == sourceRaw.cn
                         );
end;
CREATE VIRTUAL TABLE source using pivot_vtab(
    (
       select row as 'Key'
         from sourceRows
     order by rn
    ),
    (
      select col, col
         from sourceCols
     order by cn
    ),
    (
       select value
         from sourceData
        where row == ?1
          and col == ?2
    )
)
/* source("Key","2021-12-31","2022-01-31","2022-02-28") */;

Edited to fix Source.csv -> source.csv
Note: you may want to keep or remove on delete cascade on update cascade after each of the references clauses to allow entire columns and rows to be moved or deleted. Or not.
Edited: changed pivot_vtab to get its information from the view, which appears to work better for re-ordering rows and columns.
Note: Cells in the input CSV which are "empty" are not stored. Columns must have a name to be recognized. The source data does not need to be square -- missing columns at the end of each row are ignored.
Note: The pivot_vtab runs the column query when the vtable is created (that is, when the create virtual table statement is processed). If you add, remove or re-order columns, then you must "reparse" the schema. There are several ways to accomplish this.
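
One simple way (a sketch) is to drop and re-create the virtual table after the wide table's column set has changed, so that the column query is evaluated again:

-- re-create the pivot virtual table so its column list is rebuilt
DROP TABLE IF EXISTS source;
-- ...then re-run the CREATE VIRTUAL TABLE source USING pivot_vtab( ... ) statement shown above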

(25) By anonymous on 2022-03-15 07:04:06 in reply to 1 [source]

I had a need for something similar, so I wrote the following C code (this code is public domain):

// CSV2D extension
// Public domain

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "sqlite3ext.h"

static const sqlite3_api_routines*sqlite3_api;

typedef struct {
  sqlite3_vtab_cursor cursor;
  FILE*fp;
  sqlite3_int64 r,c,lim;
  sqlite3_str*v;
  char limited,nextrec,eof;
} Cursor;

static int xConnect(sqlite3*db,void*aux,int argc,const char*const*argv,sqlite3_vtab**vt,char**err) {
  int i=sqlite3_declare_vtab(db,"CREATE TABLE `CSV`(`R` INT NOT NULL, `C` INT NOT NULL, `V` TEXT NOT NULL, `FILE` TEXT HIDDEN);");
  if(i) return i;
  *vt=sqlite3_malloc(sizeof(sqlite3_vtab));
  return *vt?SQLITE_OK:SQLITE_NOMEM;
}

static int xDisconnect(sqlite3_vtab*vt) {
  sqlite3_free(vt);
  return SQLITE_OK;
}

static int xBestIndex(sqlite3_vtab*vt,sqlite3_index_info*inf) {
 int i,j;
  for(i=0;i<inf->nConstraint;i++) {
    if(inf->aConstraint[i].usable && inf->aConstraint[i].iColumn==3 && inf->aConstraint[i].op==SQLITE_INDEX_CONSTRAINT_EQ) {
      inf->aConstraintUsage[i].argvIndex=1;
      inf->aConstraintUsage[i].omit=1;
      goto found;
    }
  }
  return SQLITE_CONSTRAINT;
  found:
  if(inf->nOrderBy==2 && inf->aOrderBy[0].iColumn==0 && !inf->aOrderBy[0].desc && inf->aOrderBy[1].iColumn==1 && !inf->aOrderBy[1].desc) inf->orderByConsumed=1;
  if(inf->nOrderBy==1 && inf->aOrderBy[0].iColumn==0 && !inf->aOrderBy[0].desc) inf->orderByConsumed=1;
  if(inf->nOrderBy==1 && inf->aOrderBy[0].iColumn==-1 && !inf->aOrderBy[0].desc) inf->orderByConsumed=1;
  for(i=0;i<inf->nConstraint;i++) {
    if(inf->aConstraint[i].usable && inf->aConstraint[i].iColumn==0) {
      j=inf->aConstraint[i].op;
      if(j==SQLITE_INDEX_CONSTRAINT_EQ || j==SQLITE_INDEX_CONSTRAINT_LE || j==SQLITE_INDEX_CONSTRAINT_LT || j==SQLITE_INDEX_CONSTRAINT_IS) {
        inf->aConstraintUsage[i].argvIndex=2;
        break;
      }
    }
  }
  return SQLITE_OK;
}

static int xOpen(sqlite3_vtab*vt,sqlite3_vtab_cursor**cur) {
  Cursor*c=(void*)sqlite3_malloc(sizeof(Cursor));
  if(!c) return SQLITE_NOMEM;
  c->fp=0;
  c->v=0;
  *cur=(void*)c;
  return SQLITE_OK;
}

static int xClose(sqlite3_vtab_cursor*cur) {
  Cursor*c=(void*)cur;
  if(c->fp) fclose(c->fp);
  if(c->v) sqlite3_str_reset(c->v),sqlite3_free(sqlite3_str_finish(c->v));
  sqlite3_free(cur);
  return SQLITE_OK;
}

static int xEof(sqlite3_vtab_cursor*cur) {
  Cursor*c=(void*)cur;
  return c->eof;
}

static int xNext(sqlite3_vtab_cursor*cur) {
  Cursor*c=(void*)cur;
  int x,y;
  FILE*fp=c->fp;
  if(c->nextrec) ++c->r,c->c=1; else ++c->c;
  if(c->limited && c->r>c->lim) {
    c->eof=1;
    return SQLITE_OK;
  }
  c->nextrec=0;
  sqlite3_str_reset(c->v);
  x=fgetc(fp);
  if(x=='\r') x=fgetc(fp);
  if(x=='"') {
    for(;;) {
      x=fgetc(fp);
      if(x==EOF) return SQLITE_CORRUPT_VTAB;
      if(x=='"') {
        switch(fgetc(fp)) {
          case '\r': case '\n': c->nextrec=1; goto done;
          case ',': case EOF: goto done;
          case '"': sqlite3_str_appendchar(c->v,1,'"'); break;
          default: return SQLITE_CORRUPT_VTAB;
        }
      } else {
        sqlite3_str_appendchar(c->v,1,x);
      }
    }
  } else if(x==EOF) {
    c->eof=1;
  } else {
    while(x!=EOF && x!=',' && x!='\r' && x!='\n') {
      if(x=='"') return SQLITE_CORRUPT_VTAB;
      sqlite3_str_appendchar(c->v,1,x);
      x=fgetc(fp);
    }
    if(x=='\r' || x=='\n') c->nextrec=1;
  }
  done:
  return ferror(fp)?SQLITE_IOERR:SQLITE_OK;
}

static int xFilter(sqlite3_vtab_cursor*cur,int num,const char*str,int argc,sqlite3_value**argv) {
  Cursor*c=(void*)cur;
  const char*name;
  c->limited=0;
  c->r=0;
  c->nextrec=1;
  c->eof=0;
  if(c->fp) fclose(c->fp),c->fp=0;
  if(c->v) sqlite3_str_reset(c->v); else c->v=sqlite3_str_new(0);
  if(!argc) return SQLITE_INTERNAL;
  if(argc==2 && sqlite3_value_type(argv[1])==SQLITE_INTEGER) {
    c->limited=1;
    c->lim=sqlite3_value_int64(argv[1]);
  }
  name=sqlite3_value_text(*argv);
  if(!name) return SQLITE_ERROR;
  c->fp=fopen(name,"r");
  if(!c->fp) return SQLITE_IOERR;
  return xNext(cur);
}

static int xColumn(sqlite3_vtab_cursor*cur,sqlite3_context*cxt,int n) {
  Cursor*c=(void*)cur;
  char*v;
  switch(n) {
    case 0: sqlite3_result_int64(cxt,c->r); break;
    case 1: sqlite3_result_int64(cxt,c->c); break;
    case 2:
      if(n=sqlite3_str_errcode(c->v)) return n;
      if(n=sqlite3_str_length(c->v)) {
        sqlite3_result_text(cxt,sqlite3_str_value(c->v),n,SQLITE_TRANSIENT);
      } else {
        sqlite3_result_text(cxt,"",0,0);
      }
      break;
  }
  return SQLITE_OK;
}

static int xRowid(sqlite3_vtab_cursor*cur,sqlite3_int64*id) {
  Cursor*c=(void*)cur;
  *id=ftell(c->fp);
  return SQLITE_OK;
}

static sqlite3_module my_module={
  .iVersion=1,
  .xConnect=xConnect,
  .xDisconnect=xDisconnect,
  .xBestIndex=xBestIndex,
  .xOpen=xOpen,
  .xClose=xClose,
  .xFilter=xFilter,
  .xNext=xNext,
  .xEof=xEof,
  .xColumn=xColumn,
  .xRowid=xRowid,
};

int sqlite3_extension_init(sqlite3*db,const char**err,const struct sqlite3_api_routines*api) {
  sqlite3_api=api;
  sqlite3_create_module(db,"CSV",&my_module,0);
  return SQLITE_OK;
}

You can use the following SQL code to achieve that (change test2.csv to the name of the CSV file):

select c.v "Key", a.v "Date", b.v "Value" from csv('test2.csv') a,csv('test2.csv') b,csv('test2.csv') c where a.r=1 and a.c>1 and b.r>1 and a.c=b.c and b.r=c.r and c.c=1 order by c.r,a.c;
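
For reference, loading the compiled extension and doing a quick smoke test in the CLI could look like this (a sketch; it assumes the code above was built into csv2d.so / csv2d.dll in the current directory):

.load ./csv2d
select * from csv('test2.csv') limit 5;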

(26) By anonymous on 2022-03-20 07:03:11 in reply to 25 [link] [source]

I was curious and gave it a try with your query, using my sample CSV file (16 columns and 4800 rows, with ',' as delimiter).

Unfortunately, the query does not complete but seems to "hang" there, busy, without any error.

I tried with 3.38.0 CLI as well as with DB Browser (SQLite 3.35.5) incl. the csv2d.dll as 32- and 64-bit respectively.

Unfortunately, not familiar with C-code. Any ideas?

(27) By anonymous on 2022-03-20 17:54:46 in reply to 26 [link] [source]

Unfortunately, not familiar with C-code. Any ideas?

Not sure what you need this for exactly...

If you want to find a way to "import and unpivot directly from CSV",
then you could use wrapper-implementations which have csv-functionality built-in...
(e.g. compiled into the sqlite3-binary which comes with the wrapper).

Since you're running on Windows, below comes a small VBScript
(which you could put into a *.vbs-File), for "CSV-to-CSV-unpivoting"...

It depends on the COM-wrapper for SQLite, downloadable here:
https://vbRichClient.com/#/en/Downloads.htm

A self-generated test-csv-file (with 16Cols x 4800Rows) was "unpivoted" (into another csv-file) after about 370msec... (in Excel-VBA or VB6 the code would need only half that)

But as said, not sure whether that helps...

Set New_c = CreateObject("RC6.cConstructor")

New_c.Timing True
WScript.Echo UnPivotCsv("c:\temp\crosstab.csv") & New_c.Timing

Function UnPivotCsv(FileName)
  Set M = New_c.MemDB
      M.Exec "Create Virtual Table temp.ct Using csv(filename='" & FileName & "')"
      R = M.GetRs("Select * From ct").GetRows 'get the ResultSet into a 2D-Array
      
  Set S = New_c.StringBuilder
      S.AppendNL """Key"",""Date"",""Value""" 'append the Export-Header
 
  ReDim H(UBound(R, 1))
  For i = 1 To UBound(R, 1): H(i) = """" & EscDQ(R(i, 0)) & """,""": Next
    
  For j = 1 To UBound(R, 2) 'leave out the first record (which contains the headers)
    K = """" & EscDQ(R(0, j)) & ""","
    For i = 1 To UBound(R, 1) 'leave out the first Column (which contains the keys)
       S.AppendNL K & H(i) & EscDQ(R(i, j)) & """"
    Next
  Next
 
  New_c.FSO.WriteTextContent FileName & "_unpivoted.csv", S.ToString, True
  UnPivotCsv = FileName & "_unpivoted.csv" 'return the new FileName
End Function

Function EscDQ(S) 'escape potential single-DoubleQuotes with double-DoubleQuotes
  If InStr(S, """") Then EscDQ = Replace(S, """", """""") Else EscDQ = S
End Function

(29) By anonymous on 2022-03-21 07:58:01 in reply to 27 [link] [source]

Thank you and sorry for the misunderstanding.

I was wondering whether this is the expected behavior/performance of the csv2d extension, or whether I missed something, e.g. when cross-compiling both the 32- and the 64-bit dll on Ubuntu@WSL2 …

… or whether I am missing some settings for better performance with virtual tables, or whether the data types used in the CSV are less favorable (key: text, values: real).

on a side note:

I had a similar observation using the pivot_vtab extension to create a pivot table from a long table.

The query to create the virtual table runs fast, but when I do select * FROM pivot (15 columns; approx. 6500 rows), it takes something like 10 minutes. Identical behavior using the 32-bit CLI, the 64-bit DB Browser and the 64-bit ODBC driver from Werner, with the respective dlls.

(28) By anonymous on 2022-03-20 19:29:45 in reply to 26 [link] [source]

Maybe (although I have not tried it, so I do not know for sure) you might be able to get better performance by first copying the data into an ordinary b-tree table (with an index on the row/column numbers) before doing the query.
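
For instance, something along these lines (a sketch, reusing the csv() virtual table and the query shape from post 25; untested):

CREATE TABLE cells(r INT, c INT, v TEXT, PRIMARY KEY(r, c));
INSERT INTO cells SELECT r, c, v FROM csv('test2.csv');  -- single pass over the file

SELECT k.v AS "Key", h.v AS "Date", b.v AS "Value"
  FROM cells AS b
  JOIN cells AS h ON h.r = 1   AND h.c = b.c   -- header row supplies the date
  JOIN cells AS k ON k.r = b.r AND k.c = 1     -- first column supplies the key
 WHERE b.r > 1 AND b.c > 1
 ORDER BY b.r, b.c;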