SQLite Forum

Unstack one column to multiple

(1) By Murtaza (murtazanw) on 2021-07-11 07:28:43 [link] [source]

I have a dataset in a single-column table, as below:

Hardside Cabin Luggage Trolley Black
senator
View Product Details
N18602991A	KH132-20_BLK	
KH13220BLK
02
Days
04
Hrs
25
Mins
25
Sec
1
1
0


Hardside 4 Wheels Cabin Luggage Trolley Burgundy
senator
View Product Details
N42588980A	KH134-20_BGN	
KH13420BGN
02
Days
04
Hrs
25
Mins
25
Sec
1
1
0


Softside Cabin Luggage Trolley Purple
senator
View Product Details
N32139616A	LL051-20_PRP	
LL05120PRP
02
Days
09
Hrs
25
Mins
25
Sec
1
1
0


Each row represents the value of one field, and 2 empty rows mark the boundary between records. The number of fields for each record also varies.

How can I unstack and reshape this data from one column into a proper table with multiple columns and rows?

(2) By Ryan Smith (cuz) on 2021-07-11 11:19:50 in reply to 1 [link] [source]

The number of fields for each record also varies.

That's the problem - if we do not have any idea how many rows (fields) there may be per record, there is no way to do what you want.

If the row (field) count was constant per record, the problem would be trivial to solve, since we would know the first row means the description, the second row is the make, then the product detail identifiers etc. etc.

But now we don't know what the rows may be, so it is impossible to do. Unless you have some rule that may help, like "If there are exactly 10 rows then row 5 means X and row 7 means Y, and if there are exactly 12 rows, then row 6 is X and row 8 is Y, etc."

Also, if you need to do this one time, I would suggest it is easier to just drop it in a spreadsheet and do it, or write a small bit of code to do it. If this is something that needs importing regularly, then write some more robust code to do it. SQL cannot solve your problem.

(3) By John Dennis (jdennis) on 2021-07-11 12:01:16 in reply to 2 [link] [source]

It would also be useful to know how many records you are needing to process. 100, 1000, 1 million?

(5) By Murtaza (murtazanw) on 2021-07-11 12:32:24 in reply to 3 [link] [source]

Dear John,

There are hardly 1000 rows to process, but this is an everyday job, so I need to fix this one.

(4) By Murtaza (murtazanw) on 2021-07-11 12:30:09 in reply to 2 [link] [source]

Dear Ryan, thanks for your reply.

With a fixed number of fields, I have already succeeded in unstacking and arranging the data into a proper table with SQLite itself, using a combination of the ROW_NUMBER() OVER() and MOD() functions.

With a varying number of fields, I have long been doing this in Excel with Power Query.

Though the number of fields is not fixed, there is consistency in the order of the values in each record. As you can see, only the last few values after 'Sec' vary in number: sometimes there are 3 values, sometimes 4, just before the 2 line breaks.

So there is a pattern to follow. The last 1 or 2 values of each record are not so important and can be eliminated to give a fixed number of fields per record.

How about deleting those 1 or 2 extra non-blank values by matching MOD() of ROW_NUMBER() against the fixed field count?
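
For concreteness, here is a minimal sketch of what such a fixed-field-count query can look like (illustrative only; the table name stacked and the field count of 3 are placeholders):

-- Hypothetical sketch: unstack a single column when every record has a
-- fixed number of non-blank rows (here 3), using ROW_NUMBER() and modulo.
WITH numbered AS (
  SELECT item,
         ROW_NUMBER() OVER (ORDER BY rowid) - 1 AS rn   -- 0-based running number
    FROM stacked
   WHERE item <> ''                                     -- skip blank separator rows
)
SELECT MAX(CASE WHEN rn % 3 = 0 THEN item END) AS col1,
       MAX(CASE WHEN rn % 3 = 1 THEN item END) AS col2,
       MAX(CASE WHEN rn % 3 = 2 THEN item END) AS col3
  FROM numbered
 GROUP BY rn / 3                                        -- one output row per record
 ORDER BY rn / 3;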

(6) By anonymous on 2021-07-11 12:36:53 in reply to 1 [link] [source]

I would first readfile() the text into a table with a CLOB field (character large object), so you can use substr() on the text. Then search for line breaks with a WITH RECURSIVE statement; that will help break the text into records. Use the size found for one line to get the start of the next, and you can use the window clause to get the last start and size. instr() then looks for the next CR LF CR LF as the record terminator. Finally you can process each record, looking up keywords. The replace() functions can also help.

(7) By Simon Slavin (slavin) on 2021-07-11 13:54:08 in reply to 1 [link] [source]

Can't be done inside SQLite. You need to use either a scripting/programming language, or an editor which can count characters, probably one which understands regular expressions.

Of the things you wrote, the most complicated one to deal with is "The number of fields for each record also varies." If it wasn't for this, you could do everything you wanted in a spreadsheet program. Actually, even with it you can do what you want in a spreadsheet program that has macros, but if you didn't tumble to that solution already, you probably don't know how.

Sorry, if that file is long enough that you don't want to go through it manually, you're going to need to employ a programmer. That's what they're for.

(8) By MBL (RoboManni) on 2021-07-11 16:29:16 in reply to 7 [link] [source]

$ sqlite3.exe
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> drop table if exists FileBlob;
sqlite> create table if not exists FileBlob( id integer primary key, content text, size integer, separator text );
sqlite> insert or replace into FileBlob(id,content) values( 1, replace(ReadFromFile('D:\PEMS.WIN\RAD\SQLite3x\sandbox\RowsInLines.txt'),X'0d0a',X'0a') );
sqlite> update FileBlob set size=length(Content), separator=cast(X'0a0a' as text) where id==1;
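
As a side note (an assumption on my part, not part of the setup above): if the custom ReadFromFile() function is not available, the readfile() function built into the stock sqlite3 shell should work in its place; it returns a BLOB, so cast it to TEXT before normalising the line endings:

sqlite> insert or replace into FileBlob(id,content)
   ...>   values( 1, replace( cast(readfile('RowsInLines.txt') as text), X'0d0a', X'0a' ) );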

then some basic stuff before I go into the details, for a better understanding of how it works:

sqlite> .mode box
sqlite> with recursive Lines(RowContent,EndPos) as (
   ...>   values( NULL, 1 )  --previous endpos=startpos
   ...>   union
   ...>   select L.EndPos||'+content' as  RowContent
   ...>        ,(L.EndPos + 6) as EndPos  -- new endpos
   ...>     from Lines L join FileBlob B on B.id==1
   ...>    where L.EndPos < length(B.content)
   ...> )
   ...> select * from Lines where RowContent notNull;
┌─────────────┬────────┐
│ RowContent  │ EndPos │
├─────────────┼────────┤
│ 1+content   │ 7      │
│ 7+content   │ 13     │
│ 13+content  │ 19     │
│ 19+content  │ 25     │
│ 25+content  │ 31     │
│ 31+content  │ 37     │
...
│ 469+content │ 475    │
│ 475+content │ 481    │
│ 481+content │ 487    │
└─────────────┴────────┘
sqlite>

now replacing the constant values with results from the instr() function, using the FileBlob and its content size:

sqlite> .mode line
sqlite> with recursive Lines(RowContent,EndPos) as (
   ...>   values( NULL, 1 )  --previous endpos=startpos
   ...>   union
   ...>   select substr(B.content,L.EndPos, instr(substr(B.content,L.EndPos),B.separator) ) as RowContent
   ...>        , L.EndPos + instr(substr(B.content,L.EndPos),B.separator) + 2               as EndPos
   ...>     from Lines L join FileBlob B on B.id==1
   ...>    where L.EndPos between 1 and B.size    -- recursion end condition
   ...> )
   ...> select * from Lines where RowContent notNull;
RowContent = Hardside Cabin Luggage Trolley Black
senator
View Product Details
N18602991A      KH132-20_BLK
KH13220BLK
02
Days
04
Hrs
25
Mins
25
Sec
1
1
0

    EndPos = 141

RowContent = Hardside 4 Wheels Cabin Luggage Trolley Burgundy
senator
View Product Details
N42588980A      KH134-20_BGN
KH13420BGN
02
Days
04
Hrs
25
Mins
25
Sec
1
1
0

    EndPos = 293

RowContent = Softside Cabin Luggage Trolley Purple
senator
View Product Details
N32139616A      LL051-20_PRP
LL05120PRP
02
Days
09
Hrs
25
Mins
25
Sec
1
1
0

    EndPos = 434
sqlite>

or, for easier readability, with the line breaks replaced by spaces in .box mode:

sqlite> .mode box
sqlite> with recursive Lines(StartPos,RowContent,EndPos) as (
   ...>   values( NULL, NULL, 1 )  --previous endpos=startpos
   ...>   union
   ...>   select L.EndPos                                                                   as StartPos
   ...>        , substr(B.content,L.EndPos, instr(substr(B.content,L.EndPos),B.separator) ) as RowContent
   ...>        , L.EndPos + instr(substr(B.content,L.EndPos),B.separator) + 2               as EndPos
   ...>     from Lines L join FileBlob B on B.id==1
   ...>    where L.EndPos between 1 and B.size    -- recursion end condition
   ...> )
   ...> select StartPos,replace(RowContent,X'0a',' ')Content,EndPos
   ...>   from Lines
   ...>  where RowContent notNull;
┌──────────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────┐
│ StartPos │                                                                        Content                                                                         │ EndPos │
├──────────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────┤
│ 1        │ Hardside Cabin Luggage Trolley Black senator View Product Details N18602991A       KH132-20_BLK     KH13220BLK 02 Days 04 Hrs 25 Mins 25 Sec 1 1 0              │ 141    │
│ 141      │ Hardside 4 Wheels Cabin Luggage Trolley Burgundy senator View Product Details N42588980A   KH134-20_BGN     KH13420BGN 02 Days 04 Hrs 25 Mins 25 Sec 1 1 0  │ 293    │
│ 293      │ Softside Cabin Luggage Trolley Purple senator View Product Details N32139616A      LL051-20_PRP     LL05120PRP 02 Days 09 Hrs 25 Mins 25 Sec 1 1 0             │ 434    │
└──────────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────┘
sqlite>

So far this just proves that reading blocks of lines can be done with standard SQLite3, with the help of the instr() and substr() functions.

The next exercise will be to take those rows and separate the fields out of them... to be continued
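
In the meantime, here is a rough sketch of how that next step could look (my own guess at a continuation, not MBL's code): put each extracted record into a table, then split its text on the single-newline separator with a second recursive CTE.

-- Assumed setup: a table Records(id INTEGER PRIMARY KEY, RowContent TEXT)
-- holding one extracted record per row (e.g. filled from the Lines CTE above).
WITH RECURSIVE Fields(id, rest, field, fieldno) AS (
  SELECT id, RowContent || char(10), NULL, 0 FROM Records
  UNION ALL
  SELECT id,
         substr(rest, instr(rest, char(10)) + 1),     -- remainder after this field
         substr(rest, 1, instr(rest, char(10)) - 1),  -- the field just split off
         fieldno + 1
    FROM Fields
   WHERE rest <> '' AND instr(rest, char(10)) > 0
)
SELECT id, fieldno, field
  FROM Fields
 WHERE field IS NOT NULL AND field <> ''   -- drop the seed row and empty trailing fields
 ORDER BY id, fieldno;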

(9) By Simon Slavin (slavin) on 2021-07-12 13:50:32 in reply to 8 [link] [source]

I'm impressed. And will have to learn more about USING.

Part of my post remains: that's a neat piece of programming. To come up with something like the above you need to be a programmer and to think like a programmer.

(10.1) By jake on 2021-07-12 14:22:06 edited from 10.0 in reply to 1 [link] [source]

Here's a solution utilising the pivot_vtab extension:

--
-- Import data
--
.mode csv
CREATE TABLE import(val);
.import tmp.txt import

--
-- Create groups based on the 2 blank rows rule
--
ALTER TABLE import ADD COLUMN g INT;

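-- Mark the start of each group: row 1, plus any row immediately following
-- two consecutive blank rows, gets its own rowid as its group id (g).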
UPDATE import
   SET g = oid
 WHERE oid = 1
    OR oid IN (SELECT i2.oid+1
                 FROM import i
                 JOIN import i2 ON i.oid = i2.oid-1
                               AND i2.val = ''
                WHERE i.val = '');

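-- Propagate each group id forward: every still-unmarked row takes the g of the
-- nearest preceding marked row (the LEFT JOIN on i3 checks that no other
-- marked row lies in between).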
UPDATE import
   SET g = i2.g
  FROM import i2
  LEFT JOIN
       import i3 ON import.oid > i3.oid
                AND i2.oid < i3.oid
                AND i3.g IS NOT NULL
 WHERE import.g IS NULL
   AND import.oid > i2.oid
   AND i2.g IS NOT NULL
   AND i3.oid IS NULL;

--
-- Number the columns
--
ALTER TABLE import ADD COLUMN c INT;

UPDATE import
   SET c = x.c
  FROM (SELECT oid, Row_Number() OVER (PARTITION BY g ORDER BY oid) c
          FROM import) x 
 WHERE import.oid = x.oid;

--
-- Pivot
--
.load pivot_vtab
CREATE VIRTUAL TABLE pivot USING pivot_vtab(
  -- Pivot table row query
  (SELECT g
     FROM import
    GROUP BY g),

  -- Pivot table column definition query
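  -- (the Max(c)-2 presumably excludes the two trailing blank separator rows
  --  that get counted at the end of each group)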
  (SELECT c,
          'col' || c
     FROM import
    WHERE c <= (SELECT Max(c)-2 FROM import)
    GROUP BY c
    ORDER BY c),

  -- Pivot query
  (SELECT val
     FROM import 
    WHERE g = ?1
      AND c = ?2)
);

.mode column
SELECT *
  FROM pivot;
g   col1                                              col2     col3                  col4                      col5        col6  col7  col8  col9  col10  col11  col12  col13  col14  col15  col16
--  ------------------------------------------------  -------  --------------------  ------------------------  ----------  ----  ----  ----  ----  -----  -----  -----  -----  -----  -----  -----
1   Hardside Cabin Luggage Trolley Black              senator  View Product Details  N18602991A	KH132-20_BLK	  KH13220BLK  02    Days  04    Hrs   25     Mins   25     Sec    1      1      0
19  Hardside 4 Wheels Cabin Luggage Trolley Burgundy  senator  View Product Details  N42588980A	KH134-20_BGN	  KH13420BGN  02    Days  04    Hrs   25     Mins   25     Sec    1      1      0
37  Softside Cabin Luggage Trolley Purple             senator  View Product Details  N32139616A	LL051-20_PRP	  LL05120PRP  02    Days  09    Hrs   25     Mins   25     Sec    1      1      0

(11) By Murtaza (murtazanw) on 2021-07-14 13:26:27 in reply to 10.1 [source]

I studied the pattern of the stacked data in the single column thoroughly again and found the workaround below. I used the blank rows as anchor points to encode the record number: the values encapsulated between a preceding and a succeeding blank row can be marked as one record.
I have taken a small set of sample data for testing purposes, with a variable number of fields per record.
Below is the sample data in the file test.csv; the first 2 rows are blank, and every subset (an actual record) is also followed by 2 blank rows.

item11
item12


item21
item22
item23


item31
item32
item33
item34

---- step1 # importing the sample data from the file…
.mode csv
.import test.csv test

---- step2(i) # adding row numbers, then deleting one of each pair of blank rows, so that each remaining single blank row becomes the starting/ending point of a record.
DROP TABLE IF EXISTS test1;
CREATE TABLE IF NOT EXISTS test1 AS SELECT *, ROW_NUMBER() OVER() rownum FROM test ;
DELETE FROM test1 WHERE item='' AND MOD(rownum,2)=0;
---- step2(ii) # inserting one blank row at the end as the ending point of the last record, and dropping the row-number column, since it has become inconsistent after deleting the blank rows.
INSERT INTO test1(item) VALUES('');
ALTER TABLE test1 DROP rownum;
---- step2(iii) # adding back the row numbers.
DROP TABLE IF EXISTS itemrow;
CREATE TABLE IF NOT EXISTS itemrow AS SELECT *, ROW_NUMBER() OVER() rownum FROM test1 ;
--- the result from step 2 looks like this:
.mode csv
.headers on
SELECT * FROM itemrow;
item,rownum
"",1
item11,2
item12,3
"",4
item21,5
item22,6
item23,7
"",8
item31,9
item32,10
item33,11
item34,12
"",13
"",14
--- there are 2 blank rows at the end, but that hardly affects the process.
---- step3 # creating the range table as a view, which consists of 3 columns:
(i)	record number (recno): numbers the filtered blank rows; each blank row represents the starting point of a record, so each blank row identifies one record.
(ii)	range from (rngfrm): the row numbers of all the filtered blank rows.
(iii)	range to (rngto): the row numbers of all the filtered blank rows, skipping the first one, because the second blank row is the ending point of the first record and closes its range. In the same way, for the second record the second blank row is the starting point and the third blank row is the ending point. The query in action explains it well.

DROP VIEW IF EXISTS range;
CREATE VIEW IF NOT EXISTS range AS 
WITH 
range1 AS (SELECT ROW_NUMBER() OVER() recno, rownum rngfrm FROM itemrow WHERE item=''),
range2 AS (SELECT ROW_NUMBER() OVER() recno, rownum rngto FROM itemrow WHERE item='' AND rownum>1)
SELECT recno, rngfrm, rngto FROM range1 INNER JOIN range2 USING(recno);
--- here is what the range view looks like, showing each record number with its from and to range points.
SELECT * FROM range;
recno,rngfrm,rngto
1,1,4
2,4,8
3,8,13
4,13,14

---- step4 # applying the range view to the data table to obtain the record number and field number of each row.
DROP TABLE IF EXISTS itemrecfld;
CREATE TABLE IF NOT EXISTS itemrecfld AS SELECT item, recno, ROW_NUMBER() OVER(PARTITION BY recno ORDER BY rownum) fldno FROM itemrow JOIN range WHERE rownum BETWEEN rngfrm AND rngto AND item<>'';

--- here is how the table looks, with record number and field number.
SELECT * FROM itemrecfld;
item,recno,fldno
item11,1,1
item12,1,2
item21,2,1
item22,2,2
item23,2,3
item31,3,1
item32,3,2
item33,3,3
item34,3,4

---- step5 # final step: pivoting the data based on field number, grouped by record number, which gives the desired proper table in multiple columns. However, there is no dynamic way to assign the column names and place the data into the relevant fields on the fly; you have to hardcode each column separately, which is not a big task. You can hardcode an extra column or two beyond the maximum number of fields.
SELECT
		MAX(CASE WHEN fldno=1 THEN item END) col1,
		MAX(CASE WHEN fldno=2 THEN item END) col2,
		MAX(CASE WHEN fldno=3 THEN item END) col3,
		MAX(CASE WHEN fldno=4 THEN item END) col4,
		MAX(CASE WHEN fldno=5 THEN item END) col5,
		MAX(CASE WHEN fldno=6 THEN item END) col6
FROM itemrecfld GROUP BY recno;
--- voila, done. See:
col1,col2,col3,col4,col5,col6
item11,item12,,,,
item21,item22,item23,,,
item31,item32,item33,item34,,
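
As a possible refinement to step 5 (my own sketch, not part of the workaround above): the hardcoded MAX(CASE ...) column list can itself be generated with group_concat() and then pasted into the final query.

-- Generate one "MAX(CASE WHEN fldno=N THEN item END) colN" line per distinct
-- field number found in itemrecfld; copy the result into the step-5 query.
-- (The ordering of group_concat() is not guaranteed, but in practice it
--  follows the subquery's ORDER BY.)
SELECT group_concat(
         'MAX(CASE WHEN fldno=' || fldno || ' THEN item END) col' || fldno,
         ',' || char(10)
       ) AS column_list
  FROM (SELECT DISTINCT fldno FROM itemrecfld ORDER BY fldno);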