[Feature Request] split string table-valued function
Now that SQLite support table-valued functions, could we please have
And conversely, a
join aggregate / window function would also be great, no?
These belong in the SQLite Core IMHO.
And in the mean time, can someone please share how to emulate
split with a CTE?
I have an existing DB with a column containing space separate values,
which I need to split / transpose into rows for further processing (joins, etc...)
This demonstrates a way to split and flatten separated values in a column. It uses one column of the input as a key to identify which row the separated values belong to and lists it accordingly.
Not sure if it is exactly what you needed, but say if not, it should be easy to adapt the format.
-- SQLite version 3.35.4 [ Release: 2021-04-02 ] on SQLitespeed version 184.108.40.206. ================================================================================================ -- Make temporary Space-Separated-Value table DROP TABLE IF EXISTS tmpssv; CREATE TABLE tmpssv ( ID INTEGER PRIMARY KEY, keyCol TEXT, ssvCol TEXT ); -- And fill it with some space-separated content INSERT INTO tmpssv (ID, keyCol, ssvCol) VALUES (1, 'foo', '4 66 51 3009 2 678') , (2, 'bar', 'Sputnik Discovery') , (3, 'baz', '101 I-95 104') , (4, 'foz', 'Amsterdam Beijing London Moscow New York Paris Tokyo') , (5, 'one', 'John') ; -- The CTE Query that takes a column (string) and iteratively splits off the first value (up to the first space) until no more text is left in the string. WITH ssvrec(i, l, c, r) AS ( SELECT keyCol, 1, ssvCol||' ', '' -- Forcing a space at the end of ssvCol removes complicated checking later FROM tmpssv WHERE 1 UNION ALL SELECT i, instr( c, ' ' ) AS vLength, substr( c, instr( c, ' ' ) + 1) AS vRemainder, trim( substr( c, 1, instr( c, ' ' ) - 1) ) AS vSSV FROM ssvrec WHERE vLength > 0 ) SELECT ID, keyCol, r AS ssvValues FROM tmpssv, ssvrec WHERE keyCol = i AND r <> '' ORDER BY ID, keyCol ; -- I've used compact notation to avoid long wordy SQL, but the short column names basically represent: -- i = Key of unpacked row -- l = Length of text remaining to be unpacked -- c = Column text remaining to be unpacked -- r = Resulting current unpacked (split-off) ssv value -- The Resulting split/unpacked values: -- ID|keyCol|ssvValues -- ---|------|----------- -- 1 | foo |2 -- 1 | foo |3009 -- 1 | foo |4 -- 1 | foo |51 -- 1 | foo |66 -- 1 | foo |678 -- 2 | bar |Discovery -- 2 | bar |Sputnik -- 3 | baz |101 -- 3 | baz |104 -- 3 | baz |I-95 -- 4 | foz |Amsterdam -- 4 | foz |Beijing -- 4 | foz |London -- 4 | foz |Moscow -- 4 | foz |New -- 4 | foz |Paris -- 4 | foz |Tokyo -- 4 | foz |York -- 5 | one |John -- Cleanup DROP TABLE IF EXISTS tmpssv;
Thank you Ryan!
Apart from some renaming, for clarity;
adding the index in the space-separated-list;
and not having your
vLength column, this is basically your code.
create view grids(parent, files) as ... with grid (parent, head, tail, idx) as ( select parent, '', files||' ', 0 from grids UNION ALL select parent, trim(substr(tail, 1, instr(tail, ' ')-1)) as head, substr( tail, instr( tail, ' ' ) + 1) as tail, idx + 1 as idx from grid where instr(tail, ' ') > 0 ) select distinct parent, head as file, idx from grid where length(trim(head)) > 0 order by parent, idx, file
Fair warning (as you probably already know) - That will be quite slow for larger datasets. It has the Advantage/Disadvantage of causing multiple output rows which a simple UDF cannot easily do.
If a "split()" function did exist (even as a UDF), we could get rid of a lot of the fluff in that CTE and speed it up significantly.
I wonder if there is a precedent in Postgres or even MySQL or such, I don't recall ever using/finding such a function in any SQL, which probably suggests it's not universally needed, but might be useful still. Will need to be Unicode-aware I suppose.
That will be quite slow for larger datasets
It takes milliseconds in my case, no biggy. Not a lot of data.
If a "split()" ... did exist .. could .. speed it up significantly
Indeed. And that's precisely why I requested it!!!
Using the CTE is jumping to significant hoops to split a string into rows IMHO.
Yes, such data is denormalized, but unfortunately that's very common...
if there is a precedent in Postgres
Will need to be Unicode-aware I suppose
Not really. Both sides (string and separator) are strings, so unicode already.
So already encoded to some bytes. Just match the bytes.
The real design decision is whether the separator is a set of chars,
or a text-fragment; and whether to compress / ignore repeated separator.
Ideally those are options of
If a "split()" function did exist
To wrap-up this thread, and hopefully convince Richard and team to add a
table-valued function to SQLite Core, I'd like the illustrate the stark difference
in SQL necessary w/ and w/o such a function.
So I created two concrete tables, with the same data, one with the original
space-separated values, and the other using JSON. And to build the JSON-formatted
one, I actually used the
grid view using a CTE that Ryan helped me define:
create table grids_ssv as select parent, files from grids create table grids_json as select parent, '["'||group_concat(file, '","')||'"]' as files from (select parent, file, idx from grid order by parent, idx) group by parent
Here the simple SQL using JSON1:
select g.parent, j.id, j.value from grids_json g, json_each(files) j order by g.parent, j.id
And here's the complex CTE from Ryan and myself:
with grid (parent, head, tail, idx) as ( select parent, '', files||' ', 0 from grids_ssv UNION ALL select parent, trim(substr(tail, 1, instr(tail, ' ')-1)) as head, substr( tail, instr( tail, ' ' ) + 1) as tail, idx + 1 as idx from grid where instr(tail, ' ') > 0 ) select distinct parent, head as file, idx from grid where length(trim(head)) > 0 order by parent, idx
There's no contest IMHO. If this doesn't convince anything this belongs in SQLite's built-in functions,
I don't know what will basically.
In my case, I don't have enough data to contrast performance. 21 rows expand into 82 rows.
Both queries range from 1ms to 4ms in SQLiteSpy, depending on the run. But we mostly all
agree the table-valued one will fair better for larger data, especially larger strings to split.
join side, that's a keyword, and
group_concat already does that basically,
so I don't know what I was thinking when I wrote my original feature request ;)
So Richard, could we please have something like
split? Xmas is not far :) --DD
You can adapt the answer to this question to your needs. https://stackoverflow.com/questions/24258878/how-to-split-comma-separated-value-in-sqlite/32051164#32051164
WITH split(word, str) AS ( -- alternatively put your query here -- SELECT '', category||' ' FROM categories SELECT '', 'Auto A 1234444'||' ' UNION ALL SELECT substr(str, 0, instr(str, ' ')), substr(str, instr(str, ' ')+1) FROM split WHERE str!='' ) SELECT word FROM split WHERE word!='';
Output is as expected:
Auto A 1234444