[Feature Request] split string table-valued function
(1) By ddevienne on 2021-11-25 09:49:11 [link] [source]
Now that SQLite support table-valued functions, could we please have split
?
And conversely, a join
aggregate / window function would also be great, no?
These belong in the SQLite Core IMHO.
And in the mean time, can someone please share how to emulate split
with a CTE?
I have an existing DB with a column containing space separate values,
which I need to split / transpose into rows for further processing (joins, etc...)
(2) By Ryan Smith (cuz) on 2021-11-25 10:48:30 in reply to 1 updated by 2.1 [link] [source]
This demonstrates a way to split and flatten separated values in a column. It uses one column of the input as a key to identify which row the separated values belong to and lists it accordingly. Not sure if it is exactly what you needed, but say if not, it should be easy to adapt the format. ``` -- SQLite version 3.35.4 [ Release: 2021-04-02 ] on SQLitespeed version 2.1.3.11. ================================================================================================ -- Make temporary Space-Separated-Value table DROP TABLE IF EXISTS tmpssv; CREATE TABLE tmpssv ( ID INTEGER PRIMARY KEY, keyCol TEXT, ssvCol TEXT ); -- And fill it with some space-separated content INSERT INTO tmpssv (ID, keyCol, ssvCol) VALUES (1, 'foo', '4 66 51 3009 2 678') , (2, 'bar', 'Sputnik Discovery') , (3, 'baz', '101 I-95 104') , (4, 'foz', 'Amsterdam Beijing London Moscow New York Paris Tokyo') , (5, 'one', 'John') ; -- The CTE Query that takes a column (string) and iteratively splits off the first value (up to the first space) until no more text is left in the string. WITH ssvrec(i, l, c, r) AS ( SELECT keyCol, 1, ssvCol||' ', '' -- Forcing a space at the end of ssvCol removes complicated checking later FROM tmpssv WHERE 1 UNION ALL SELECT i, instr( c, ' ' ) AS vLength, substr( c, instr( c, ' ' ) + 1) AS vRemainder, trim( substr( c, 1, instr( c, ' ' ) - 1) ) AS vSSV FROM ssvrec WHERE vLength > 0 ) SELECT ID, keyCol, r AS ssvValues FROM tmpssv, ssvrec WHERE keyCol = i AND r <> '' ORDER BY ID, keyCol ; -- I've used compact notation to avoid long wordy SQL, but the short column names basically represent: -- i = Index of unpacked row -- l = Length of text remaining to be unpacked -- c = Column text remaining to be unpacked -- r = Resulting current unpacked (split-off) ssv value -- The Resulting split/unpacked values: -- ID|keyCol|ssvValues -- ---|------|----------- -- 1 | foo |2 -- 1 | foo |3009 -- 1 | foo |4 -- 1 | foo |51 -- 1 | foo |66 -- 1 | foo |678 -- 2 | bar |Discovery -- 2 | bar |Sputnik -- 3 | baz |101 -- 3 | baz |104 -- 3 | baz |I-95 -- 4 | foz |Amsterdam -- 4 | foz |Beijing -- 4 | foz |London -- 4 | foz |Moscow -- 4 | foz |New -- 4 | foz |Paris -- 4 | foz |Tokyo -- 4 | foz |York -- 5 | one |John -- Cleanup DROP TABLE IF EXISTS tmpssv; ```
(2.1) By Ryan Smith (cuz) on 2021-11-25 10:51:01 edited from 2.0 in reply to 1 [link] [source]
This demonstrates a way to split and flatten separated values in a column. It uses one column of the input as a key to identify which row the separated values belong to and lists it accordingly.
Not sure if it is exactly what you needed, but say if not, it should be easy to adapt the format.
-- SQLite version 3.35.4 [ Release: 2021-04-02 ] on SQLitespeed version 2.1.3.11.
================================================================================================
-- Make temporary Space-Separated-Value table
DROP TABLE IF EXISTS tmpssv;
CREATE TABLE tmpssv (
ID INTEGER PRIMARY KEY,
keyCol TEXT,
ssvCol TEXT
);
-- And fill it with some space-separated content
INSERT INTO tmpssv (ID, keyCol, ssvCol) VALUES
(1, 'foo', '4 66 51 3009 2 678')
, (2, 'bar', 'Sputnik Discovery')
, (3, 'baz', '101 I-95 104')
, (4, 'foz', 'Amsterdam Beijing London Moscow New York Paris Tokyo')
, (5, 'one', 'John')
;
-- The CTE Query that takes a column (string) and iteratively splits off the first value (up to the first space) until no more text is left in the string.
WITH ssvrec(i, l, c, r) AS (
SELECT keyCol, 1,
ssvCol||' ', '' -- Forcing a space at the end of ssvCol removes complicated checking later
FROM tmpssv
WHERE 1
UNION ALL
SELECT i,
instr( c, ' ' ) AS vLength,
substr( c, instr( c, ' ' ) + 1) AS vRemainder,
trim( substr( c, 1, instr( c, ' ' ) - 1) ) AS vSSV
FROM ssvrec
WHERE vLength > 0
)
SELECT ID, keyCol, r AS ssvValues
FROM tmpssv, ssvrec
WHERE keyCol = i AND r <> ''
ORDER BY ID, keyCol
;
-- I've used compact notation to avoid long wordy SQL, but the short column names basically represent:
-- i = Key of unpacked row
-- l = Length of text remaining to be unpacked
-- c = Column text remaining to be unpacked
-- r = Resulting current unpacked (split-off) ssv value
-- The Resulting split/unpacked values:
-- ID|keyCol|ssvValues
-- ---|------|-----------
-- 1 | foo |2
-- 1 | foo |3009
-- 1 | foo |4
-- 1 | foo |51
-- 1 | foo |66
-- 1 | foo |678
-- 2 | bar |Discovery
-- 2 | bar |Sputnik
-- 3 | baz |101
-- 3 | baz |104
-- 3 | baz |I-95
-- 4 | foz |Amsterdam
-- 4 | foz |Beijing
-- 4 | foz |London
-- 4 | foz |Moscow
-- 4 | foz |New
-- 4 | foz |Paris
-- 4 | foz |Tokyo
-- 4 | foz |York
-- 5 | one |John
-- Cleanup
DROP TABLE IF EXISTS tmpssv;
(3) By ddevienne on 2021-11-25 13:55:29 in reply to 2.1 [link] [source]
Thank you Ryan!
Apart from some renaming, for clarity;
adding the index in the space-separated-list;
and not having your vLength
column, this is basically your code.
create view grids(parent, files) as ...
with
grid (parent, head, tail, idx) as
(
select parent, '', files||' ', 0
from grids
UNION ALL
select parent,
trim(substr(tail, 1, instr(tail, ' ')-1)) as head,
substr( tail, instr( tail, ' ' ) + 1) as tail,
idx + 1 as idx
from grid
where instr(tail, ' ') > 0
)
select distinct parent, head as file, idx
from grid
where length(trim(head)) > 0
order by parent, idx, file
(4) By Ryan Smith (cuz) on 2021-11-25 15:28:32 in reply to 3 [link] [source]
Good job!
Fair warning (as you probably already know) - That will be quite slow for larger datasets. It has the Advantage/Disadvantage of causing multiple output rows which a simple UDF cannot easily do.
If a "split()" function did exist (even as a UDF), we could get rid of a lot of the fluff in that CTE and speed it up significantly.
General musings:
I wonder if there is a precedent in Postgres or even MySQL or such, I don't recall ever using/finding such a function in any SQL, which probably suggests it's not universally needed, but might be useful still. Will need to be Unicode-aware I suppose.
(5) By ddevienne on 2021-11-25 16:00:30 in reply to 4 [link] [source]
That will be quite slow for larger datasets
It takes milliseconds in my case, no biggy. Not a lot of data.
If a "split()" ... did exist .. could .. speed it up significantly
Indeed. And that's precisely why I requested it!!!
Using the CTE is jumping to significant hoops to split a string into rows IMHO.
Yes, such data is denormalized, but unfortunately that's very common...
if there is a precedent in Postgres
There is, via unnest(string_to_array(col, sep))
I.e. first to an array (I wish SQLite had native arrays like PostgreSQL :)),
then one uses the unnest
table-valued function (RETURNS SETOF
).
(6) By ddevienne on 2021-11-25 16:03:41 in reply to 4 [link] [source]
Will need to be Unicode-aware I suppose
Not really. Both sides (string and separator) are strings, so unicode already.
So already encoded to some bytes. Just match the bytes.
The real design decision is whether the separator is a set of chars,
or a text-fragment; and whether to compress / ignore repeated separator.
Ideally those are options of split()
.
(7) By ddevienne on 2021-11-26 15:00:49 in reply to 4 [source]
If a "split()" function did exist
To wrap-up this thread, and hopefully convince Richard and team to add a split
table-valued function to SQLite Core, I'd like the illustrate the stark difference
in SQL necessary w/ and w/o such a function.
So I created two concrete tables, with the same data, one with the original
space-separated values, and the other using JSON. And to build the JSON-formatted
one, I actually used the grid
view using a CTE that Ryan helped me define:
create table grids_ssv as
select parent, files from grids
create table grids_json as
select parent, '["'||group_concat(file, '","')||'"]' as files
from (select parent, file, idx from grid order by parent, idx)
group by parent
Here the simple SQL using JSON1:
select g.parent, j.id, j.value
from grids_json g, json_each(files) j
order by g.parent, j.id
And here's the complex CTE from Ryan and myself:
with
grid (parent, head, tail, idx) as
(
select parent, '', files||' ', 0
from grids_ssv
UNION ALL
select parent,
trim(substr(tail, 1, instr(tail, ' ')-1)) as head,
substr( tail, instr( tail, ' ' ) + 1) as tail,
idx + 1 as idx
from grid
where instr(tail, ' ') > 0
)
select distinct parent, head as file, idx
from grid
where length(trim(head)) > 0
order by parent, idx
There's no contest IMHO. If this doesn't convince anything this belongs in SQLite's built-in functions,
I don't know what will basically.
In my case, I don't have enough data to contrast performance. 21 rows expand into 82 rows.
Both queries range from 1ms to 4ms in SQLiteSpy, depending on the run. But we mostly all
agree the table-valued one will fair better for larger data, especially larger strings to split.
On the join
side, that's a keyword, and group_concat
already does that basically,
so I don't know what I was thinking when I wrote my original feature request ;)
So Richard, could we please have something like split
? Xmas is not far :) --DD
(8) By ET (EricTsau) on 2021-11-28 07:28:10 in reply to 1 [link] [source]
You can adapt the answer to this question to your needs. https://stackoverflow.com/questions/24258878/how-to-split-comma-separated-value-in-sqlite/32051164#32051164
WITH split(word, str) AS (
-- alternatively put your query here
-- SELECT '', category||' ' FROM categories
SELECT '', 'Auto A 1234444'||' '
UNION ALL SELECT
substr(str, 0, instr(str, ' ')),
substr(str, instr(str, ' ')+1)
FROM split WHERE str!=''
) SELECT word FROM split WHERE word!='';
Output is as expected:
Auto
A
1234444
(9) By SeaWolf (LeeBorgea) on 2022-04-17 02:06:51 in reply to 2.1 [link] [source]
Kudos for providing this example, perfect for the use case I have.
Many thanks !