[Feature Request] split string table-valued function
(1) By ddevienne on 2021-11-25 09:49:11 [link]
Now that SQLite support table-valued functions, could we please have `split`? And conversely, a `join` aggregate / window function would also be great, no? These belong in the SQLite Core IMHO. And in the mean time, can someone please share how to emulate `split` with a CTE? I have an existing DB with a column containing space separate values, which I need to split / transpose into rows for further processing (joins, etc...)
(2.1) By Ryan Smith (cuz) on 2021-11-25 10:51:01 edited from 2.0 in reply to 1 [link]
This demonstrates a way to split and flatten separated values in a column. It uses one column of the input as a key to identify which row the separated values belong to and lists it accordingly. Not sure if it is exactly what you needed, but say if not, it should be easy to adapt the format. ``` -- SQLite version 3.35.4 [ Release: 2021-04-02 ] on SQLitespeed version 18.104.22.168. ================================================================================================ -- Make temporary Space-Separated-Value table DROP TABLE IF EXISTS tmpssv; CREATE TABLE tmpssv ( ID INTEGER PRIMARY KEY, keyCol TEXT, ssvCol TEXT ); -- And fill it with some space-separated content INSERT INTO tmpssv (ID, keyCol, ssvCol) VALUES (1, 'foo', '4 66 51 3009 2 678') , (2, 'bar', 'Sputnik Discovery') , (3, 'baz', '101 I-95 104') , (4, 'foz', 'Amsterdam Beijing London Moscow New York Paris Tokyo') , (5, 'one', 'John') ; -- The CTE Query that takes a column (string) and iteratively splits off the first value (up to the first space) until no more text is left in the string. WITH ssvrec(i, l, c, r) AS ( SELECT keyCol, 1, ssvCol||' ', '' -- Forcing a space at the end of ssvCol removes complicated checking later FROM tmpssv WHERE 1 UNION ALL SELECT i, instr( c, ' ' ) AS vLength, substr( c, instr( c, ' ' ) + 1) AS vRemainder, trim( substr( c, 1, instr( c, ' ' ) - 1) ) AS vSSV FROM ssvrec WHERE vLength > 0 ) SELECT ID, keyCol, r AS ssvValues FROM tmpssv, ssvrec WHERE keyCol = i AND r <> '' ORDER BY ID, keyCol ; -- I've used compact notation to avoid long wordy SQL, but the short column names basically represent: -- i = Key of unpacked row -- l = Length of text remaining to be unpacked -- c = Column text remaining to be unpacked -- r = Resulting current unpacked (split-off) ssv value -- The Resulting split/unpacked values: -- ID|keyCol|ssvValues -- ---|------|----------- -- 1 | foo |2 -- 1 | foo |3009 -- 1 | foo |4 -- 1 | foo |51 -- 1 | foo |66 -- 1 | foo |678 -- 2 | bar |Discovery -- 2 | bar |Sputnik -- 3 | baz |101 -- 3 | baz |104 -- 3 | baz |I-95 -- 4 | foz |Amsterdam -- 4 | foz |Beijing -- 4 | foz |London -- 4 | foz |Moscow -- 4 | foz |New -- 4 | foz |Paris -- 4 | foz |Tokyo -- 4 | foz |York -- 5 | one |John -- Cleanup DROP TABLE IF EXISTS tmpssv; ```
(3) By ddevienne on 2021-11-25 13:55:29 in reply to 2.1 [link]
Thank you Ryan! Apart from some renaming, for clarity; adding the index in the space-separated-list; and not having your `vLength` column, this is basically your code. ``` create view grids(parent, files) as ... with grid (parent, head, tail, idx) as ( select parent, '', files||' ', 0 from grids UNION ALL select parent, trim(substr(tail, 1, instr(tail, ' ')-1)) as head, substr( tail, instr( tail, ' ' ) + 1) as tail, idx + 1 as idx from grid where instr(tail, ' ') > 0 ) select distinct parent, head as file, idx from grid where length(trim(head)) > 0 order by parent, idx, file ```
(4) By Ryan Smith (cuz) on 2021-11-25 15:28:32 in reply to 3 [link]
Good job! Fair warning (as you probably already know) - That will be quite slow for larger datasets. It has the Advantage/Disadvantage of causing multiple output rows which a simple UDF cannot easily do. If a "split()" function did exist (even as a UDF), we could get rid of a lot of the fluff in that CTE and speed it up significantly. General musings: I wonder if there is a precedent in Postgres or even MySQL or such, I don't recall ever using/finding such a function in any SQL, which probably suggests it's not universally needed, but might be useful still. Will need to be Unicode-aware I suppose.
(5) By ddevienne on 2021-11-25 16:00:30 in reply to 4 [link]
> That will be quite slow for larger datasets It takes milliseconds in my case, no biggy. Not a lot of data. > If a "split()" ... did exist .. could .. speed it up significantly Indeed. And that's precisely why I requested it!!! Using the CTE is jumping to significant hoops to split a string into rows IMHO. Yes, such data is denormalized, but unfortunately that's very common... > if there is a precedent in Postgres There is, via [`unnest(string_to_array(col, sep))`][SO] I.e. first to an array (I wish SQLite had native arrays like PostgreSQL :)), then one uses the [`unnest`][UN] table-valued function (`RETURNS SETOF`). [UN]: https://www.postgresql.org/docs/current/functions-array.html [SO]: https://stackoverflow.com/questions/29419993/split-column-into-multiple-rows-in-postgres
(6) By ddevienne on 2021-11-25 16:03:41 in reply to 4 [link]
> Will need to be Unicode-aware I suppose Not really. Both sides (string and separator) are strings, so *unicode* already. So already encoded to some bytes. Just match the bytes. The real design decision is whether the *separator* is a set of chars, or a text-fragment; and whether to *compress* / *ignore* repeated *separator*. Ideally those are options of `split()`.
(7) By ddevienne on 2021-11-26 15:00:49 in reply to 4
> If a "split()" function did exist To *wrap-up* this thread, and hopefully convince Richard and team to add a `split` table-valued function to SQLite Core, I'd like the illustrate the stark difference in SQL necessary w/ and w/o such a function. So I created two concrete tables, with the same data, one with the original space-separated values, and the other using JSON. And to build the JSON-formatted one, I actually used the `grid` view using a CTE that Ryan helped me define: ``` create table grids_ssv as select parent, files from grids create table grids_json as select parent, '["'||group_concat(file, '","')||'"]' as files from (select parent, file, idx from grid order by parent, idx) group by parent ``` Here the **simple** SQL using JSON1: ``` select g.parent, j.id, j.value from grids_json g, json_each(files) j order by g.parent, j.id ``` And here's the **complex** CTE from Ryan and myself: ``` with grid (parent, head, tail, idx) as ( select parent, '', files||' ', 0 from grids_ssv UNION ALL select parent, trim(substr(tail, 1, instr(tail, ' ')-1)) as head, substr( tail, instr( tail, ' ' ) + 1) as tail, idx + 1 as idx from grid where instr(tail, ' ') > 0 ) select distinct parent, head as file, idx from grid where length(trim(head)) > 0 order by parent, idx ``` There's no contest IMHO. If this doesn't convince anything this belongs in SQLite's built-in functions, I don't know what will basically. In my case, I don't have enough data to contrast performance. 21 rows expand into 82 rows. Both queries range from 1ms to 4ms in SQLiteSpy, depending on the run. But we mostly all agree the table-valued one will fair better for larger data, especially larger strings to split. On the `join` side, that's a *keyword*, and `group_concat` already does that basically, so I don't know what I was thinking when I wrote my original feature request ;) So Richard, could we please have something like `split`? Xmas is not far :) --DD
(9) By SeaWolf (LeeBorgea) on 2022-04-17 02:06:51 in reply to 2.1 [link]
Kudos for providing this example, perfect for the use case I have. Many thanks !
(8) By ET (EricTsau) on 2021-11-28 07:28:10 in reply to 1 [link]
You can adapt the answer to this question to your needs. <https://stackoverflow.com/questions/24258878/how-to-split-comma-separated-value-in-sqlite/32051164#32051164> ``` WITH split(word, str) AS ( -- alternatively put your query here -- SELECT '', category||' ' FROM categories SELECT '', 'Auto A 1234444'||' ' UNION ALL SELECT substr(str, 0, instr(str, ' ')), substr(str, instr(str, ' ')+1) FROM split WHERE str!='' ) SELECT word FROM split WHERE word!=''; ``` Output is as expected: ``` Auto A 1234444```