[Feature Request] split string table-valued function

(1) By ddevienne on 2021-11-25 09:49:11 [link] [source]

Now that SQLite support table-valued functions, could we please have split?
And conversely, a join aggregate / window function would also be great, no?
These belong in the SQLite Core IMHO.

And in the mean time, can someone please share how to emulate split with a CTE?

I have an existing DB with a column containing space separate values,
which I need to split / transpose into rows for further processing (joins, etc...)

(2) By Ryan Smith (cuz) on 2021-11-25 10:48:30 in reply to 1 updated by 2.1 [link] [source]

This demonstrates a way to split and flatten separated values in a column.  It uses one column of the input as a key to identify which row the separated values belong to and lists it accordingly.   
Not sure if it is exactly what you needed, but say if not, it should be easy to adapt the format.

```
  -- SQLite version 3.35.4  [ Release: 2021-04-02 ]  on  SQLitespeed version 2.1.3.11.
================================================================================================

-- Make temporary Space-Separated-Value table

DROP TABLE IF EXISTS tmpssv;
CREATE TABLE tmpssv (     
  ID INTEGER PRIMARY KEY,
  keyCol TEXT,
  ssvCol TEXT
);


-- And fill it with some space-separated content

INSERT INTO tmpssv (ID, keyCol, ssvCol) VALUES
    (1, 'foo', '4 66 51 3009 2 678')
  , (2, 'bar', 'Sputnik Discovery')
  , (3, 'baz', '101 I-95 104')
  , (4, 'foz', 'Amsterdam  Beijing  London  Moscow  New York  Paris  Tokyo')
  , (5, 'one', 'John')
;


-- The CTE Query that takes a column (string) and iteratively splits off the first value (up to the first space) until no more text is left in the string.

WITH ssvrec(i, l, c, r) AS (
      SELECT keyCol, 1,
             ssvCol||' ', '' -- Forcing a space at the end of ssvCol removes complicated checking later
        FROM tmpssv
       WHERE 1
    UNION ALL
      SELECT i,
             instr( c, ' ' ) AS vLength,
             substr( c, instr( c, ' ' ) + 1) AS vRemainder,
             trim( substr( c, 1, instr( c, ' ' ) - 1) ) AS vSSV
        FROM ssvrec
       WHERE vLength > 0
    )
  SELECT ID, keyCol, r AS ssvValues
    FROM tmpssv, ssvrec
  WHERE keyCol = i AND r <> ''
  ORDER BY ID, keyCol
;

-- I've used compact notation to avoid long wordy SQL, but the short column names basically represent:
--   i = Index of unpacked row
--   l = Length of text remaining to be unpacked
--   c = Column text remaining to be unpacked
--   r = Resulting current unpacked (split-off) ssv value


-- The Resulting split/unpacked values:

  --  ID|keyCol|ssvValues  
  -- ---|------|-----------
  --  1 |  foo |2          
  --  1 |  foo |3009       
  --  1 |  foo |4          
  --  1 |  foo |51         
  --  1 |  foo |66         
  --  1 |  foo |678        
  --  2 |  bar |Discovery  
  --  2 |  bar |Sputnik    
  --  3 |  baz |101        
  --  3 |  baz |104        
  --  3 |  baz |I-95       
  --  4 |  foz |Amsterdam  
  --  4 |  foz |Beijing    
  --  4 |  foz |London     
  --  4 |  foz |Moscow     
  --  4 |  foz |New        
  --  4 |  foz |Paris      
  --  4 |  foz |Tokyo      
  --  4 |  foz |York       
  --  5 |  one |John       


-- Cleanup

DROP TABLE IF EXISTS tmpssv;

```

(2.1) By Ryan Smith (cuz) on 2021-11-25 10:51:01 edited from 2.0 in reply to 1 [link] [source]

This demonstrates a way to split and flatten separated values in a column. It uses one column of the input as a key to identify which row the separated values belong to and lists it accordingly.
Not sure if it is exactly what you needed, but say if not, it should be easy to adapt the format.

  -- SQLite version 3.35.4  [ Release: 2021-04-02 ]  on  SQLitespeed version 2.1.3.11.
================================================================================================

-- Make temporary Space-Separated-Value table

DROP TABLE IF EXISTS tmpssv;
CREATE TABLE tmpssv (     
  ID INTEGER PRIMARY KEY,
  keyCol TEXT,
  ssvCol TEXT
);


-- And fill it with some space-separated content

INSERT INTO tmpssv (ID, keyCol, ssvCol) VALUES
    (1, 'foo', '4 66 51 3009 2 678')
  , (2, 'bar', 'Sputnik Discovery')
  , (3, 'baz', '101 I-95 104')
  , (4, 'foz', 'Amsterdam  Beijing  London  Moscow  New York  Paris  Tokyo')
  , (5, 'one', 'John')
;


-- The CTE Query that takes a column (string) and iteratively splits off the first value (up to the first space) until no more text is left in the string.

WITH ssvrec(i, l, c, r) AS (
      SELECT keyCol, 1,
             ssvCol||' ', '' -- Forcing a space at the end of ssvCol removes complicated checking later
        FROM tmpssv
       WHERE 1
    UNION ALL
      SELECT i,
             instr( c, ' ' ) AS vLength,
             substr( c, instr( c, ' ' ) + 1) AS vRemainder,
             trim( substr( c, 1, instr( c, ' ' ) - 1) ) AS vSSV
        FROM ssvrec
       WHERE vLength > 0
    )
  SELECT ID, keyCol, r AS ssvValues
    FROM tmpssv, ssvrec
  WHERE keyCol = i AND r <> ''
  ORDER BY ID, keyCol
;

-- I've used compact notation to avoid long wordy SQL, but the short column names basically represent:
--   i = Key of unpacked row
--   l = Length of text remaining to be unpacked
--   c = Column text remaining to be unpacked
--   r = Resulting current unpacked (split-off) ssv value


-- The Resulting split/unpacked values:

  --  ID|keyCol|ssvValues  
  -- ---|------|-----------
  --  1 |  foo |2          
  --  1 |  foo |3009       
  --  1 |  foo |4          
  --  1 |  foo |51         
  --  1 |  foo |66         
  --  1 |  foo |678        
  --  2 |  bar |Discovery  
  --  2 |  bar |Sputnik    
  --  3 |  baz |101        
  --  3 |  baz |104        
  --  3 |  baz |I-95       
  --  4 |  foz |Amsterdam  
  --  4 |  foz |Beijing    
  --  4 |  foz |London     
  --  4 |  foz |Moscow     
  --  4 |  foz |New        
  --  4 |  foz |Paris      
  --  4 |  foz |Tokyo      
  --  4 |  foz |York       
  --  5 |  one |John       


-- Cleanup

DROP TABLE IF EXISTS tmpssv;

(3) By ddevienne on 2021-11-25 13:55:29 in reply to 2.1 [link] [source]

Thank you Ryan!

Apart from some renaming, for clarity;
adding the index in the space-separated-list;
and not having your vLength column, this is basically your code.

create view grids(parent, files) as ...

with
grid (parent, head, tail, idx) as
(
    select parent, '', files||' ', 0
      from grids
     UNION ALL
    select parent,
           trim(substr(tail, 1, instr(tail, ' ')-1)) as head,
           substr( tail, instr( tail, ' ' ) + 1) as tail,
           idx + 1 as idx
      from grid
     where instr(tail, ' ') > 0
)
select distinct parent, head as file, idx
  from grid
 where length(trim(head)) > 0
 order by parent, idx, file

(4) By Ryan Smith (cuz) on 2021-11-25 15:28:32 in reply to 3 [link] [source]

Good job!

Fair warning (as you probably already know) - That will be quite slow for larger datasets. It has the Advantage/Disadvantage of causing multiple output rows which a simple UDF cannot easily do.

If a "split()" function did exist (even as a UDF), we could get rid of a lot of the fluff in that CTE and speed it up significantly.

General musings:
I wonder if there is a precedent in Postgres or even MySQL or such, I don't recall ever using/finding such a function in any SQL, which probably suggests it's not universally needed, but might be useful still. Will need to be Unicode-aware I suppose.

(5) By ddevienne on 2021-11-25 16:00:30 in reply to 4 [link] [source]

That will be quite slow for larger datasets

It takes milliseconds in my case, no biggy. Not a lot of data.

If a "split()" ... did exist .. could .. speed it up significantly

Indeed. And that's precisely why I requested it!!!

Using the CTE is jumping to significant hoops to split a string into rows IMHO.
Yes, such data is denormalized, but unfortunately that's very common...

if there is a precedent in Postgres

There is, via unnest(string_to_array(col, sep))
I.e. first to an array (I wish SQLite had native arrays like PostgreSQL :)),
then one uses the unnest table-valued function (RETURNS SETOF).

(6) By ddevienne on 2021-11-25 16:03:41 in reply to 4 [link] [source]

Will need to be Unicode-aware I suppose

Not really. Both sides (string and separator) are strings, so unicode already.
So already encoded to some bytes. Just match the bytes.

The real design decision is whether the separator is a set of chars,
or a text-fragment; and whether to compress / ignore repeated separator.
Ideally those are options of split().

(7) By ddevienne on 2021-11-26 15:00:49 in reply to 4 [source]

If a "split()" function did exist

To wrap-up this thread, and hopefully convince Richard and team to add a split
table-valued function to SQLite Core, I'd like the illustrate the stark difference
in SQL necessary w/ and w/o such a function.

So I created two concrete tables, with the same data, one with the original
space-separated values, and the other using JSON. And to build the JSON-formatted
one, I actually used the grid view using a CTE that Ryan helped me define:

create table grids_ssv as
select parent, files from grids

create table grids_json as
select parent, '["'||group_concat(file, '","')||'"]' as files
  from (select parent, file, idx from grid order by parent, idx)
 group by parent

Here the simple SQL using JSON1:

select g.parent, j.id, j.value
  from grids_json g, json_each(files) j
 order by g.parent, j.id

And here's the complex CTE from Ryan and myself:

with
grid (parent, head, tail, idx) as
(
    select parent, '', files||' ', 0
      from grids_ssv
     UNION ALL
    select parent,
           trim(substr(tail, 1, instr(tail, ' ')-1)) as head,
           substr( tail, instr( tail, ' ' ) + 1) as tail,
           idx + 1 as idx
      from grid
     where instr(tail, ' ') > 0
)
select distinct parent, head as file, idx
  from grid
 where length(trim(head)) > 0
 order by parent, idx

There's no contest IMHO. If this doesn't convince anything this belongs in SQLite's built-in functions,
I don't know what will basically.

In my case, I don't have enough data to contrast performance. 21 rows expand into 82 rows.
Both queries range from 1ms to 4ms in SQLiteSpy, depending on the run. But we mostly all
agree the table-valued one will fair better for larger data, especially larger strings to split.

On the join side, that's a keyword, and group_concat already does that basically,
so I don't know what I was thinking when I wrote my original feature request ;)

So Richard, could we please have something like split? Xmas is not far :) --DD

(8) By ET (EricTsau) on 2021-11-28 07:28:10 in reply to 1 [link] [source]

You can adapt the answer to this question to your needs. https://stackoverflow.com/questions/24258878/how-to-split-comma-separated-value-in-sqlite/32051164#32051164

WITH split(word, str) AS (
    -- alternatively put your query here
    -- SELECT '', category||' ' FROM categories
    SELECT '', 'Auto A 1234444'||' '
    UNION ALL SELECT
    substr(str, 0, instr(str, ' ')),
    substr(str, instr(str, ' ')+1)
    FROM split WHERE str!=''
) SELECT word FROM split WHERE word!='';

Output is as expected:

Auto
A
1234444

(9) By SeaWolf (LeeBorgea) on 2022-04-17 02:06:51 in reply to 2.1 [link] [source]

Kudos for providing this example, perfect for the use case I have.

Many thanks !