Add split function
(1) By yaugenka on 2022-05-20 06:08:08 [link] [source]
CSV in a text field is one of the best alternatives to the missing array type. There is a group_concat
aggregate but no split
counterpart. It would be nice to implement one to unnest values into rows.
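To illustrate the asymmetry, here is a sketch; note the split table-valued function shown here is hypothetical and does not exist in SQLite today:

```sql
-- Joining rows into a CSV string already works:
SELECT group_concat(x)
FROM (SELECT 'a' AS x UNION ALL SELECT 'b' UNION ALL SELECT 'c');
-- result: 'a,b,c'

-- The requested inverse has no built-in counterpart:
-- SELECT value FROM split('a,b,c');  -- hypothetical
```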
(2.1) By ddevienne on 2022-05-20 07:09:43 edited from 2.0 in reply to 1 [source]
See also https://sqlite.org/forum/forumpost/5b67d64df1
So far, Richard didn't bite though :(
(3) By midijohnny on 2022-05-20 11:11:12 in reply to 2.1 [link] [source]
INSERT INTO votes(topic,user,comment)
VALUES ('Add split function','me','yes please!');
(4.3) By midijohnny on 2022-05-20 12:28:54 edited from 4.2 in reply to 3 [link] [source]
-- urgh.
-- Lots of ways this will fail.
WITH RECURSIVE
init(text,sep) AS (
VALUES('here are some words and stuff',' ')),
split(i,t,a) AS (
SELECT 0,TRIM(text,sep)||sep,JSON_ARRAY() FROM init
UNION ALL
SELECT
i+1,
SUBSTR(t,INSTR(t,sep)+1),
JSON_INSERT(a,'$[#]',SUBSTR(t,1,INSTR(t,sep)-1))
FROM split JOIN init
WHERE INSTR(t,sep)>0
LIMIT 100)
SELECT a FROM split WHERE i=(SELECT MAX(i) FROM split)
a
["here","are","some","words","and","stuff"]
EDIT: there are better implementations already on the original post. EDIT2: The OP here wanted ROWS not an array.
(5) By midijohnny on 2022-05-20 12:26:58 in reply to 3 [link] [source]
I wonder if some of the existing FTS5 tokenizer functionality could be leveraged to do the splitting? (i.e. perhaps expose some of its workings as a function, dunno)
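For what it's worth, the FTS5 tokenizer can already be reached indirectly through an fts5vocab table, though it lowercases and strips punctuation, so this is only a rough sketch of the idea, not a general split:

```sql
CREATE VIRTUAL TABLE doc USING fts5(body, tokenize='unicode61');
INSERT INTO doc VALUES ('here are some words and stuff');
-- the 'instance' vocab table exposes one row per token occurrence
CREATE VIRTUAL TABLE doc_vocab USING fts5vocab(doc, 'instance');
SELECT term FROM doc_vocab ORDER BY offset;
-- rows: here, are, some, words, and, stuff
```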
(6.1) By midijohnny on 2022-05-20 12:36:38 edited from 6.0 in reply to 1 [link] [source]
Be aware that sqlite3 does have an array type - it comes with the JSON functions.
So if you were to convert your 'csv' text into a JSON_ARRAY, you can then manipulate it with the JSON functions - in this case probably the json_extract function.
In fact, you are no longer limited to arrays either - you can have objects, nested structures, etc. - whatever JSON can do.
It would still be nice to have a 'split' function though I think.
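A small sketch of that approach with the built-in JSON1 functions, where the 'array' is just ordinary text:

```sql
-- pick out one element by index
SELECT json_extract('["a","b","c"]', '$[1]');   -- 'b'
-- or unnest the whole array into rows
SELECT value FROM json_each('["a","b","c"]');   -- 'a', 'b', 'c'
```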
(7.1) By ddevienne on 2022-05-20 15:15:03 edited from 7.0 in reply to 6.1 [link] [source]
sqlite3 does have an array type
That's an exaggeration :). There are only 5 types: null, integer, real, text, and blob.
The same way SQLite does NOT have date, time, datetime, or timestamp types, but functions to manipulate the text or numeric values as such.
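For example, the date/time functions interpret plain text storage on the fly:

```sql
SELECT typeof('2022-05-20');            -- 'text', not a date type
SELECT date('2022-05-20', '+1 day');    -- '2022-05-21'
```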
By the same token, SQLite's JSON1 extension allows one to manipulate text values as JSON, including arrays and objects. But that doesn't make it a true array type.
The closest SQLite has to custom types, beyond the 5 basic ones, are subtypes and pointer types. And those are purely transient, preserved mainly between nested function invocations, AFAIK.
Storage-wise, it's the 5 basic types, and nothing else. --DD
(10) By midijohnny on 2022-05-20 14:13:22 in reply to 7.0 [link] [source]
Yup - I got that wrong - there is no 'type' of array etc. I just meant that there are ways of handling arrays (json arrays) I guess. (As opposed to dealing with pure strings)
Point taken!
(8) By mgr (mgrmgr) on 2022-05-20 13:41:07 in reply to 6.1 [link] [source]
As I wrote here, in the meantime you could convert your string into a JSON array like
'['||replace(json_quote(your_string),',','","')||']'
and use e.g. json_each to split it into rows:
sqlite> select value
...> from json_each('['||replace(json_quote('hello,world'),',','","')||']')
...> order by key;
hello
world
(9) By ddevienne on 2022-05-20 13:46:28 in reply to 8 [link] [source]
from json_each('['||replace(json_quote('hello,world'),',','","')||']')
Yeah, contrast that with from split('hello,world') :)
That's exactly why we need a split
table-valued function!
Yours is a great tip, don't get me wrong. We just need something better and simpler.
(11) By mgr (mgrmgr) on 2022-05-20 15:35:36 in reply to 9 [link] [source]
Indeed - that's why I wrote in the meantime :-)
And I prefer this over the recursive CTE with a bunch of instr & substr and all the bookkeeping of indexes. Here, it's just an (ugly) chain of function calls and string concatenation.
Given the complete set of "joiners" and "splitters" (json_array & json_extract for columns, json_group_array & json_each for rows) in the JSON extension, I tend to store any array as JSON for the time being.
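As a sketch, the row-wise pair round-trips cleanly: rows in, JSON array, rows back out:

```sql
WITH items(x) AS (VALUES ('a'),('b'),('c')),
     packed(j) AS (SELECT json_group_array(x) FROM items)
SELECT value FROM packed, json_each(j);
-- rows: 'a', 'b', 'c'
```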
(12) By Keith Medcalf (kmedcalf) on 2022-05-20 16:28:02 in reply to 9 [link] [source]
This will work (using the previously mentioned statement_vtab) (see this post)
drop table if exists split;
create virtual table split using statement((
with tokens(entry, data, sep, valid) as
(
select null,
:data,
:sep,
0
union all
select substring(data, 1, instr(data, sep) - 1),
substring(data, instr(data, sep) + 1),
sep,
1
from tokens
where instr(data, sep) > 0
union all
select data,
null,
sep,
1
from tokens
where instr(data, sep) < 1
)
select entry
from tokens
where valid == 1
));
which will let you do, for example:
sqlite> select entry from split('this is a list of space separated tokens', ' ');
┌─────────────┐
│ entry │
├─────────────┤
│ 'this' │
│ 'is' │
│ 'a' │
│ 'list' │
│ 'of' │
│ 'space' │
│ 'separated' │
│ 'tokens' │
└─────────────┘
Not exactly builtin, but seamless nonetheless.
(13) By Keith Medcalf (kmedcalf) on 2022-05-20 16:39:55 in reply to 12 [link] [source]
Note that if you combine this with my ToBestType function, you can do the following:
drop table if exists split;
create virtual table if not exists split using statement((
with tokens(entry, data, sep, valid) as
(
select null,
:data,
:sep,
0
union all
select substring(data, 1, instr(data, sep) - 1),
substring(data, instr(data, sep) + 1),
sep,
1
from tokens
where instr(data, sep) > 0
union all
select data,
null,
sep,
1
from tokens
where instr(data, sep) < 1
)
select ToBestType(entry) as entry
from tokens
where valid == 1
));
so that you can get, for example:
sqlite> select entry, typeof(entry) from split('this,is,a,list,15,of comma,,separated,7.2,tokens', ',');
┌─────────────┬───────────────┐
│ entry │ typeof(entry) │
├─────────────┼───────────────┤
│ 'this' │ 'text' │
│ 'is' │ 'text' │
│ 'a' │ 'text' │
│ 'list' │ 'text' │
│ 15 │ 'integer' │
│ 'of comma' │ 'text' │
│ NULL │ 'null' │
│ 'separated' │ 'text' │
│ 7.2 │ 'real' │
│ 'tokens' │ 'text' │
└─────────────┴───────────────┘
(14) By Keith Medcalf (kmedcalf) on 2022-05-20 17:59:53 in reply to 12 [link] [source]
This version seems to generate the best execution plan in that it avoids repeated redundant calls to the same function; rather, it uses separate branches in the recursion to perform the calculations only once. This uses more ephemeral table space (about twice as much), but less CPU (about half as much).
drop table if exists split;
create virtual table if not exists split using statement((
with recursive
input(data, sep) as
(
values (:data, :sep)
),
tokens(token, data, sep, pos, isValid) as
(
select null,
data,
sep,
instr(data, sep),
false
from input
union all
select substr(data, 1, pos - 1),
substr(data, pos + 1),
sep,
-1,
true
from tokens
where pos > 0
union all
select null,
data,
sep,
instr(data, sep),
false
from tokens
where pos < 0
union all
select data,
null,
sep,
null,
true
from tokens
where pos == 0
)
select ToBestType(token) as value
from tokens
where isValid
));
(15) By Keith Medcalf (kmedcalf) on 2022-05-20 20:28:41 in reply to 14 [link] [source]
Note that if you change values (:data, :sep) to values (:data, coalesce(:sep, ',')), the split is the opposite of group_concat (in that the separator, if unspecified, is ','):
select value
from split((
select group_concat(value)
from generate_series
where start=1
and stop=10
));
┌───────┐
│ value │
├───────┤
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
│ 10 │
└───────┘
(16) By Keith Medcalf (kmedcalf) on 2022-05-20 21:49:11 in reply to 9 [link] [source]
Then there is this way to do it, which does not use recursion. Instead, it scans the data for separators and then extracts a row for each one.
with inputs(data, lendata, sep, lensep) as
(
select data || sep,
length(data || sep),
sep,
length(sep)
from (
select :data as data,
coalesce(:sep, ',') as sep
)
),
separators(location) as
(
select 1 - lensep
from inputs
union all
select value
from generate_series
join inputs
where start = 1
and stop = lendata
and substring(data, value, lensep) == sep
),
ranges(start, length) as
(
select location + lensep,
lead(location) over () - location - lensep
from separators, inputs
)
select ToBestType(substring(data, start, length)) as value
from ranges, inputs
where length is not null
(17) By Alex Garcia (alexgarciaxyz) on 2022-06-22 15:52:49 in reply to 1 [link] [source]
In case anyone is looking for a 3rd party solution to this: sqlite-lines can mostly do this, is MIT licensed, and you can download pre-compiled loadable binaries (or build yourself).
https://github.com/asg017/sqlite-lines
.load ./lines0
-- lines_read will read each line from a file
select rowid, line
from lines_read('file.txt');
-- lines() works on in-memory strings and blobs
select rowid, line
from lines('a
b
c');
/*
1,a
2,b
3,c
*/
-- use a different delimiter, 1-character only
select rowid, line
from lines('a|b|c', '|');
/*
1,a
2,b
3,c
*/
No Windows support, since getdelim() isn't available. Happy to take contributions, however!