Add split function
(1) By yaugenka on 2022-05-20 06:08:08 [link] [source]
CSV in a text field is one of the best alternatives to the missing array type. There is a group_concat
aggregate but no split
counterpart. It would be nice to implement one to unnest values into rows.
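To illustrate the asymmetry, here is a sketch; note the split table-valued function shown here is hypothetical and does not exist in SQLite today:

```sql
-- Joining rows into a CSV string already works:
SELECT group_concat(x)
FROM (SELECT 'a' AS x UNION ALL SELECT 'b' UNION ALL SELECT 'c');
-- result: 'a,b,c'

-- The requested inverse has no built-in counterpart:
-- SELECT value FROM split('a,b,c');  -- hypothetical
```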
(2.1) By ddevienne on 2022-05-20 07:09:43 edited from 2.0 in reply to 1 [source]
See also https://sqlite.org/forum/forumpost/5b67d64df1
So far, Richard didn't bite though :(
(3) By midijohnny on 2022-05-20 11:11:12 in reply to 2.1 [link] [source]
INSERT INTO votes(topic,user,comment)
VALUES ('Add split function','me','yes please!');
(4.3) By midijohnny on 2022-05-20 12:28:54 edited from 4.2 in reply to 3 [link] [source]
-- urgh.
-- Lots of ways this will fail.
WITH RECURSIVE
init(text,sep) AS (
VALUES('here are some words and stuff',' ')),
split(i,t,a) AS (
SELECT 0,TRIM(text,sep)||sep,JSON_ARRAY() FROM init
UNION ALL
SELECT
i+1,
SUBSTR(t,INSTR(t,sep)+1),
JSON_INSERT(a,'$[#]',SUBSTR(t,1,INSTR(t,sep)-1))
FROM split JOIN init
WHERE INSTR(t,sep)>0
LIMIT 100)
SELECT a FROM split WHERE i=(SELECT MAX(i) FROM split)
a
["here","are","some","words","and","stuff"]
EDIT: there are better implementations already on the original post. EDIT2: The OP here wanted ROWS not an array.
(5) By midijohnny on 2022-05-20 12:26:58 in reply to 3 [link] [source]
I wonder if some of the existing FTS5 tokenizer functionality could be leveraged to do the splitting? (i.e. perhaps expose some of its workings as a function, dunno)
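For what it's worth, the FTS5 tokenizer can already be reached indirectly through an fts5vocab table, though it lowercases and strips punctuation, so this is only a rough sketch of the idea, not a general split:

```sql
CREATE VIRTUAL TABLE doc USING fts5(body, tokenize='unicode61');
INSERT INTO doc VALUES ('here are some words and stuff');
-- the 'instance' vocab table exposes one row per token occurrence
CREATE VIRTUAL TABLE doc_vocab USING fts5vocab(doc, 'instance');
SELECT term FROM doc_vocab ORDER BY offset;
-- rows: here, are, some, words, and, stuff
```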
(6.1) By midijohnny on 2022-05-20 12:36:38 edited from 6.0 in reply to 1 [link] [source]
Be aware that sqlite3 does have an array type - it comes with the JSON functions.
So if you were to convert your 'csv' text into a JSON_ARRAY, you can then manipulate it with the JSON functions - in this case probably the json_extract function.
In fact, you are no longer limited to arrays either - you can have objects, nested structures, etc. - whatever JSON can do.
It would still be nice to have a 'split' function though I think.
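A small sketch of that approach with the built-in JSON1 functions, where the 'array' is just ordinary text:

```sql
-- pick out one element by index
SELECT json_extract('["a","b","c"]', '$[1]');   -- 'b'
-- or unnest the whole array into rows
SELECT value FROM json_each('["a","b","c"]');   -- 'a', 'b', 'c'
```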
(7.1) By ddevienne on 2022-05-20 15:15:03 edited from 7.0 in reply to 6.1 [link] [source]
sqlite3 does have an array type
That's an exaggeration :). There are only 5 types: null, integer, real, text, and blob.
The same way SQLite does NOT have date, time, datetime, or timestamp types, but functions to manipulate the text or numeric values as such.
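For example, the date/time functions interpret plain text storage on the fly:

```sql
SELECT typeof('2022-05-20');            -- 'text', not a date type
SELECT date('2022-05-20', '+1 day');    -- '2022-05-21'
```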
By the same token, SQLite's JSON1 extension allows one to manipulate text values as JSON, including arrays and objects. But that doesn't make it a true array type.
The closest SQLite has to custom types, beyond the 5 basic ones, are subtypes and pointer types. And those are purely transient, preserved mainly between nested function invocations, AFAIK.
Storage-wise, it's the 5 basic types, and nothing else. --DD
(10) By midijohnny on 2022-05-20 14:13:22 in reply to 7.0 [link] [source]
Yup - I got that wrong - there is no 'type' of array etc. I just meant that there are ways of handling arrays (json arrays) I guess. (As opposed to dealing with pure strings)
Point taken!
(8) By mgr (mgrmgr) on 2022-05-20 13:41:07 in reply to 6.1 [link] [source]
As I wrote here, in the meantime you could convert your string into a JSON array like
'['||replace(json_quote(your_string),',','","')||']'
and use e.g. json_each to split it into rows:
sqlite> select value
...> from json_each('['||replace(json_quote('hello,world'),',','","')||']')
...> order by key;
hello
world
(9) By ddevienne on 2022-05-20 13:46:28 in reply to 8 [link] [source]
from json_each('['||replace(json_quote('hello,world'),',','","')||']')
Yeah, contrast that with from split('hello,world') :)
That's exactly why we need a split
table-valued function!
Yours is a great tip, don't get me wrong. We just need something better and simpler.
(11) By mgr (mgrmgr) on 2022-05-20 15:35:36 in reply to 9 [link] [source]
Indeed - that's why I wrote in the meantime :-)
And I prefer this over the recursive CTE with a bunch of instr & substr and all the bookkeeping of indexes. Here, it's just an (ugly) chain of function calls and string concatenation.
Given the complete set of "joiners" and "splitters" (json_array & json_extract for columns, json_group_array & json_each for rows) in the JSON extension, I tend to store any array as JSON for the time being.
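As a sketch, the row-wise pair round-trips cleanly: rows in, JSON array, rows back out:

```sql
WITH items(x) AS (VALUES ('a'),('b'),('c')),
     packed(j) AS (SELECT json_group_array(x) FROM items)
SELECT value FROM packed, json_each(j);
-- rows: 'a', 'b', 'c'
```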
(12) By Keith Medcalf (kmedcalf) on 2022-05-20 16:28:02 in reply to 9 [link] [source]
This will work (using the previously mentioned statement_vtab) (see this post)
drop table if exists split;
create virtual table split using statement((
with tokens(entry, data, sep, valid) as
(
select null,
:data,
:sep,
0
union all
select substring(data, 1, instr(data, sep) - 1),
substring(data, instr(data, sep) + 1),
sep,
1
from tokens
where instr(data, sep) > 0
union all
select data,
null,
sep,
1
from tokens
where instr(data, sep) < 1
)
select entry
from tokens
where valid == 1
));
which will let you do, for example:
sqlite> select entry from split('this is a list of space separated tokens', ' ');
┌─────────────┐
│ entry │
├─────────────┤
│ 'this' │
│ 'is' │
│ 'a' │
│ 'list' │
│ 'of' │
│ 'space' │
│ 'separated' │
│ 'tokens' │
└─────────────┘
Not exactly builtin, but seamless nonetheless.
(13) By Keith Medcalf (kmedcalf) on 2022-05-20 16:39:55 in reply to 12 [link] [source]
Note that if you combine this with my ToBestType function, you can do the following:
drop table if exists split;
create virtual table if not exists split using statement((
with tokens(entry, data, sep, valid) as
(
select null,
:data,
:sep,
0
union all
select substring(data, 1, instr(data, sep) - 1),
substring(data, instr(data, sep) + 1),
sep,
1
from tokens
where instr(data, sep) > 0
union all
select data,
null,
sep,
1
from tokens
where instr(data, sep) < 1
)
select ToBestType(entry) as entry
from tokens
where valid == 1
));
so that you can get, for example:
sqlite> select entry, typeof(entry) from split('this,is,a,list,15,of comma,,separated,7.2,tokens', ',');
┌─────────────┬───────────────┐
│ entry │ typeof(entry) │
├─────────────┼───────────────┤
│ 'this' │ 'text' │
│ 'is' │ 'text' │
│ 'a' │ 'text' │
│ 'list' │ 'text' │
│ 15 │ 'integer' │
│ 'of comma' │ 'text' │
│ NULL │ 'null' │
│ 'separated' │ 'text' │
│ 7.2 │ 'real' │
│ 'tokens' │ 'text' │
└─────────────┴───────────────┘
(14) By Keith Medcalf (kmedcalf) on 2022-05-20 17:59:53 in reply to 12 [link] [source]
This version seems to generate the best execution plan in that it avoids repeated redundant calls to the same function; rather, it uses separate branches in the recursion to perform the calculations only once. This uses more ephemeral table space (about twice as much), but less CPU (about half as much).
drop table if exists split;
create virtual table if not exists split using statement((
with recursive
input(data, sep) as
(
values (:data, :sep)
),
tokens(token, data, sep, pos, isValid) as
(
select null,
data,
sep,
instr(data, sep),
false
from input
union all
select substr(data, 1, pos - 1),
substr(data, pos + 1),
sep,
-1,
true
from tokens
where pos > 0
union all
select null,
data,
sep,
instr(data, sep),
false
from tokens
where pos < 0
union all
select data,
null,
sep,
null,
true
from tokens
where pos == 0
)
select ToBestType(token) as value
from tokens
where isValid
));
(15) By Keith Medcalf (kmedcalf) on 2022-05-20 20:28:41 in reply to 14 [link] [source]
Note that if you change values (:data, :sep) to values (:data, coalesce(:sep, ',')), the split is the opposite of group_concat (in that the separator, if unspecified, is ','):
select value
from split((
select group_concat(value)
from generate_series
where start=1
and stop=10
));
┌───────┐
│ value │
├───────┤
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
│ 6 │
│ 7 │
│ 8 │
│ 9 │
│ 10 │
└───────┘
(16) By Keith Medcalf (kmedcalf) on 2022-05-20 21:49:11 in reply to 9 [link] [source]
Then there is this way to do it, which does not use recursion. Instead, it scans the data for separators and then extracts a row for each one.
with inputs(data, lendata, sep, lensep) as
(
select data || sep,
length(data || sep),
sep,
length(sep)
from (
select :data as data,
coalesce(:sep, ',') as sep
)
),
separators(location) as
(
select 1 - lensep
from inputs
union all
select value
from generate_series
join inputs
where start = 1
and stop = lendata
and substring(data, value, lensep) == sep
),
ranges(start, length) as
(
select location + lensep,
lead(location) over () - location - lensep
from separators, inputs
)
select ToBestType(substring(data, start, length)) as value
from ranges, inputs
where length is not null
(17) By Alex Garcia (alexgarciaxyz) on 2022-06-22 15:52:49 in reply to 1 [link] [source]
In case anyone is looking for a 3rd party solution to this: sqlite-lines can mostly do this, is MIT licensed, and you can download pre-compiled loadable binaries (or build yourself).
https://github.com/asg017/sqlite-lines
.load ./lines0
-- lines_read will read each line from a file
select rowid, line
from lines_read('file.txt');
-- lines() works on in-memory strings and blobs
select rowid, line
from lines('a
b
c');
/*
1,a
2,b
3,c
*/
-- use a different delimiter, 1-character only
select rowid, line
from lines('a|b|c', '|');
/*
1,a
2,b
3,c
*/
No Windows support, since getdelim() isn't available. Happy to take contributions, however!