feature request for CSV extension...
(1) By rbucker on 2020-08-19 21:34:40 [link] [source]
I spend a lot of time writing reports from flat datafiles. One thing about the csv.c extension is that it's FAST even when creating a table from the virtual table. My challenge is that I often create TSV in addition to CSV. I've done the batch sed/awk thing in the past but it seems to me that either forking the csv.c for a tsv.c or adding a
sep="" param would make more sense.
As I'm not qualified to write SQLite internals I was hoping for some guidance. Forking seems to add a lot of code where a param requires more test cases... and I may not be prepared for that either.
All that said, some guidance and then we'll see what I can do.
(2) By Keith Medcalf (kmedcalf) on 2020-08-20 00:39:18 in reply to 1 [link] [source]
Take a look at the following modification that I made which embodies several changes from the default csv.c extension module.
/* ** 2020-02-08 modified by Keith Medcalf who also disclaims all copyright ** on the modifications and hereby places this code in the public domain ** ** This file contains the implementation of an SQLite virtual table for ** reading VSV (Variably Separated Values), which are like CSV files, ** but subtly different. VSV supports a number of extensions to the ** CSV format as well as more processing options. ** ** ** Usage: ** ** create virtual table temp.vsv using vsv(...); ** select * from vsv; ** ** The parameters to the vsv module (the vsv(...) part) are as follows: ** ** filename=STRING the filename, passed to the Operating System ** data=STRING alternative data ** columns=N columns parsed from the VSV file ** header=BOOL whether or not a header row is present ** skip=N number of leading data rows to skip ** rsep=STRING record separator ** fsep=STRING field separator ** validatetext=BOOL validate UTF-8 encoding of text fields ** affinity=AFFINITY affinity to apply to each returned value ** nulls=BOOL empty fields are returned as NULL ** ** ** Defaults: ** ** filename / data nothing. You must provide one or the other ** it is an error to provide both or neither ** schema nothing. If not provided then one will be ** generated for you from the header, or if no ** header is available then autogenerated using ** field names manufactured as cX where X is the ** column number ** columns nothing. If not specified then the number of ** columns is determined by counting the fields ** in the first record of the VSV file (which ** will be the header row if header is specified), ** the number of columns is not parsed from the ** schema even if one is provided ** header=no no header row in the VSV file ** skip=0 do not skip any data rows in the VSV file ** fsep=',' default field separator is a comma ** rsep='\n' default record separator is a newline ** validatetext=no do not validate text field encoding ** affinity=none do not apply affinity to each returned value ** nulls=off empty fields returned as zero-length ** ** ** Parameter types: ** ** STRING means a quoted string ** N means a whole number not containing a sign ** BOOL means something that evaluates as true or false ** it is case insensitive ** yes, no, true, false, 1, 0 ** AFFINITY means an SQLite3 type specification ** it is case insensitive ** none, blob, text, integer, real, numeric ** ** STRING means a quoted string. The quote character may be either ** a single quote or a double quote. Two quote characters in a row ** will be replaced with a single quote character. STRINGS do not ** need to be quoted if it is obvious where they begin and end ** (that is, they do not contain a comma). Leading and trailing ** spaces will be trimmed from unquoted strings. ** ** filename =./this/filename.here, ... ** filename =./this/filename.here , ... ** filename = ./this/filename.here, ... ** filename = ./this/filename.here , ... ** filename = './this/filename.here', ... ** filename = "./this/filename.here", ... ** ** are all equivalent. ** ** BOOL defaults to true so the following specifications are all the ** same: ** ** header = true ** header = yes ** header = 1 ** header ** ** ** Specific Parameters: ** ** The platform/compiler/OS fopen call is responsible for interpreting ** the filename. It may contain anything recognized by the OS. ** ** The separator string containing exactly one character, or a valid ** escape sequence. Recognized escape sequences are: ** ** \t horizontal tab, ascii character 9 (0x09) ** \n linefeed, ascii character 10 (0x0a) ** \v vertical tab, ascii character 11 (0x0b) ** \f form feed, ascii character 12 (0x0c) ** \xhh specific byte where hh is hexadecimal ** ** The validatetext setting will cause the validity of the field ** encoding (not its contents) to be verified. It effects how ** fields that are supposed to contain text will be returned to ** the SQLite3 library in order to prevent invalid utf8 data from ** being stored or processed as if it were valid utf8 text. ** ** The nulls option will cause fields that do not contain anything ** to return NULL rather than an empty result. Two separators ** side-by-each with no intervening characters at all will be ** returned as NULL if nulls is true and if nulls is false or ** the contents are explicity empty ("") then a 0 length blob ** (if affinity=blob) or 0 length text string. ** ** For the affinity setting, the following processing is applied to ** each value returned by the VSV virtual table: ** ** none no affinity is applied, all fields will be ** returned as text just like in the original ** csv module, embedded nulls will terminate ** the text. if validatetext is in effect then ** an error will be thrown if the field does ** not contain validly encoded text or contains ** embedded nulls ** ** blob all fields will be returned as blobs ** validatetext has no effect ** ** text all fields will be returned as text just ** like in the original csv module, embedded ** nulls will terminate the text. ** if validatetext is in effect then a blob ** will be returned if the field does not ** contain validly encoded text or the field ** contains embedded nulls ** ** integer if the field data looks like an integer, ** (regex "^ *(\+|-)?\d+ *$"), ** then an integer will be returned as ** provided by the compiler and platform ** runtime strtoll function ** otherwise the field will be processed as ** text as defined above ** ** real if the field data looks like a number, ** (regex "^ *(\+|-)?(\d+\.?\d*|\d*\.?\d+)([eE](\+|-)?\d+)? *$") ** then a double will be returned as ** provided by the compiler and platform ** runtime strtold function otherwise the ** field will be processed as text as ** defined above ** ** numeric if the field looks like an integer ** (see integer above) that integer will be ** returned ** if the field looks like a number ** (see real above) then the number will ** returned as an integer if it has no ** fractional part and ** (a) your platform/compiler supports ** long double and the number will fit in ** a 64-bit integer; or, ** (b) your platform/compiler does not ** support long double (treats it as a double) ** then a 64-bit integer will only be returned ** if the value would fit in a 6-byte varint, ** otherwise a double will be returned ** ** The nulls option will cause fields that do not contain anything ** to return NULL rather than an empty result. Two separators ** side-by-each with no intervening characters at all will be ** returned as NULL if nulls is true and if nulls is false or ** the contents are explicity empty ("") then a 0 length blob ** (if affinity=blob) or 0 length text string will be returned. ** */
A set of various extensions including vsv.c can be found in
(3) By rbucker on 2020-08-20 12:30:15 in reply to 2 [link] [source]
That's awesome... however, I tried to compile it on freebsd 12.1 with the following...
cc -g -fPIC -I /usr/local/include -shared sqliteext/vsc.c -o lib/vsc.so
and received all manner of
C99 warnings. (just 6 warnings).
PS... in the header you list the params but omitted the
schema which you included in the defaults.
(4) By rbucker on 2020-08-20 12:48:03 in reply to 2 [link] [source]
FYI... one of the warnings was
strnicmp. Given that one of the params was a constant string...
sqlite3_stricmp() will work great.
(5) By Keith Medcalf (kmedcalf) on 2020-08-20 18:49:35 in reply to 4 [link] [source]
Yeah, some compilers seem to complain about strnicmp because somewhere along the line the name of the function changed (neither MSVC nor MinGW complain though).
Anyway, I changed the strnicmp to use sqlite3_strnicmp, so that should fix that. These compares are only used during the initial connect phase to parse parameters so it should not make any significant difference, performance wise.
There are two other complaints I get with c99 that are remedied by extra parenthesis and by an explicit cast. So now it compiles on MinGW64 GCC 9.1.0 with -std=c99 -Wall -pedantic producing no warnings.
Note that the majority of the code is Richard's original csv.c extension just with some modifications made to it to add some more processing options ... (plus some syntax changes to make it "look nicer" while I was making the changes and testing them -- each to their own formatting wise).
(6) By rbucker on 2020-08-21 13:43:40 in reply to 5 [link] [source]
I made all those same changes to get it to compile ... so thanks. I'll have to test is from the sqlite CLI to make sure it works... I am trying to link vsv to my golang project but getting an 'out of memory' error and I'm not sure where it's triggering yet.
(7) By rbucker on 2020-08-21 13:49:00 in reply to 5 [link] [source]
just a reply to confirm that the OOM error is coming from sqlite when I tried to access the virtual table.
sqlite> CREATE VIRTUAL TABLE temp.zipcode USING vsv(filename='files/US.tsv', header=no, fsep='\t', schema='create table x(country_code,postal_code,place_name,admin_name1,admin_code1,admin_name2,admin_code2,admin_name3,admin_code3,latitude,longitude,accuracy)'); sqlite> select count(*) cnt from temp.zipcode; Error: out of memory sqlite> select count(*) cnt from zipcode; Error: out of memory
(8) By rbucker on 2020-08-21 13:54:02 in reply to 5 [source]
I think I found the problem... I needed to add the "columns" number to the create table. :(