SQLite User Forum

Proposed change to allow null values on csv import
Login

Proposed change to allow null values on csv import

(1) By doug (doug9forester) on 2022-02-04 19:58:57 [link] [source]

I propose the following change to the CLI for the .import command:

.import --empty-as-null ...
This option will set a NULL value into the field if the input field is empty

I have a patch I can give you, but I haven't (yet) signed an agreement with SQLite.org. I haven't tested the patch because I have no idea how to build SQLite on Windows 10 (64-bit).

(2) By Larry Brasfield (larrybr) on 2022-02-04 21:05:08 in reply to 1 [link] [source]

You can get some idea of how to build here.

I presume you are aware that you can .import into an existing table which can have triggers to do whatever input transformation your heart desires, as long as it is done on a per-row basis. That allows more flexibility than forcing all columns to be treated alike.

(3) By Simon Slavin (slavin) on 2022-02-04 22:34:41 in reply to 1 [link] [source]

Someone else did it last time, so it's my turn.

The CSV format is defined in RFC4180:

https://datatracker.ietf.org/doc/html/rfc4180

There's no standard in CSV files for a NULL value. Every value in a CSV file is a string. You can have zero-length strings. Spaces or tabs either side of a comma or linebreak are part of the value. Double-quotes need not appear, but if one is used to start the string, one must be used to end the string. So a CSV file containing

aaa,,"ccc" aaa,NUL,"ccc" aaa,NULL,"ccc" 123,,456 123,0,"4 6"

provides five rows of three strings, the second string on the first line being a zero-length string.

By all means establish your own file format, which includes a way to encode a NUL/NULL value. Call it CSVN or something. But it's not CSV, and any file importer which calls itself 'csv' or refers to 'csv format' should not understand NUL or ',,' that way.

Why is this important ? Some day, someone else may have to understand your program. Might be another programmer. Might be someone auditing the company for fraud. Might be someone valiantly trying to rescue some horribly corrupt data files. If I see some code that makes me think 'CSV importer needed here' I should be able to use what I know about CSV format, or plug in any CSV importer I'm familiar with.

(4) By Keith Medcalf (kmedcalf) on 2022-02-04 23:04:50 in reply to 1 [source]

My VSV (variably-separated-values) extension does that -- it is an extension of the SQLite3 maintained CSV extension.

http://www.dessus.com/files/vsv.c

You would need to compile the extension and load it into the connection for it to work.

(5) By anonymous on 2022-02-05 07:55:27 in reply to 4 [link] [source]

Yes, use Keith’s extension.

I don’t exactly know how the virtual table awesomeness works but I love them and I love this extension.

Thank you, Keith, for creating it and sharing it.

(6) By anonymous on 2022-02-05 23:08:59 in reply to 2 [link] [source]

I presume you are aware that you can .import into an existing table which can have triggers to do whatever input transformation your heart desires

This is also possible for views as well. You can import into a view, with triggers to process each row in order. (I had done this, before.)

(7) By Heribert (derhexer) on 2022-02-06 21:52:07 in reply to 4 [link] [source]

I think there is an error in vsv.c.
If i use only filename and schema for vsv, i cannot access to the virtual table.

e.g. using vsv(filename='C:\tmp\test.csv',schema='CREATE TABLE x(Name)')


I found the reason in vsvtabConnect. If column count is not set -> nCol == -99.


static int vsvtabConnect(
                        sqlite3 *db,
                        void *pAux,
                        int argc, const char *const*argv,
                        sqlite3_vtab **ppVtab,
                        char **pzErr
                        )

    if (VSV_SCHEMA==0)
    {
	...
    }
    else if (nCol<0)
    {
	nCol = 0; //nCol needs to be set
        do
        {
            vsv_read_one_field(&sRdr);
            nCol++;
        }
        while (sRdr.cTerm==sRdr.fsep);
    }

(8) By Keith Medcalf (kmedcalf) on 2022-02-06 23:48:56 in reply to 7 [link] [source]

Read the instructions.

If you are using the schema= then you must also specify the number of columns because the number of columns is never parsed from the provided SQL statement.

(9.1) By Heribert (derhexer) on 2022-02-07 05:39:16 edited from 9.0 in reply to 8 [link] [source]

You are right. I didn`t read that. But in this case (missing column count when using a schema) it would be nice, if vcv results with an appropriate error message and did not count fields.

(10.1) By Keith Medcalf (kmedcalf) on 2022-02-07 06:54:00 edited from 10.0 in reply to 9.1 [link] [source]

Actually, it should count the number of fields in the first line (which will be the header line if headers=1) or the first skipped row (if headers is left at 0 and skip is greater than the default of 0).

The fix to make it work like it was supposed to work (it is a fencepost error, of a sort) -- Thank-you, this is the only path that has this error.

Index: ext/private/vsv.c
==================================================================
--- ext/private/vsv.c
+++ ext/private/vsv.c
@@ -1168,10 +1168,11 @@
             goto vsvtab_connect_oom;
         }
     }
     else if (nCol<0)
     {
+        nCol = 0;
         do
         {
             vsv_read_one_field(&sRdr);
             nCol++;
         }

Edited to add a thanks and the error has been fixed.

(11.1) By Mark Lawrence (mark) on 2022-08-18 09:37:46 edited from 11.0 in reply to 1 [link] [source]

I would like to repeat (or +1) the request for this feature to be added to the SQLite CLI. The workarounds proposed involve quite some effort.

Compiling extensions for non-c developers on each upgrade involves repeated web searches and "how do I load this thing automatically?" and "what is that virtual table syntax again?" and I end up just not going there.

The other alternative requires some heavy boilerplate. I don't want to be manually creating tables and triggers for temporary imports of 10s of columns. The first feature I lose is the great and recently-reworked auto-naming.

It is clear that there is a mismatch between CSV and SQL. Since I can't modify the input in any way to ease crossing that boundary it would be nice if the CLI assisted.

[Edited to change "import" to "input" in last paragraph]

(12) By Sunny Saini (SunnySaini_com) on 2022-08-19 02:16:55 in reply to 4 [link] [source]

Is it possible to load an extension in Android phones using Apps like aSQLiteManager etc.?

(13) By anonymous on 2022-08-19 12:39:21 in reply to 11.1 [link] [source]

The other alternative

No, there's almost always more than one alternative, and in this case, it doesn't have to be heavy boilerplate.

Maybe this has been suggested before, but how about a dead simple UPDATE statement, one per column, that replaces '' with NULL? Or a single VIEW that does the same thing? Both are simple and dead obvious what they do.

(14) By CharlesNepote on 2023-01-25 09:15:05 in reply to 13 [link] [source]

how about a dead simple UPDATE statement, one per column, that replaces '' with NULL?

Yes it's dead simple and I'm doing this every day. But it takes 50+ minutes for my whole database, which is very costly and annoying for my data processing chain.

I understand our DB is not common: 10GB+, ~200 columns and 2.7 millions lines. But .import --empty-as-null would be nice for us.

Here is the code I use to convert empty to NULL:

time sqlite3 products_new.db ".schema all" | \
  sed -nr 's/.*\[(.*)\] (INTEGER|FLOAT).*/\1/gp' | \
  xargs -I % sqlite3 products_new.db -cmd \
    "SELECT 'Convert empty to NULL for [%]';" \
    "PRAGMA journal_mode=OFF;" \
    "PRAGMA synchronous=0;" \
    "PRAGMA locking_mode=EXCLUSIVE;" \
    "PRAGMA temp_store=MEMORY;" \
    "PRAGMA page_size = 32768;" \
    "UPDATE [all] SET [%] = NULL WHERE [%] = '';"

Maybe there are better things to do...

(15) By Kees Nuyt (knu) on 2023-01-25 15:33:44 in reply to 14 [link] [source]

Another way to generate an empty2null script:

#!/bin/bash

function empty2null() {
	for t in $(
sqlite3 "$db" "SELECT name FROM sqlite_master
WHERE type=='table' ORDER BY name" )
	do
		nc=0
		cma="UPDATE \"${t}\" SET
 "
		for c in $( 
printf "SELECT '\"'||name||'\"' FROM pragma_table_info('%s')
WHERE type == 'TEXT' ORDER BY cid ASC;\\n" "$t" \
			| sqlite3 "$db" )
		do
#			printf "%s %s = CASE %s WHEN '' THEN NULL ELSE %s END" "$cma" "$c" "$c" "$c"
			printf "%s %s = IIF(%s=='',NULL,%s)" "$cma" "$c" "$c" "$c"
			cma='
,'
			nc=$(( nc + 1 ))
		done
		test $nc -gt 0 && printf ';\n'
	done
}

Adding a WHERE clause to restrict the update to rows that actually contain '' in a TEXT column is left as an exercise for the hobbyist.

-- 
Regards,
Kees Nuyt

(16) By Kees Nuyt (knu) on 2023-01-25 18:28:41 in reply to 15 [link] [source]

Ok then, the optimized version:

#!/bin/bash
function empty2null() {
  for t in $(
sqlite3 "$db" "SELECT name FROM sqlite_master
WHERE type=='table' ORDER BY name" | tee ${teefile} )
  do
    nc=0
    onec=''
    mulc=''
    where='WHERE'
    cma="UPDATE OR ABORT \"${t}\" SET
 "
    for c in $( 
printf "SELECT '\"'||name||'\"' FROM pragma_table_info('%s')
WHERE type == 'TEXT' ORDER BY cid ASC;\\n" "$t" \
      | sqlite3 "$db" | tee -a ${teefile} )
    do
      onec+=$(printf "%s %s = NULL" "$cma" "$c")
      mulc+=$(printf "%s %s = IIF(%s=='',NULL,%s)" "$cma" "$c" "$c" "$c")
      cma='
,'
      test $nc -gt 0 && where+=' OR'
      where+=" $c==''"
      nc=$(( nc + 1 ))
    done
    if [ $nc -gt 0 ]
    then
      if [ $nc -gt 1 ]
      then
        printf '%s\n  %s;\n' "$mulc" "${where}"
      else
        printf '%s\n  %s;\n' "$onec" "${where}"
      fi
    fi
  done
}
-- 
Enjoy
Kees Nuyt

(17) By Aask (AAsk1902) on 2023-01-25 19:42:36 in reply to 5 [link] [source]

Thank you, Keith, for creating it and sharing it.

I echo this sentiment wholeheartedly.

See also Keith;s 12.1 response.

PS: Why don't the last response in any thread have a [link] hyperlink towards the end of the subject line?