Inconsistent treatment of empty item on csv input
(1) By Harald Hanche-Olsen (hanche) on 2023-09-11 14:27:59 [link] [source]
It is very common for CSV files to not end in a newline. If it doesn't, and the final item on the final line is empty, one gets a NULL instead of the expected empty string.
Demo:
⬥ cat tt.csv
x,y
a,
b,←
The arrow ← is NOT part of the file: I put it there to indicate the end of the input file. So the very last character of the input file is a comma.
sqlite> .nullvalue NULL
sqlite> .import -csv tt.csv tt
sqlite> select * from tt;
a|
b|NULL
sqlite>
That NULL came as a surprise to me. It does not seem right.
(2.2) By Larry Brasfield (larrybr) on 2023-09-12 23:24:55 edited from 2.1 in reply to 1 [link] [source]
> It is very common for CSV files to not end in a newline. If it doesn't, and the final item on the final line is empty, one gets a NULL instead of the expected empty string.
RFC 4180 is pretty clear on this. (It says "The last record in the file may or may not have an ending line break." and nothing more, so the presence or absence of that final line break carries no meaning.) Hence, I believe that this inconsistency should be considered to be a bug.
> It does not seem right.
Agreed. Thanks for the report. It will be fixed soon, likely this week. Update: this has been fixed on trunk. (It may or may not make the next release.)
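For concreteness, with that fix the demo from post 1 should behave like this (expected output, not a captured session):

sqlite> .nullvalue NULL
sqlite> .import -csv tt.csv tt
sqlite> select * from tt;
a|
b|
sqlite>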
(3.1) By Aask (AAsk1902) on 2023-09-11 15:02:16 edited from 3.0 in reply to 1 [link] [source]
> That NULL came as a surprise to me.
Why?
- It should not be a surprise as, unless otherwise specified, the default value of a column is null.
- Your CSV does not supply a value; therefore, the value defaults to null.
A simulation of your scenario:
Script:
create table tblNew (col1, col2);
insert into tblNew (col1) values('sqlite'); /* No values specified for col2 */
.nullvalue null
select * from tblNew; /* The default value of a column value is null */
Result:
col1 col2
------ ----
sqlite null
(4) By Larry Brasfield (larrybr) on 2023-09-11 15:09:36 in reply to 3.0 [link] [source]
> Why?
> It should not be a surprise as, unless otherwise specified, the default value of a column is null.
> Your CSV does not supply a value; therefore, the value defaults to null.
It surprises Harald because it is incorrect (and I suppose he has come to think of .import's CSV handling as robust and correct).
As I mentioned, according to the CSV standard, the last record is allowed to be terminated by either a newline or end-of-file. Harald's example properly specified all column values for the last record; the final value was empty, not missing.
Your simulation script shows something else altogether.
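To spell out the distinction on a hypothetical two-column record (the notation and example are mine, just an illustration):

b,     two fields: 'b' and ''    -- the second value is present but empty
b      one field:  'b'           -- the second value is missing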
(5.1) By Aask (AAsk1902) on 2023-09-11 16:03:28 edited from 5.0 in reply to 4 [link] [source]
> the final value was empty, not missing.
empty implies that a value was specified as empty; I believe the value was missing.
However, for conformance with the CSV specification, SQLite may well treat missing as empty.
(6) By Larry Brasfield (larrybr) on 2023-09-11 17:25:18 in reply to 5.1 [link] [source]
> empty implies that a value was specified as empty; I believe the value was missing.
While this is a contender for being one of the least consequential topics of debate (along with "How many angels can dance on the head of a pin?"), I think RFC 4180 is clear on this issue. It says "The last field in the record must not be followed by a comma." Thus, in content purported to conform to that convention, when the last record has a comma as its last character, we are required to treat the remainder of the record as the content of its last field. In Harald's example, that would be an empty sequence of characters. That's the way all preceding such records would be treated (as his example demonstrated), so it would be inconsistent to treat only the last one specially.
While thinking this through, I realized that there is a set of cases where the RFC 4180 rule allowing end-of-file to take the place of a final record terminator has to make a difference after all. CSV is perfectly capable of representing a sequence of records containing only one field each. For such a CSV instance, the final record terminator is required if the only field of the last record is empty.
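To make both points concrete (the \n notation and these examples are mine, not from the RFC):

x,y\na,\nb,   last record is ("b", "")   -- trailing comma, so the last field is empty, not missing
a\nb          two one-field records: ("a"), ("b")
a\n\n         two one-field records: ("a"), ("")   -- here the trailing terminator is required
a\n           one one-field record: ("a")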
(7) By Aask (AAsk1902) on 2023-09-11 18:00:50 in reply to 6 [link] [source]
> How many angels can dance on the head of a pin?
Usually 42, but might be variable depending on the radius of the pin head.
(8) By Aask (AAsk1902) on 2023-09-13 06:34:22 in reply to 2.2 [link] [source]
> It may or may not make the next release.
Is the break in backward compatibility the issue that might prevent its inclusion in the next release?
(9) By Harald Hanche-Olsen (hanche) on 2023-09-13 06:37:01 in reply to 8 [link] [source]
I can't speak for Larry of course, but I don't think that backward compatibility is as important for the CLI as it is for the core library. Far from it, indeed.
(10.1) By Aask (AAsk1902) on 2023-09-13 06:51:41 edited from 10.0 in reply to 9 [link] [source]
> but I don't think that backward compatibility is as important for the CLI as it is for the core library.
... then there is the CSV extension ... and the related VSV extension ... and the XSV extension
(11) By Ryan Smith (cuz) on 2023-09-13 06:57:11 in reply to 8 [link] [source]
To add to what Larry and Harald said: this kind of change should happen even if it breaks backward usage, because whoever relied on the old behavior either had input that did not conform, or read a wrongly set NULL value in a last record (probably mostly unnoticed), or had input to which they added a trailing CRLF to make it work, in which case the fix has no adverse effect.
Nothing bad can come of it in terms of backward compatibility, and while it's hard for me to propose an import scenario that would actually break when the fix happens, if such a scenario is possible it has to be due to terrible design on the programmer's part, and breaking his/her import would probably be a good thing.
[Clarification: this is not typically true, nor something I advocate in general; it applies just in this specific case, since good CSV data (as any proper engineering effort would produce) has a very small chance of being affected by the bug, and zero chance of being affected by the fix. Bad CSV data, on the other hand, has only a very small chance of being affected by either the bug or the fix, but if it was bad data anyway, failing as soon as possible is a good thing.]
(12.2) By Aask (AAsk1902) on 2023-09-13 10:25:27 edited from 12.1 in reply to 11 [link] [source]
> Nothing bad can come of it in terms of backward compatibility
... except inconsistency with the programmatic counterpart of the .import command (if .import were fixed and csv.c wasn't).
SQLite version 3.43.0 2023-08-24 12:36:59
Enter ".help" for usage hints.
sqlite> drop table if exists temp.t1;
sqlite> .shell type "e:\\temp\\csv.csv"
x,y
a,
b,sqlite> /* note: no line feed at the end of the file, i.e. the 'sqlite>' prompt appears right after the file content */
sqlite> .nullvalue null
sqlite> .load 'D:/SQLite32/Extensions/CSV.dll'
sqlite> CREATE VIRTUAL TABLE temp.t1 USING csv(filename='e:/temp/csv.csv', header=1);
sqlite> SELECT * FROM t1;
x y
- ----
a
b null
sqlite>
(13) By Chris Locke (chrisjlocke1) on 2023-09-17 11:09:06 in reply to 3.1 [source]
> Your CSV does not supply a value; therefore, the value defaults to null
But the row starting with 'a' didn't have a value for the second field either ... yet that wasn't null.
So it would be a surprise if one missing entry wasn't a null and the other missing entry was.
Or are you implying that inconsistencies are a good thing?
(14) By Harald Hanche-Olsen (hanche) on 2023-09-17 12:22:35 in reply to 13 [link] [source]
To be pedantic, CSV does not have a concept of a “missing value”. If it did, that would certainly correspond to NULL in the SQL sense. In my example, the second field did have a specified value, namely the empty string.
That said, there certainly are plenty of scenarios where an empty string does denote a missing value. It's just not part of the CSV specification. To support this case, it would be handy to have an option to .import --csv indicating that empty strings should be stored as NULL instead. Or even better, perhaps, the option could have the form --nullvalue STR, indicating that any occurrence of STR in the input be stored as NULL. As an escape, one could still read a quoted instance of STR as a string.
But this adds a level of complexity that I imagine the maintainers are not interested in. And I don't blame them, even though I could use such a feature myself. (If implemented, a corresponding feature should be added to the csv output format).
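In the meantime, a workaround is to import as-is and normalize afterwards; a sketch against the demo table from post 1 (with the trunk fix, the trailing-comma field arrives as an empty string):

.import -csv tt.csv tt
UPDATE tt SET y = nullif(y, '');   -- empty string becomes NULL
-- Note this cannot honor the quoted-"" escape idea: by the time the
-- value is in the table, the original quoting is gone.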
(15.3) By Aask (AAsk1902) on 2023-09-17 18:36:45 edited from 15.2 in reply to 13 [link] [source]
- Data inconsistency is a fact of life.
- RFC 4180, the CSV specification, is NOT self-enforcing, and it does not contain any data validation (business or regulatory) rules.
- Column x of a CSV file can easily have, say, integers in the first n rows and a string value in row n+1. Then SQLite will have a column x containing mixed values.
Data acquisition is NOT a one-step process in real life. It comprises three sub-processes (sketched below):
- Extract: read the data as it comes.
- Transform: validate, ensuring consistency, compliance with business logic (e.g. items ordered cannot be less than 1 or a floating point value), and regulatory compliance if applicable.
- Load: commit the transformed (valid) data into permanent tables.
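A sketch of those three steps with the SQLite CLI (the file, table, and column names here are made up for illustration; --skip 1 drops the header row):

CREATE TABLE staging(item TEXT, qty TEXT);       -- Extract: land everything as raw text
.import --csv --skip 1 orders.csv staging
CREATE TABLE orders(item TEXT, qty INTEGER);
INSERT INTO orders                               -- Load: only rows that pass the rules
  SELECT item, CAST(qty AS INTEGER)
  FROM staging
  WHERE qty NOT GLOB '*[^0-9]*'                  -- Transform: qty must be all digits...
    AND CAST(qty AS INTEGER) >= 1;               -- ...and at least 1
DROP TABLE staging;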
There is no practical way to conclude that a CSV file is RFC 4180 compliant without examining every value by column.
A further take on consistency: if the value between two consecutive commas (OR between a comma and a line feed/end of file) were treated as missing - that is, as a NULL value - there would be no inconsistency (an empty string would be specified as consecutive double quotes). This is probably not an available option, in the interest of backward compatibility.
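Under that convention (NOT current .import behavior; just an illustration of the proposal):

a,,c     ->  'a', NULL, 'c'   -- nothing between the commas: missing
a,"",c   ->  'a', '', 'c'     -- consecutive double quotes: explicitly empty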