Is CSV parsing too liberal?

(1) By Harald Hanche-Olsen (hanche) on 2021-05-04 11:29:32 [link] [source]

Consider this transcript:

▶▶▶ create table foo(u,v);
▶▶▶ .shell cat foo.csv
"a b",c
x"d e",f
▶▶▶ .import -csv foo.csv foo
▶▶▶ select * from foo;
u       v
------  -
a b     c
x"d e"  f

Here the second line of the csv file is illegal according to RFC4180. I question whether the sqlite shell should raise an error, or at least issue a warning, when encountering such.

How I got bitten by this? I had a csv file that I wanted to import in reverse order. So I basically did

.import -csv '|tail -r foo.csv'

(On macOS, tail -r prints the file with the order of the lines reversed. On linux, use tac instead.)

I could not for the life of me understand why the first field of the last record included the quotes – until, after wasting much time, I realised that the csv file started with the infamous U+FEFF ZERO WIDTH NO-BREAK SPACE, also known as the byte order mark (BOM). And upon reversing the order of the lines, that was now heading the final line, where it was no longer ignored.

(2) By Richard Damon (RichardDamon) on 2021-05-04 13:00:30 in reply to 1 [source]

My guess this behavior is based on the principle of being liberal as to what you accept, but be strict as to what you generate. There are enough 'broken' CSV generators, that being too strict as to what you accept will cause issues with not being able to import from some sources.

If you really want to be strict, write your own program to do the import, then you can check for all the errors you want, including things that might be valid per the RFC, but are inconsistent data for your application.

The SQLite Shell is really just a convenience tool and isn't supposed to be all things for all users.

(3) By Harald Hanche-Olsen (hanche) on 2021-05-04 13:10:22 in reply to 2 [link] [source]

The SQLite Shell is really just a convenience tool and isn't supposed to be all things for all users.

I appreciate that, but I still think a warning (not an error) would be useful here, since this sort of thing could so easily lead to data corruption. Just like when the number of fields does not match the number of table columns.

It's not a big deal for me, though. Burned once, I have learned that lesson. But I did think it worth mentioning.

(4) By Ryan Smith (cuz) on 2021-05-04 14:05:07 in reply to 3 [link] [source]

I have quite some experience with CSV specifically and the habit is to either ignore text not within the quotes, or to add them to the quoted text (assuming a mistake on the creator's part).

Just produce such a reversed CSV file, or indeed the weirdly-quoted sample you gave, and open with or import it into Excel and Libreoffice or such - chances are you will see two different results but never an error or warning.

I even used to "hide" information in CSV files liek comments, by simply making the text I wish hidden be outside the quotes, so this file:
"Value 1" this is a comment, "Value 2"
would, to the Excel CSV importer look simply like:
"Value 1", "Value 2"
But, that is no longer the case. Newer Excel versions imported it as:
"Value 1 this is a comment", "Value 2"
which gives me chills, but anyway, that's besides the point. (Not sure if it is still the case with the newest versions).

Also, risking seeming like a bit of a Steve: You are only feeling this pain because you have specifically broken the file to no longer conform to the specification and to have a BOM in the middle. This is a basic run-of-the-mill File-BUG, and the only reason you are wishing for a CSV import warning is because your booboo would have been easier to find with the warning. The warning is not going to make your day-to-day future life better, nor ours. It is not a worthwhile implementation.

My advice: Suck this one up son!
(and learn from it, of course) :)

(5) By Harald Hanche-Olsen (hanche) on 2021-05-04 15:02:51 in reply to 4 [link] [source]

Right. Learning: check; sucking it up: working on it! And for the record, I hate the BOM. It just makes life difficult for everyone. Side note: I posted this not primarily for my own benefit, but for others who might run into the same sort of issue. So if it's not going to get fixed, it's all the same to me.

(6) By Richard Damon (RichardDamon) on 2021-05-04 16:19:19 in reply to 5 [link] [source]

Arguably, the error is either in tail, if BOMs are supposed to be ignored at the beginning of files, since it move it, or in the program that added them if they aren't supposed to be ignored at the beginning of files.

Maybe you just need a BOM stripper that you add to the pipe that just removes a leading (or maybe even embedded) BOM mark if programs disagree on the rules for BOMs.

(7) By Harald Hanche-Olsen (hanche) on 2021-05-04 18:07:33 in reply to 6 [link] [source]

Adding a BOM stripper is precisely what I did. One, convert Latin-1 to UTF-8, two, strip the BOM, three, reverse the order of the lines, and the file is ready for import.

if BOMs are supposed to be ignored at the beginning of files

Depends who is doing the supposing. Ignoring a BOM at the beginning of a file is not a universal or very widespread convention, except perhaps in the Windows world. (Did I contradict myself just now?) I can't say I blame tail for not including code for it. As far as it is concerned, it was just another odd character at the beginning of the line. If the line moves, so does the character.

(8) By Richard Damon (RichardDamon) on 2021-05-04 20:32:21 in reply to 7 [link] [source]

As you say, it is sort of a system convention, but not universal. In one sense, stripping out the BOM when reading a known to be text file is a fairly safe operation, as there really is no reason for one to be at the beginning of a file except to indicate it is UTF-8 encoded, even if by the Unicode Standard it isn't supposed to be used that way.

The one big problem with that convention is that it says that utilities that were designed to work with plain ASCII files, and didn't need to care about encodings (like tail) suddenly are now broken as they don't know to do this.

The advantage of this convention (and somewhat why it happens a lot in Windows) is that for programs that DO need to worry about encoding, it provides a big clue of UTF-8 vs local default 8-bit code page.

Windows has a bigger problem with this as it is older, especially in the 'business world where this was more of an issue.

Linux was late enough to be able to just assume UTF-8 unless it was told otherwise, and could live with those issues.

(9) By Simon Slavin (slavin) on 2021-05-05 12:53:52 in reply to 6 [link] [source]

I wonder if there's a standard for all unix/iinux utilities, saying that they should, or should not, understand BOMs.

The unix/linux command sed always removes BOMs. You can use sed along a pipe, or as a converter:

sed <input.csv > output.csv

Of course, if the BOM is there for a good reason, you're losing this functionality.

(10) By Harald Hanche-Olsen (hanche) on 2021-05-06 10:55:45 in reply to 9 [link] [source]

This is getting a bit far afield for the sqlite forum perhaps, but I have yet to come across a version of sed that removes a BOM automatically. Certainly not on the mac (not even GNU sed from macports), and not on ubuntu 20.04.

And on a side note, there may have been good reasons for BOMs in the past, but I think that time is long gone. It's just a hack, it causes no end of trouble, and it needs to go. The Unicode consortium already did their part, by removing the name BYTE ORDER MARK from the code point.

(11) By Simon Slavin (slavin) on 2021-05-06 15:39:12 in reply to 10 [link] [source]

Apologies. I confused two sources I was reading. Here's how to use sed to remove a BOM:

https://unix.stackexchange.com/questions/381230/how-can-i-remove-the-bom-from-a-utf-8-file

Or you can use tail:

tail -c +4 withBOM.txt > withoutBOM.txt