SQLite User Forum


Bug? Warning 'unescaped " character' disabled by adding single backslash before first doublequote

(1) By Oliver Bandel (oliban) on 2022-02-16 17:22:39 [link] [source]

I have a problem with importing a TSV file.
I switched to csv mode, but set .separator "\t",
and would have expected double quotes not to be a problem when using \t as the separator.

But maybe I have overlooked something about how the sqlite3 tool handles CSV and double quotes.

The behaviour that motivated me to write this message
is the following:

I could make the 'unescaped " character' warning go away by replacing the first occurrence of " with \".

Then no more warnings are printed!
Regardless of how double quotes are handled in csv mode (or in csv mode with \t as separator),
this does not make sense to me.

So I think this might be a bug.
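
The import itself boils down to roughly the following (the table name here is just a placeholder; the real names are in the importer.sqlite3 script inside the archive below):

   sqlite> .mode csv
   sqlite> .separator "\t"
   sqlite> .import example_with_errors.tsv example_table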

---------------------------
Used version:
  "SQLite version 3.37.2 2022-01-06 13:25:41"
---------------------------

Arch-package-Info:
---------------------------
Name            : sqlite
Version         : 3.37.2-1
Description     : A C library that implements an SQL database engine
Architecture    : x86_64
URL             : https://www.sqlite.org/
Licenses        : custom:Public Domain
Groups          : None
Provides        : sqlite3=3.37.2
Depends On      : readline  zlib
Optional Deps   : None
Required By     : avidemux-cli  colord  docker  gnupg  ipython  libaccounts-glib  libchamplain  libsoup  libsoup3  neomutt  nss  python-apsw  python2  python2-apsw  qt5-base  subversion  tracker3  webkit2gtk  zathura
Optional For    : apr-util  python
Conflicts With  : None
Replaces        : sqlite3
Installed Size  : 7,37 MiB
Packager        : Andreas Radke <andyrtr@archlinux.org>
Build Date      : Thu 06 Jan 2022 18:29:32 CET
Install Date    : Wed 12 Jan 2022 19:30:18 CET
Install Reason  : Installed as a dependency for another package
Install Script  : No
Validated By    : Signature
---------------------------


For testing I created a directory "sqlite-unescaped-dblquote-problem" with the following files:

  example_unexpected_ok.tsv
  example_with_errors.tsv
  importer.sqlite3
  starttest.bash

The bash file starts the test.
It just uses importer.sqlite3 for creating tables by importing the two tsv files.

The names of the tsv files explain what happens:
one import creates problems, the other one - unexpectedly - does NOT!

I did create a *.tgz file from that directory and used base64 to create an ASCII representation of it.

To recreate the tgz file, do the following:

Paste the base64 stuff shown below (between the start/end markers) into a file "sqlite-unescaped-dblquote-problem.tgz.base64",
then do the following from the shell (the '$' represents the shell prompt):

   $ base64 -d sqlite-unescaped-dblquote-problem.tgz.base64  > sqlite-unescaped-dblquote-problem.tgz
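
and then unpack the archive, e.g.:

   $ tar xzf sqlite-unescaped-dblquote-problem.tgz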




=== start of base64 stuff ===
H4sIAAAAAAAAA+1YPXPbRhBVC/6KDV2oCSh8kSAzaSTFjuSRZc9QiV14RnMElsSZhzvq7kCaatOm
cZMmhceJi/g/pKPU5Udlj5Ql2R4P43giJzFeQRCLw93e2327NzAnglv0K4kmYxPM/XwgTipFpolW
A4Hl1sZHIyCkQeCuYdp+4/oaG2ESBXEUJVEcbtDTMAk2oP3xS69HZSzTABtK8Cnq949b9/w/CrM2
/viUlROBxzNui2PUWmnTsmb6AWu4AHfeH/+oE4SX8Q/TzkYQxWkabkDwj+36Gj7z+O/3+4fe0f7R
we1G0Et6fthOA+8es6g5ExwlnFYaHgwFjnDGjUFpsoINbasRBr223+kmgfctStRMAJM5ZKqcMM0s
cQUoc5VpLpVQo3mrEQRxx++l8SOvT9aJZVCizpitNMNGFHU7fhB3H3nN+1Oe88o04TvpGDfczmFb
SiYMqCHsFlhyYzVNSK5Gfq/dSb19Sf5KWlXRMHiiKu2uNNoWqEs1KeaGZ4beoATz293wkbdNA1Ag
nD+3nMvzX/mQZwhku1r17Acyc2hui9Z+C3arU9aEnEvYZ+fPuVs88INuknh9LJm0PIMZDmiXcRT7
cdwLvYfX6BI8KxB2kFu9eDlCyNHAvjSW28oaWPw0VlKVfEx2DbelnfFsLCo5MmLxUjobk8tHeyor
aMKK3BwufteXLyI0d3QlFRxgRc8pHE1aTAsuHe1J6Kc9YvaBVhkibWFkLrhxm7WYQ99SwA0csinR
9tovdJuJY+Ir7Xl3tJmV4MPe2TM7lvOZ9O+NZxLOnh2N5UzMRnM4mJ//rDfFl7A3PJkJGlbAQSFz
U9CgvROxObcAf/zWPH8Bd/RsuAkHm7Ncl3Nx4vKu40dhkHhHmlEUKC4lSuLFJRTdZFYrMScHKUuS
rh9Haefj4xd1OpHvqo33DTfMUP6ARoNMZ8VyXYopGy39AENaIHaUXFI2EmpALPVVZQtiKAm6fq8X
d7w+U5PzFxblE3ZJIYNTBlOVq4nmU415Bc27TCsj2BTOfqSU5U2Xxu3Ij9u9xNvVvOTLDKaitNwP
OTLAgk250q1G1I5TPw7j1LuSQ/N7JljB2RtqIfvR4hc9UlNO/GAjTOjFMEhjb5fpnCv/EGdODUna
8ZMORffQqRAhQyFgwC/0GsZdx1Dc87atpcBY50OQ0jtB0O14O5UQaLlLTBCM+Mj44pVdvAKt5i4O
AxQu0TfJWLl0Hy1euYn5SUU3aMExfd3UPKAfUgmZsdLGTfuQCaEkxyalSDf2A9q/d/dC3KtVS0b1
Bd2FZEZKcRMTy+SuNBOlLZSVsLxUORNOCUHbb4dh2/m+dB1X8gFTjeklVlSShObDDpNjOMQRFTK4
R/xS9WANalmJ341DooMW49OVAqn2qSuZmxmlkFyVxl4Udy+9pYgsXcTlfjOymIyKa0Ya+8T1/6/3
fxrydEJixPxYjT/oBLC+/0dv9f+EbHX/vwn86/r/4/oAUB8A6gNAfQCoDwA3gfX9n5eOSNSt1dD4
w9dY0/+DJE6uvv8ECfX/KA6juv/fBG7B/jK+QM1KblHRIuXmjArXUKsS6JQHQy6w0bgF21TvlP4K
7i+JcBrJkar9HWp6EFHI3JiLudwEdOdfodEiDVKBMdNGy+DyiKA0NB/bZqO1SjB4z3cmWP491rjM
wvyd4e8cS2FWzOmvc6fPpqvdDJhBsGq1lzccu+6iccPREZEPWvngUx/NbwTr9e/0Yan52haxWPyd
NdZ+/223r/Qfpk7/VBNq/d8Ebn2xVRm9NeByy4W30WhclHn4Gt6u/J+FIGrUqFGjRo0aNWrUqFGj
Ro3/If4EUrWxkwAoAAA=
=== end of base64 stuff ===

(2) By Oliver Bandel (oliban) on 2022-02-16 18:05:15 in reply to 1 [link] [source]

When not "correcting" the initial " by \" then at least one/some following lines of longer files will not be read correctly.

So when the 'unescaped " character' is printed as warning,
lines that not even have any doublequote will be inserted wrong.
(For example I got an empty fields instead of some text).

When doing the \" 'correction', the data is inserted correctly into the table; the text appears then in the cell, as it is in the tsv file.

(3) By Richard Hipp (drh) on 2022-02-16 18:30:41 in reply to 2 [link] [source]

You have provided us with vague hints about what you are experiencing and what you think it is doing wrong. A concrete example would do more to encourage people to investigate and perhaps help you.

(7) By Richard Hipp (drh) on 2022-02-16 21:06:13 in reply to 3 [link] [source]

Apologies - I didn't notice the test data below the fold....

(4) By Larry Brasfield (larrybr) on 2022-02-16 18:56:30 in reply to 2 [link] [source]

Before throwing the word "bug" around, or becoming much attached to notions of what should happen such that "surprise" occurs, I recommend that you read the "standard" (so to speak) on CSV, RFC 4180. There, you will find that neither of your TSV input files conforms (allowing for substitution of the delimiter, which is a common extension). Pay particular attention to item 5 in section 2 of the RFC regarding use of double-quote in the field content rather than as a delimiter.

Here is a session which acted as I expected, because I created input which does conform to RFC 4180 (except for using tab as the delimiter, which I often prefer):

SQLite version 3.35.5 2021-04-19 18:32:05
...
sqlite> .mode tabs
sqlite> .import ok.tsv SeeTSV
sqlite> select * from SeeTSV;
no quotes	a "quoted" word
sqlite> .q
$ cat ok.tsv
"Col_1"	"Col_2"
no quotes	"a ""quoted"" word"
$

Note that there is an ASCII 9 (aka "tab") between the fields within each of the two TSV records.

I had enough trouble understanding your post that I decoded your base64 so that I could examine your inputs. At that point, it was clear that you have ideas about escaping within CSV that do not find support in RFC4180. Given that your input is not proper CSV or TSV, I cannot regard what you saw, however unexpected, as a bug.

(5) By Larry Brasfield (larrybr) on 2022-02-16 19:18:59 in reply to 2 [link] [source]

Further to Richard's point:

Within your post(s) is much extraneous matter and a scarcity of: (1) what you saw; (2) what you got; (3) what was provided as input (see note a below); (4) what you would regard as correct; and (5) why you believe as you do regarding correctness.


a. Providing your inputs in base64, which also contain 20 or more times the data needed to demonstrate the issue, hardly counts as aiding those who might help you. A short, properly quoted for markdown, inline example of your problematic or demonstration inputs is far more likely to be examined by those poised to help.

(6.1) By Keith Medcalf (kmedcalf) on 2022-02-16 20:55:08 edited from 6.0 in reply to 1 [link] [source]

Your Tab Separated Value file does not conform to the requirements for a valid TSV/CSV file. The valid, compliant format is:

ISSN	TITLE
0949-1570	Materialien zur Pflegewissenschaft.
1095-6840	General and comparative endocrinology.
0036-973X	Scripta mercaturae
2286-038X	"""Ovidius"" University Annals of Chemistry."
1572-9567	International journal of thermophysics.
1224-581X	Analele ştiinţifice ale Universităţii "Al.I. Cuza" din Iaşi
1570-0844	Semantic web.
0323-3391	Wissenschaftliche Beiträge des Instituts Ökonomik der Entwicklungsländer an der Hochschule für Ökonomie "Bruno Leuschner" Berlin.
0041-798X	Proceedings of the United States Naval Institute.
0333-5879	Frswm - Hđtknywn-Mkwn đTknwlwgy Lyśr'l, Hfqwlđth Lhndsh đHql'yt  ̨"š Frwf' L'wdrmylq
0946-2104	Trace elements and electrolytes
2248-3276	Analele ştiinţifice ale Universităţii "Al.I. Cuza" din Iaşi
2662-4176	Disaster research and management series on the global South.
0408-9936	Saopštenja Instituta za vodoprivredu "Jaroslav Černi".
1552-3594	Criminal justice and behavior.
2537-3137	Annals of "Valahia" University of Târgovişte
1437-1073	Cardio-News.
1476-4679	Nature cell biology.
1382-4139	Attractor.
0776-0086	Bulletin de la Société royale belge d'études géologiques et archéologiques "Les chercheurs de Wallonie"
0983-0537	Journal de la marine marchande et du transport multimodal.
0005-5115	Buletin ekonomi suku tahunan - Bank Negara Malaysia
0174-8319	Archiv für Genossenschaftswesen
1095-9238	Journal of archaeological science.

The "rules" for field content are that if the field contents contain either the field or record separator or begins with a double-quote mark, then the field must be quoted with double-quotes and all embeded double-quotes must be replaced with two double-quotes side by each.

Once you create a properly formatted input file, everything works correctly.

SQLite version 3.38.0 2022-02-16 19:54:02
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .mode csv
sqlite> .separator "\t"
sqlite> .import example_compliant.tsv x
sqlite> .mode list
sqlite> select * from x;
ISSN|TITLE
0949-1570|Materialien zur Pflegewissenschaft.
1095-6840|General and comparative endocrinology.
0036-973X|Scripta mercaturae
2286-038X|"Ovidius" University Annals of Chemistry.
1572-9567|International journal of thermophysics.
1224-581X|Analele stiintifice ale Universitatii "Al.I. Cuza" din Iasi
1570-0844|Semantic web.
0323-3391|Wissenschaftliche Beiträge des Instituts Ökonomik der Entwicklungsländer an der Hochschule für Ökonomie "Bruno Leuschner" Berlin.
0041-798X|Proceedings of the United States Naval Institute.
0333-5879|Frswm - Hdtknywn-Mkwn dTknwlwgy Lysr'l, Hfqwldth Lhndsh dHql'yt  ?"s Frwf' L'wdrmylq
0946-2104|Trace elements and electrolytes
2248-3276|Analele stiintifice ale Universitatii "Al.I. Cuza" din Iasi
2662-4176|Disaster research and management series on the global South.
0408-9936|Saopstenja Instituta za vodoprivredu "Jaroslav Cerni".
1552-3594|Criminal justice and behavior.
2537-3137|Annals of "Valahia" University of Târgoviste
1437-1073|Cardio-News.
1476-4679|Nature cell biology.
1382-4139|Attractor.
0776-0086|Bulletin de la Société royale belge d'études géologiques et archéologiques "Les chercheurs de Wallonie"
0983-0537|Journal de la marine marchande et du transport multimodal.
0005-5115|Buletin ekonomi suku tahunan - Bank Negara Malaysia
0174-8319|Archiv für Genossenschaftswesen
1095-9238|Journal of archaeological science.
sqlite>

(8) By Oliver Bandel (oliban) on 2022-02-16 21:25:24 in reply to 6.1 [link] [source]

OK, my input does not conform to CSV format.
I wrongly assumed that when I use '.mode csv' and set '.separator "\t"', SQLite then parses TSV-formatted input.
This assumption seems to be wrong.

In my printed documentation for the SQLite shell I did not find the 'tabs' mode, which might have served me better.
But I just tried it and got the same behaviour as with 'csv' mode.

The CSV docs that were linked here in this thread by others did not explain TSV.


When I look into the TSV format description here, I can't find double quotes or any other quoting mechanism mentioned:

https://www.iana.org/assignments/media-types/text/tab-separated-values

(10) By Ryan Smith (cuz) on 2022-02-16 23:12:11 in reply to 8 [source]

That is simply because SQLite does not have a "TSV" import format; it has only a CSV importer. If you say "please import this as CSV data, BUT use a TAB for the separator", it will happily substitute a TAB for the comma, but it does not magically become a TSV importer; it is still a CSV importer playing by the CSV rule-book.

That said, I find it strange that TSV has different rules to CSV; so far I have always expected TSV data to conform exactly to CSV rules, just with the comma changed to a TAB. It is interesting that this is not the case.

(12) By Larry Brasfield (larrybr) on 2022-02-17 11:24:15 in reply to 10 [link] [source]

I find it strange that TSV has different rules to CSV ...

"TSV" is a more nebulous term than even "CSV", in the big wide world. As far as the CLI's .import command works, it's just a different separator, as you say.

To get CLI output that can be brought back in via .import, with tab field separators, this is needed:

.mode csv
.separator "\t"

Note that ".mode tabs" does not have the same effect on output.
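
For example, a round trip might look roughly like this (a sketch only: the .headers and .once lines and the SeeTSV_copy name are illustrative additions, reusing the SeeTSV table from the session up-thread):

sqlite> .mode csv
sqlite> .separator "\t"
sqlite> .headers on
sqlite> .once out.tsv
sqlite> select * from SeeTSV;
sqlite> .import out.tsv SeeTSV_copy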

(9.3) By Keith Medcalf (kmedcalf) on 2022-02-16 21:39:27 edited from 9.2 in reply to 6.1 [link] [source]

You will note that the format for a "separated values" file is as follows:

<field>FS<field>FS<field> ... RS
<field>FS<field>FS<field> ... RS
<field>FS<field>FS<field> ... RS
...
<field>FS<field>FS<field> ... RS

where FS = the Field Separator Character
and RS = the Record Separator Character

For each field, if (instr(field, RS)>-1 || instr(field, FS)>-1 || field[0]=='"') then the field MUST BE DIDDLED. Otherwise diddling is not required (though there are many formats of separated values files that do diddle with fields that do not require diddling). Applying extra field diddling where not required merely consumes resources but does not have any effect on the value, because it will be de-diddled before use.

If the field requires DIDDLING, then the following diddling should be performed and the diddled result used as the field in the separated file.

replace all double-quotes (") with two double-quotes side-by-each (""), then
prepend a double-quote (") and append a double-quote (")

On reading a "separated values" file, then if the FIRST CHARACTER of an input field (that is, the byte immediately following the RS or FS or beginning of file) is a double-quote (") then the read contents need to be DE-DIDDLED before they are used. Otherwise the field result value is the entire literal value contained between the separators.

If the field is DIDDLED (ie, the first byte is a double-quote (")) then the field value must be de-diddled. The first byte, which is a double-quote (") is removed. The last byte, which must be a double-quote (") is also removed. Any two byte sequence that contains two double-quotes side-by-each ("") is replaced with a single double-quote (").

It should be noted that if a field begins with a double-quote then it must also end with a double-quote. If this is not the case, then the result would be erroneous and the failure should be very noisy and very loud.
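
As a sketch only, that de-diddling can be expressed the same way in SQL (the sample value is the already-diddled "Ovidius" title from the compliant file shown earlier):

-- Sketch: de-diddle a field known to start (and therefore end) with a double-quote:
-- drop the first and last byte, then turn each "" back into ".
select replace(substr(v, 2, length(v) - 2), '""', '"') as dediddled
  from (select '"""Ovidius"" University Annals of Chemistry."' as v);
-- yields: "Ovidius" University Annals of Chemistry.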

(11.1) By Harald Hanche-Olsen (hanche) on 2022-02-17 08:49:19 edited from 11.0 in reply to 9.3 [link] [source]

For each field, if (instr(field, RS)>-1 || instr(field, FS)>-1 || field[0]=='"') then the field MUST BE DIDDLED. Otherwise diddling is not required (though there are many formats of separated values files that do diddle with fields that do not require diddling).

I notice that sqlite does indeed handle unquoted fields containing a double quote not at the beginning this way, but on export it diddles them:

▶▶▶ .sh cat tt.csv
a,b,c
1,"b c",d
2,e"f,g
▶▶▶ .import -csv tt.csv tt
▶▶▶ select * from tt;
┌───┬─────┬───┐
│ a │  b  │ c │
├───┼─────┼───┤
│ 1 │ b c │ d │
│ 2 │ e"f │ g │
└───┴─────┴───┘
▶▶▶ .mode csv
▶▶▶ select * from tt;
1,"b c",d
2,"e""f",g

RFC 4180 insists on diddling those fields, though:

If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

So it is good to know that sqlite creates conforming csv, while being more liberal on input. But it could also be useful to know that the standard is indeed more strict. (I only point this out because of the frequent references to RFC 4180 when the topic comes up. Similar comments apply to CRLF versus LF line endings: I am happy that every csv reader that I know of recognises the latter.)