csv export not encoding as UTF-8 by default
(1) By RyanBee on 2022-02-07 15:46:26 [link] [source]
Greetings!
I have my database encoded in UTF-8. When viewing the database contents, such as via SQLite Studio, the name "Examplé" (it's not actually that but it may as well be) shows fine. When exporting to CSV using SQLite Studio, and specifying the encoding as UTF-8, it reads fine. The exported CSV shows "Examplé" as well.
However, when using the following command (albeit with a few minor redactions) in a powershell script, the resultant CSV shows "Examplé".
C:DatabaseSQLitesqlite3.exe -header -csv {Database Path} "Select * from {TableName};" > $ExportPath
In both cases the output is being viewed in Notepad. Is there a way to specify the output encoding when calling the program in this way? Why is it not defaulting to UTF-8 anyway, as that's the encoding of the table, showing when running a "pragma encoding" query? Any assistance which can be rendered would be greatly appreciated.
(2.1) By Gunter Hick (gunter_hick) on 2022-02-07 16:21:50 edited from 2.0 in reply to 1 [source]
The second rendering looks like a UTF-8 in an ISO environment. Are the files actually different? Try looking at the hex dump. Edit: lowercase e acute is Codepoint C3 A9 in UTF-8, which translates to "capital A tilde" followed by "copyright sign" in ISO
(3) By Larry Brasfield (larrybr) on 2022-02-07 23:28:46 in reply to 1 [link] [source]
This is a long post showing how I got to the bottom of the OP's trouble. It is detailed to permit replication of results and show the evidence for what I contend is the problem and justify the (far) below work-around.
In a trust-worthy, UTF-8 conversant editor, prepare utf8_stuff.csv:
"é",".i emacs edit"
"p1","cp from post1"
"ce","cp from emacs"
"lc","blob-lit cast"
In and interactive sqlite3 session, on PowerShell:
PS> sqlite3 utf8.sdb
, enter this with hand-edits where indicated:
sqlite> # Copy and paste e-acute from post #1 into the "paste here":
sqlite> .parameter set $tv_p1 "'paste here'"
sqlite> # Copy and paste e-acute from trusty utf8 editor into the "paste here":
sqlite> .parameter set $tv_ce "'paste here'"
sqlite> # Be sure the quoting was preserved.
Then, .read in or paste into the same session:
sqlite> create table t(tvalue text, provenance text);
sqlite> .import --csv utf8_stuff.csv t
sqlite> update t set tvalue=cast(x'c3a9' as text) where tvalue='lc';
sqlite> update t set tvalue=$tv_p1 where tvalue='p1';
sqlite> update t set tvalue=$tv_ce where tvalue='ce';
sqlite> # Save some output for later comparison:
sqlite> .output ps_shout_utf8.txt
sqlite> .output
sqlite> # Show DB content on console for grins.
sqlite> select * from t;
sqlite> .quit
Back in the PowerShell Core session:
PS> sqlite3 utf8.sdb "select * from t;"
é|.i emacs edit
?|cp from post1
?|cp from emacs
é|blob-lit cast
PS> # Above ? is what we should expect from mystery clipboard transfers.
PS> # The BLOB literal cast to text serves as a check on the editor output.
PS> # Now, see what happens with PowerShell redirect to file:
PS> sqlite3 utf8.sdb "select * from t;" > ps_redir_utf8.txt
PS> type .ps_redir_utf8.txt
é|.i emacs edit
é|cp from post1
é|cp from emacs
é|blob-lit cast
PS> # This strange reversal of "rightness" vs "wrongness" is a clue.
From a Linux (under WSL) shell:
$ ./sqlite3 utf8.sdb "select * from t;"
é|.i emacs edit
�|cp from post1
�|cp from emacs
é|blob-lit cast
$ # So far, so good.
$ cat ps_redir_utf8.txt
é|.i emacs edit
é|cp from post1
é|cp from emacs
é|blob-lit cast
$ # Looks like the PS created file from an output redirect is funky,
$ # although the path (browser to clipboard to sqlite3 interactive input
$ # to query display to PowerShell redirector to cat to display rendering)
$ # seems to have straightened out. (It's not e-accent in the DB.)
$ # This will be further proven below, with a hex dump.
$ ./sqlite3 utf8.sdb
sqlite> .output uw_shout_utf8.txt
sqlite> select * from t;
sqlite> .output
sqlite> .quit
$
$ cat uw_shout_utf8.txt
é|.i emacs edit
�|cp from post1
�|cp from emacs
é|blob-lit cast
$ # Fine. The funky content (pasted in from clipboard) content is shown funky.
$ ./sqlite3 utf8.sdb "select * from t;"
é|.i emacs edit
�|cp from post1
�|cp from emacs
é|blob-lit cast
$ ./sqlite3 utf8.sdb "select * from t;" > uw_redir_utf8.txt
$ cat uw_redir_utf8.txt
é|.i emacs edit
�|cp from post1
�|cp from emacs
é|blob-lit cast
$ # Still fine. (Unsurprising since no encoding/decoding is happening.)
$
$ # Dump of redirected sqlite3 query output from Linux Bash:
$ ./sqlite3 utf8.sdb "select * from t;" | od -t c -t x1
0000000 303 251 | . i e m a c s e d i t
c3 a9 7c 2e 69 20 65 6d 61 63 73 20 65 64 69 74
0000020 n 202 | c p f r o m p o s t 1
0a 82 7c 63 70 20 66 72 6f 6d 20 70 6f 73 74 31
0000040 n 202 | c p f r o m e m a c s
0a 82 7c 63 70 20 66 72 6f 6d 20 65 6d 61 63 73
0000060 n 303 251 | b l o b - l i t c a s
0a c3 a9 7c 62 6c 6f 62 2d 6c 69 74 20 63 61 73
0000100 t n
74 0a
0000102
See whether aforeseen goofiness is a Windows or PowerShell issue.
In a Windows 10 cmd.exe session:
% rem Dump of redirected sqlite3 query output from cmd.exe
%.sqlite3.exe utf8.sdb "select * from t;" | od -t c -t x1
0000000 A c | . i e m a c s e d i t
c3 a9 7c 2e 69 20 65 6d 61 63 73 20 65 64 69 74
0000020 r n , | c p f r o m p o s t
0d 0a 82 7c 63 70 20 66 72 6f 6d 20 70 6f 73 74
0000040 1 r n , | c p f r o m e m a
31 0d 0a 82 7c 63 70 20 66 72 6f 6d 20 65 6d 61
0000060 c s r n A c | b l o b - l i t
63 73 0d 0a c3 a9 7c 62 6c 6f 62 2d 6c 69 74 20
0000100 c a s t r n
63 61 73 74 0d 0a
0000106
% rem Looks fine.
The immediately preceding dumps are "correct", reflecting the DB content.
Let's see what the PowerShell redirected output looks like (from Bash):
$ od -t c -t x1 ps_redir_utf8.txt
0000000 342 224 234 342 214 220 | . i e m a c s
e2 94 9c e2 8c 90 7c 2e 69 20 65 6d 61 63 73 20
0000020 e d i t r n 303 251 | c p f r o m
65 64 69 74 0d 0a c3 a9 7c 63 70 20 66 72 6f 6d
0000040 p o s t 1 r n 303 251 | c p f r
20 70 6f 73 74 31 0d 0a c3 a9 7c 63 70 20 66 72
0000060 o m e m a c s r n 342 224 234 342 214 220
6f 6d 20 65 6d 61 63 73 0d 0a e2 94 9c e2 8c 90
0000100 | b l o b - l i t c a s t r n
7c 62 6c 6f 62 2d 6c 69 74 20 63 61 73 74 0d 0a
0000120
The redirected file content has had plenty of "helpful" translation done to it. Unfortunately, even though there is a highly flexible "Out-File" cmdlet in PowerShell for output redirection, with a variety of encoding options, it has no such option whose proper name would be "DoNotTranslateInAnyWayShapeOrForm". Or it might be named "Utf8ToUtf8". (Perhaps somebody will contribute such a cmdlet, or has published one somewhere.)
In Powershell, I used this workaround to get an unmolested result:
PS> sqlite3 utf8.sdb -cmd '.output psoc_utf8.txt' -cmd "select * from t;" -cmd .output -cmd .quit
Since that exploits a recent, unreleased improvement, this can be used to do the same with released SQLite CLIs:
echo .quit | sqlite3 utf8.sdb -cmd '.output psoc_utf8.txt' -cmd "select * from t;" -cmd .output
The bottom line is that, as the OP almost said and demonstrated, output redirection in PowerShell does not play well with the SQLite shell's UTF-8 output. I have shown, per above and by other means to my own conviction, that the CLI in fact emits UTF-8 to its stdout stream when its output has been redirected (to something other than a character device, such as an open-for-write file handle or pipe.)
Whether the SQLite CLI should anticipate bizarre translation by PowerShell's redirection mechanism (or stock cmdlet) is an open question. It already does translation for output to a Windows console; that was necessary to make it useful interactively on that platform.
(4) By anonymous on 2023-08-21 11:13:52 in reply to 3 [link] [source]
Thank you very much for this! Especially the work arounds in the end of your post was usable!
/B 🙂