SQLite Forum

CLI's .dump and ASCII vs UTF8

(1) By ddevienne on 2021-11-08 09:15:56 [link] [source]

Hi. Just noting something perhaps a bit misleading about the doc of .dump.

It says ASCII, but .dump seems to produce UTF instead. (UTF-8 in my case, didn't try with a UTF-16 DB)

I searched for utf and encoding in the CLI's doc, and found no hit, so not much on international chars there either.

We just discovered invalid UTF-8 characters in an SQLite DB of ours, when trying to import it into PostgreSQL,
which, unlike SQLite, enforces valid UTF-8 for text-typed columns. Which makes me think it might be nice to have
a basic QC check pragma for the UTF-ness of all text-typed values in a DB, BTW.

I also just discovered (fortuitously) that length(text-col) is UTF-aware, which surprised me,
so obviously there's already code in SQLite to check that all bytes have the bit patterns of UTF-8 encoded code points.

Below is the session log of my small experimentation. More details on how .dump handles international characters might be nice to have in the doc. My $0.02 :)

Create a DB with a non-ASCII codepoint (the degree symbol):

C:\Users\ddevienne>sqlite3 degree.db
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite> create table t (v text);
sqlite> insert into t values (char(176));
sqlite> select v, hex(v), length(v) from t;
°|C2B0|1
sqlite> .exit

.dump it, and recreate a DB from the dump:

C:\Users\ddevienne>sqlite3 degree.db .dump > degree.sql

C:\Users\ddevienne>sqlite3 degree2.db < degree.sql

Look at the recreated DB:

C:\Users\ddevienne>sqlite3 degree2.db
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite> select v, hex(v), length(v) from t;
°|C2B0|1
sqlite> select v, hex(v), length(v), length(cast (v as blob)) from t;
°|C2B0|1|2
sqlite> .exit

(Notice the difference between length() for the value as text or blob)

Look at the bytes of the .dump output, to verify 0xC2B0 (i.e. the UTF-8 encoding of that codepoint) was used in the file, instead of pure ASCII (that would use char(176) for example):

C:\Users\ddevienne>powershell
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

PS C:\Users\ddevienne> Format-Hex degree.sql
           Path: C:\Users\ddevienne\degree.sql
           00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000   50 52 41 47 4D 41 20 66 6F 72 65 69 67 6E 5F 6B  PRAGMA foreign_k
00000010   65 79 73 3D 4F 46 46 3B 0D 0A 42 45 47 49 4E 20  eys=OFF;..BEGIN
00000020   54 52 41 4E 53 41 43 54 49 4F 4E 3B 0D 0A 43 52  TRANSACTION;..CR
00000030   45 41 54 45 20 54 41 42 4C 45 20 74 20 28 76 20  EATE TABLE t (v
00000040   74 65 78 74 29 3B 0D 0A 49 4E 53 45 52 54 20 49  text);..INSERT I
00000050   4E 54 4F 20 74 20 56 41 4C 55 45 53 28 27 C2 B0  NTO t VALUES('°
00000060   27 29 3B 0D 0A 43 4F 4D 4D 49 54 3B 0D 0A        ');..COMMIT;..
PS C:\Users\ddevienne>

(2.1) By Larry Brasfield (larrybr) on 2021-11-08 12:48:08 edited from 2.0 in reply to 1 [link] [source]

You have made a convincing case that said doc should say "UTF-8" rather than "ASCII". Please clarify if there is some additional problem brought to light by your post. (I gather 1 error and 1 feature suggestion from it.)

BTW, ASCII is strictly a 7-bit coding scheme, so "°" cannot be encoded in it with any small number of bytes.

(3) By Larry Brasfield (larrybr) on 2021-11-08 14:07:33 in reply to 1 [source]

[The .dump doc] says ASCII, but .dump seems to produce UTF instead.

This has been fixed. Thanks.

[I]t might be nice to have a basic QC check pragma for UTF-ness of all text-typed values in a DB, BTW.

Yes, it might. This is among the arguments to be made for shell extensibility, IMO.

More details on how .dump handles international characters might be nice to have in the doc. My $0.02 :)

For most platforms, .dump output is fprintf() output. I think little needs to be said about that since it is either uninteresting or strictly a platform quirk.

For Windows, .dump output is little more than fprintf() unless output is going to the console. In that case, (as a special dispensation to that quirky but common destination), output is translated as needed to MBCS because that console handles UTF-8 poorly. I'm not sure how much should be said about this, if anything, because it will rarely apply to output used programmatically.

(4) By ddevienne on 2021-11-08 15:41:38 in reply to 3 [link] [source]

This has been fixed

I don't see it on the live site, it's probably on Draft only. I didn't check the timeline.

arguments to be made for shell extensibility

Pragmas are not in the shell, they are in SQLite core.

UTF-ness of text values is one of those things SQLite assumes but is permissive about.
Similar to how CHECK constraints can be disabled (but there's a pragma to check them),
or how FOREIGN KEY constraints are not enforced by default (but there's a pragma, several in fact, to check them).

IMHO, UTF-ness is in the same category. As such, there should be a pragma to
check basic UTF-8 (and UTF-16) validity of text values, which checks that all bytes have valid
bit patterns for UTF-8, that lead bytes are followed by the expected number of continuation bytes, etc...
And possibly that code points are in valid ranges (or rather, not in invalid ranges), and/or that surrogate pairs are properly paired.

(UTF-8 is designed such that lead bytes and continuation bytes have distinguishable high-order bit patterns, but many of us know that.)
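As a sketch of those bit-pattern rules (not SQLite's actual implementation), a minimal structural UTF-8 check could look like the following; it validates lead/continuation byte patterns only, and deliberately skips the overlong-form and surrogate-range checks mentioned above:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Minimal UTF-8 structural check: each lead byte announces how many
    continuation bytes (10xxxxxx) must follow. Sketch only; does not
    reject overlong encodings or surrogate code points."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:            # 0xxxxxxx: plain ASCII, stands alone
            need = 0
        elif 0xC0 <= b < 0xE0:  # 110xxxxx: 1 continuation byte follows
            need = 1
        elif 0xE0 <= b < 0xF0:  # 1110xxxx: 2 continuation bytes follow
            need = 2
        elif 0xF0 <= b < 0xF8:  # 11110xxx: 3 continuation bytes follow
            need = 3
        else:                   # stray continuation byte or invalid lead
            return False
        i += 1
        for _ in range(need):
            if i >= n or data[i] & 0xC0 != 0x80:
                return False    # missing or malformed continuation byte
            i += 1
    return True
```

For instance, the degree sign's two bytes C2 B0 pass, while a lone C2 lead byte fails.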

As I wrote, the built-in SQL length() is UTF-aware already, so the UTF rules I mention above are already in SQLite's code base.
So the pragma I'm hoping for would use existing pragmas to craft queries that return all values
(since SQLite is flexibly typed, we can't restrict ourselves to just text-typed columns...), and scan their bytes for UTF-ness.
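A user-level approximation of that hoped-for pragma can be sketched from outside the core, e.g. in Python; the function below is hypothetical (not an actual SQLite feature), discovering tables via sqlite_master and checking only values whose typeof() is 'text':

```python
import sqlite3

def scan_invalid_utf8(db_path):
    """Report (table, column, rowid) triples whose stored text is not
    valid UTF-8. Sketch of the QC check discussed above, not a real
    SQLite pragma. Assumes ordinary rowid tables."""
    con = sqlite3.connect(db_path)
    con.text_factory = bytes  # fetch raw bytes, bypassing Python's decoding
    bad = []
    tables = [r[0].decode() for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        cols = [r[1].decode() for r in con.execute(
            f'PRAGMA table_info("{table}")')]
        for col in cols:
            # typeof() = 'text' skips integers, blobs, etc., since
            # SQLite's flexible typing lets any column hold any type.
            rows = con.execute(
                f'SELECT rowid, "{col}" FROM "{table}" '
                f"WHERE typeof(\"{col}\") = 'text'")
            for rowid, value in rows:
                try:
                    value.decode('utf-8')
                except UnicodeDecodeError:
                    bad.append((table, col, rowid))
    con.close()
    return bad
```

Storing the invalid byte 0xC2 as text (e.g. via CAST(x'C2' AS TEXT)) is enough to make it show up in the report.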

I know, easier said than done, especially to 100% MC/DC coverage, since it would be in SQLite's core.
I'm merely thinking aloud, and I'm used to being ignored on this forum by now.

I'm not sure how much should be said about this

Well, given the above, the encoding of the .dump output is not even UTF, it is whatever the text values contain...
Because again, UTF-ness is not enforced in SQLite. So it's even more ambiguous than ASCII or UTF-8.

At the very least, there should be warnings about the possibility that the output is in no defined encoding at all,
since it depends on the text values themselves, which can basically be anything.
(The line between text and blob is tenuous in SQLite, while it is not in other DBs.)

Also, there won't be a BOM, even if the output uses non-ASCII multi-byte UTF-8 codepoints.

An option to .dump to force ASCII output, with char(X) for (valid) non-ASCII codepoints, would be nice.
I don't recall if there's a way to escape/encode arbitrary bytes in text literals (x'...' implies a blob value, if I recall correctly).
.dump is at least specific to the CLI, so not subject to 100% MC/DC :)
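For illustration, such an ASCII-forcing mode could render each non-ASCII code point with char(); the helper below is a hypothetical sketch of that idea, not an existing .dump option:

```python
def ascii_literal(text):
    """Render a string as an ASCII-only SQLite expression, splicing in
    char(N) for each non-ASCII code point. Hypothetical sketch of a
    '.dump --ascii' behavior, not an existing CLI feature."""
    parts, run = [], []

    def flush():
        # Emit any pending run of printable ASCII as a quoted literal,
        # doubling embedded single quotes per SQL rules.
        if run:
            parts.append("'" + ''.join(run).replace("'", "''") + "'")
            run.clear()

    for ch in text:
        if ' ' <= ch <= '~':      # printable ASCII stays in the literal
            run.append(ch)
        else:                     # everything else becomes char(N)
            flush()
            parts.append(f'char({ord(ch)})')
    flush()
    return ' || '.join(parts) if parts else "''"
```

Feeding the result back through SQLite reproduces the original value; for example, ascii_literal("a°b") yields 'a' || char(176) || 'b'.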

(5) By Larry Brasfield (larrybr) on 2021-11-08 16:59:44 in reply to 4 [link] [source]

This has been fixed

I don't see it on the live site, it's probably on Draft only. I didn't check the timeline.

It's in the docs repository, to be published in due course. (Minor changes are not normally published immediately.)

I'm not sure how much should be said about this

Well, given the above, the encoding of the .dump output is not even UTF, it is whatever the text values contain...
Because again, UTF-ness is not enforced in SQLite. So it's even more ambiguous than ASCII or UTF-8.

The .dump output data are returns from sqlite3_column_text(...), which are nominally UTF-8 regardless of the overall DB text encoding. Here, "nominally" means when text inputs have consisted of valid code point sequences. For DB encodings other than SQLITE_UTF8, those returns will be translated from the raw stored content.
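That translation can be observed from the API side. The quick check below (assuming Python's sqlite3 module, which reads text through sqlite3_column_text; the file name is a throwaway) shows a value stored in a UTF-16 database coming back as the same Unicode string:

```python
import os
import sqlite3
import tempfile

# PRAGMA encoding must run before the first table is created,
# otherwise it is silently ignored and the DB stays UTF-8.
path = os.path.join(tempfile.mkdtemp(), 'degree16.db')
con = sqlite3.connect(path)
con.execute("PRAGMA encoding = 'UTF-16le'")
con.execute("CREATE TABLE t (v TEXT)")
con.execute("INSERT INTO t VALUES (char(176))")

(enc,) = con.execute("PRAGMA encoding").fetchone()
(v,) = con.execute("SELECT v FROM t").fetchone()
# The stored bytes are UTF-16le, yet the text API hands the
# degree sign back unchanged.
print(enc, v)
con.close()
```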

I don't think it should be .dump's job (or option) to somehow filter invalid UTF-8 text that has been stored by users. Its job is simple, restricted to producing a text representation of the existing DB.

Also, there won't be a BOM, even if the output uses non-ASCII multi-byte UTF-8 codepoints.

I suppose a --bom option might make sense for the .dump command. However, given that its output will normally be either subject to much further processing or fed directly to sqlite3, I do not yet see real value or necessity for that.

[general point that invalid code point treatment should be documented]

The fact that SQLite stores text as given to it, with possible translation between encoding schemes but no rejection of technically invalid code points, deserves mention somewhere. My hesitation is against cluttering the docs with this level of detail wherever the terms "UTF-8" or "UTF-16*" are used. I will find a place to mention this behavior, and maybe link to it where use of those terms might lead to false expectations.

... there should be a pragma to check basic UTF-8 (and UTF-16) of text values, which ...
I'm merely thinking aloud, and I'm used to being ignored on this forum by now.

This deserves a separate thread, and attention for a post-3.37 release.