SQLite Forum

null character sorts greater than 0x7f and less than 0x80

(1) By Poor Yorick (pooryorick) on 2020-08-01 08:19:38

The output of the following script is 127 0 128 rather than the expected 0 127 128

#! /usr/bin/env tclsh
package require sqlite3
sqlite3 db :memory:

db eval {
    create table t (
        v text
    )
}

foreach val {
    \x80
    \x7f
    \x00
} {
    string range $val 0 end ;# presumably to force generation of the string representation
    #puts [::tcl::unsupported::representation $val]
    db eval {
        insert into t values ($val)
    }
}

set sorted {}
db eval {
    select v from t order by v collate binary asc 
} {
    scan $v %c ord
    lappend sorted $ord
}

puts $sorted

The correct output would be 0 127 128, wouldn't it?

-- Yorick

(2) By Richard Hipp (drh) on 2020-08-01 18:40:41 in reply to 1

Apparently TCL is encoding the \x00 character as two bytes: 0xC0 0x80. You can see this if you do:

  puts [db eval {
    SELECT rowid, quote(CAST(v AS BLOB)) FROM t;
  }]

(3) By Tim Streater (Clothears) on 2020-08-01 19:41:07 in reply to 2

That's non-standard UTF-8, as described here:

https://en.wikipedia.org/wiki/UTF-8

See the section on overlong encodings. The section that follows it also explains why overlong encodings are not a good idea.

(4) By Richard Damon (RichardDamon) on 2020-08-01 20:34:33 in reply to 3

But also see section 4.2 (Modified UTF-8) for why it is also a "standard" variation on UTF-8: it allows the insertion of the NUL character into zero-terminated strings without the character being taken as the end-of-string code, which is what was attempted here.

(5) By Larry Brasfield (LarryBrasfield) on 2020-08-01 21:28:19 in reply to 4

Modified UTF-8, which describes the modification as representing the null byte as "the two-byte overlong encoding 11000000 10000000 (hexadecimal C0 80)", says that Tcl uses that overlong encoding for internal representation of Unicode data but uses strict CESU-8 (which "does not have the special encoding of the NUL character") for external data. If these Wikipedia-claimed facts are true, then I think the TCL SQLite library should be storing the null byte as a simple null byte when UTF-8 is the chosen encoding, however it may be represented internally within the TCL runtime engine. For that reason, I think the OP's report indicates a SQLite bug because, in effect, it causes TCL with the SQLite library to deviate from that implied promise.

Of course, this is merely a semantic analysis; it says nothing about the merits of having the SQLite representation of a null byte act like a TCL internal representation.

(6) By Richard Damon (RichardDamon) on 2020-08-01 23:30:00 in reply to 5

I am not sure it is SQLite's job to decide which variation(s) of UTF-8 to allow, or whether it should require the application to use strictly "standard" UTF-8. CESU-8 is a separate variation (it encodes non-BMP characters as the encodings of the two surrogates instead of the one single code point). That variation is basically orthogonal to Modified UTF-8, which deals only with the encoding of the code point 0. Maybe you could say that TCL is using "Modified CESU-8".

(7.1) By Larry Brasfield (LarryBrasfield) on 2020-08-02 00:06:10 edited from 7.0 in reply to 6

If I have interpreted the cited Wikipedia article assertions correctly, Tcl uses the overlong null character representation internally but the canonical, single-byte representation for external data.

Maybe you could say that TCL is using "Modified CESU-8".

I might say that, if it was true. But it is not.

At least on Windows, Tcl 8.6 uses the overlong null representation in string variables for the \x00 character but writes it out (via puts) using the single-byte representation conforming to unmodified CESU-8. I have demonstrated this with Magicsplat Tcl/Tk for Windows, version 8.6.10, with stdout redirected to a file to avoid console weirdness.

I maintain that, as an alternative way of storing or retrieving "external data", SQLite should do the same translation between Modified UTF-8 internal data and either (plain) CESU-8, which uses the short, canonical null representation, or standard UTF-8, which also uses the short representation, for external data. [a]

[a. The difference between CESU-8 and UTF-8 lies in how code points above U+FFFF are represented; it is immaterial to this discussion. ]

This would, in my opinion, make SQLite adhere more closely to Tcl's design intention. I suspect that the overlong representation is used because it simplifies Tcl's internal operations. If the representation never escapes the internal representation (except by purposeful, debug-like commands), then it is purely an implementation detail. I think the OP was rightfully surprised.

(8) By Richard Damon (RichardDamon) on 2020-08-02 00:56:10 in reply to 7.1

If, for internal storage, TCL converts strings with embedded nulls into Modified UTF-8 so that it can use text with embedded nulls without the workaround otherwise needed to force SQLite to handle an embedded nul in a string, that is TCL's business, and it is TCL's responsibility to convert it on reading.

If SQLite doesn't make the change going in, it shouldn't make the change coming out.

It doesn't matter that SQLite has a relationship with TCL; it is not dedicated to TCL, so it should not be doing TCL's job for it, or it might break other applications' use of this same encoding trick. Why should SQLite be bent to meet some other language's design intent?

Now, one side effect of using this encoding is that the simple collation sequence that SQLite uses by default will sort the value 0 between 0x7F and 0x80. If TCL thinks this error is significant, it can either not use the overlong encoding that causes the problem or provide a collation that handles it right. (Maybe it could be argued that SQLite could provide such a collation if the expense isn't that high.)
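
To illustrate, here is a rough sketch of such a collation in C (hypothetical throughout; the mutf8 name and code are illustrative, not anything SQLite ships):

/*
** Sketch of a hypothetical collation: it compares UTF-8 byte-wise,
** like BINARY, except that the overlong sequence 0xC0 0x80 is
** ordered as if it were a genuine 0x00 byte.
*/
#include <sqlite3.h>

static int mutf8Collate(void *pUnused, int nA, const void *pA,
                        int nB, const void *pB){
  const unsigned char *a = (const unsigned char*)pA;
  const unsigned char *b = (const unsigned char*)pB;
  int ia = 0, ib = 0;
  (void)pUnused;
  while( ia<nA && ib<nB ){
    unsigned int ca = a[ia], cb = b[ib];
    if( ca==0xC0 && ia+1<nA && a[ia+1]==0x80 ){ ca = 0; ia += 2; }else{ ia++; }
    if( cb==0xC0 && ib+1<nB && b[ib+1]==0x80 ){ cb = 0; ib += 2; }else{ ib++; }
    if( ca!=cb ) return ca<cb ? -1 : 1;
  }
  return (nA-ia) - (nB-ib);  /* the shorter remaining value sorts first */
}

/* Registered once per connection, e.g.
**   sqlite3_create_collation(db, "mutf8", SQLITE_UTF8, 0, mutf8Collate);
** after which:  SELECT v FROM t ORDER BY v COLLATE mutf8;
*/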

(9) By anonymous on 2020-08-02 06:35:28 in reply to 8

(Maybe it could be argued that SQLite could provide such a collation if the expense isn't that high)

Perhaps as an extension, which would be included with the Tcl interface (but can also be used independently if wanted). (Probably it need not be built in to SQLite.)
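
A minimal loadable-extension skeleton along those lines (hypothetical throughout: the file name, entry point, and build line are assumptions, and the mutf8Collate definition sketched in the previous post would be pasted in):

/* Hypothetical loadable extension wrapping the collation sketched above.
** Build:  gcc -shared -fPIC mutfcoll.c -o mutfcoll.so
** Load:   SELECT load_extension('./mutfcoll');
*/
#include "sqlite3ext.h"
SQLITE_EXTENSION_INIT1

/* mutf8Collate definition from the previous post goes here. */
static int mutf8Collate(void *pUnused, int nA, const void *pA,
                        int nB, const void *pB);

#ifdef _WIN32
__declspec(dllexport)
#endif
int sqlite3_mutfcoll_init(sqlite3 *db, char **pzErrMsg,
                          const sqlite3_api_routines *pApi){
  SQLITE_EXTENSION_INIT2(pApi);
  (void)pzErrMsg;
  return sqlite3_create_collation(db, "mutf8", SQLITE_UTF8, 0,
                                  mutf8Collate);
}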

(11) By Larry Brasfield (LarryBrasfield) on 2020-08-02 16:13:39 in reply to 8

If, for internal storage, TCL converts strings with embedded nulls into Modified UTF-8 so that it can use text with embedded nulls without the workaround otherwise needed to force SQLite to handle an embedded nul in a string, that is TCL's business, and it is TCL's responsibility to convert it on reading.

I agree mostly, but you are only restating what I said earlier except for the "force SQLite" part. That is where things become interesting and less clear.

If SQLite doesn't make the change going in, it shouldn't make the change coming out.

That is either an assertion of the obvious ("round-tripping should preserve data") or begging the question. As I see it, the issue (that we are trying to get at) is whether SQLite (or any other Tcl library which is implemented such that it gets Tcl string objects in Tcl's internal form) should convert that internal form into the external (standard UTF-8) form on output and convert such external form back into the internal form on input. I think the answer is "clearly yes."

It doesn't matter that SQLite has a relationship with TCL; it is not dedicated to TCL, so it should not be doing TCL's job for it, or it might break other applications' use of this same encoding trick. Why should SQLite be bent to meet some other language's design intent?

This gets to the crux of the matter, I think. I am not saying what SQLite should do generally except insofar as the principles [a] extend to other environments. The question here is what should the Tcl library known as sqlite3 do? For efficiency reasons, Tcl nowadays has internal forms rather than treating everything as strings. And also for efficiency reasons, SQLite for Tcl implements interfaces that operate below the ostensible "everything is a string" level, and so naturally those interfaces traffic in the internal forms. I believe that cross-language database compatibility considerations work to make it desirable for the Tcl SQLite library to perform internal/external form conversions. This has nothing to do with the origin of SQLite as a Tcl library and everything to do with its Tcl library build being a useful and unsurprising Tcl library.

[a. I do think that a SQLite library for a language where strings were represented as length-prefixed UTF-16 should translate those to/from UTF-8 where that is the chosen DB text encoding. ]

For backward compatibility reasons, if this issue is remedied in (the Tcl library form of) SQLite, it will probably require an option of some kind for the UTF-8 collation. Perhaps it should be named "Want_Standard_UTF-8".

(14) By Rowan Worth (sqweek) on 2020-08-03 05:28:55 in reply to 6

I absolutely agree that SQLite's VDBE should not be interfering with or imposing restrictions on the encoding used by client(s).

But the tcl bindings are in a perfect position to encapsulate this tcl implementation detail; arguably they should be doing so. And since the bindings are also part of the SQLite distribution, it's not unreasonable to say this is "SQLite's job".

(10) By anonymous on 2020-08-02 09:21:31 in reply to 5

That's funny, because the second Wikipedia page you cited states explicitly that

It should be used exclusively for internal processing and never for external data exchange.

So if the other source is correct, Tcl's got it the wrong way round.

(13) By Larry Brasfield (LarryBrasfield) on 2020-08-02 23:32:19 in reply to 10

I don't see how that is so. I carefully read both of the Wikipedia articles to which I linked, and also verified that Tcl acts as claimed. It uses the overlong nul form internally and outputs or inputs the canonical standard form for file (or stream) I/O. Here is some code with which that can be seen:

package require sqlite3
sqlite3 db nullch.sdb
db eval {
    create table if not exists t (
        id integer primary key,
        k integer,
        v text
    )
}
set ofh [open nullch.txt w]
# degree, rubout, nul
foreach val { ° \x7f \x00 } {
    puts $ofh $val
    db eval {
        insert into t (k,v) values (1, $val)
    }
}
close $ofh
set ifh [open nullch.txt]
while {[gets $ifh lin] >= 0} {
    db eval {
        insert into t (k,v) values (2, $lin)
    }
}
close $ifh
db close

Examination of the nullch.txt and nullch.sdb files is left as an exercise.

(15) By anonymous on 2020-08-03 06:52:11 in reply to 13

Sorry, my post was a bit short and unclear. What I meant is that CESU-8 is a non-standard form that may be used for internal representations but should not be used in external communication (according to the Wikipedia page on CESU-8, which also states that HTML expressly disallows CESU-8). So Tcl should convert to a standard encoding for external communication and use whatever it pleases internally.

For me, modified UTF-8 seems much closer to the standard (because it differs only in the non-canonical encoding of 0x00) and I felt that it would be better for Tcl to do things the other way round. But of course it's just as non-standard as the other format.

(16) By Richard Damon (RichardDamon) on 2020-08-03 12:01:55 in reply to 15

Personally, I sort of feel the opposite. CESU-8 exists because some platforms adopted UCS-2 when it was a thing; when Unicode expanded past 16 bits and UCS-2 sort of morphed into UTF-16, applications that were still thinking in UCS-2 would generate CESU-8 for non-BMP characters, seeing them as a funny thing built out of two of the units they thought of as characters. Use of CESU-8 is less likely to cause a security issue for an application that generates it, as such an application is likely processing internally as UTF-16, and the sort of issues CESU-8 raises (that code points have multiple spellings) matter little if it does not do much processing on the stream in its CESU-8/UTF-8 encoded phase. Applications that process as UTF-8 would be advised to either reject CESU-8 or clean it up very early in processing.

Modified UTF-8, on the other hand, comes about from wanting to have embedded nuls in strings that are defined as nul-terminated. The problem here is that any step that converts the format to UTF-32 or normalizes the string will suddenly find the string shortened. Any string that is capable of holding a nul character should not be stored as a nul-terminated string but as a counted string. Using the technique within an application, where you have defined that you are going to use it, is one thing; using it on external data that isn't defined to allow it is asking for problems.

It should be noted that, in this case, SQLite is designed to be able to handle strings with embedded nuls; you just need to provide it with the full length of the string. This means SQLite doesn't need this sort of trickery when the applications that use it support strings with embedded nuls and use the right API for them. It also supports the use of Modified UTF-8 for those applications that want to handle the issue that way: SQLite internally processes strings as having a specified length; it merely also allows applications to pass nul-terminated strings and ask SQLite to figure out their length.
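
For example, a sketch against the C API (the table and column names are illustrative assumptions):

/*
** Sketch: storing a text value containing an embedded NUL by giving
** sqlite3_bind_text() the explicit byte length instead of -1 (with -1
** SQLite would compute the length itself and stop at the first NUL).
*/
#include <sqlite3.h>

static int insertWithNul(sqlite3 *db){
  static const char data[] = "ab\0cd";   /* five meaningful bytes */
  sqlite3_stmt *stmt;
  int rc = sqlite3_prepare_v2(db, "INSERT INTO t(v) VALUES(?1)",
                              -1, &stmt, 0);
  if( rc!=SQLITE_OK ) return rc;
  /* Pass 5, not -1: all five bytes, including the NUL, are stored. */
  rc = sqlite3_bind_text(stmt, 1, data, 5, SQLITE_STATIC);
  if( rc==SQLITE_OK ) rc = sqlite3_step(stmt);
  sqlite3_finalize(stmt);
  return rc==SQLITE_DONE ? SQLITE_OK : rc;
}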

(17) By Peter da Silva (resuna) on 2021-03-23 23:58:28 in reply to 15

Tcl does have functions for converting to a standard encoding. It's just that very few packages, apparently including sqlite, use them.

I'm in the process of updating Pgtcl to use them.

https://www.tcl.tk/man/tcl8.6/TclLib/Encoding.htm

In my case I discovered it was converting "mathematical fraktur capital a" into "ed a0 b5 ed b4 84" instead of "F0 9D 94 84" and PostgreSQL said "ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xb5". Passing through "Tcl_UtfToExternalDString(utf8encoding..." fixed it right up.

(18) By Richard Damon (RichardDamon) on 2021-03-24 00:52:42 in reply to 17

In my case I discovered it was converting "mathematical fraktur capital a" into "ed a0 b5 ed b4 84" instead of "F0 9D 94 84" and PostgreSQL said "ERROR: invalid byte sequence for encoding "UTF8": 0xed 0xa0 0xb5". Passing through "Tcl_UtfToExternalDString(utf8encoding..." fixed it right up.

That looks like someone naively converted UTF-16 to UTF-8 and didn't handle the surrogate pairs properly. The Unicode Standard requires that surrogate pairs be processed by building the actual character and encoding that, not by encoding the two halves of the pair individually. Many programs don't do the test right, and how much you need to test depends somewhat on what sort of security model you are trying to enforce, but PostgreSQL is right to reject it.
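
For reference, a standalone illustration of the combining step the naive converter skipped:

/*
** Worked example: a UTF-16 surrogate pair must be combined into one
** code point before UTF-8 encoding it.  D835 DD04 combines to
** U+1D504 (mathematical fraktur capital A), whose correct UTF-8
** encoding is F0 9D 94 84; encoding each half separately yields the
** rejected CESU-8 bytes ED A0 B5 ED B4 84.
*/
#include <stdio.h>

int main(void){
  unsigned hi = 0xD835, lo = 0xDD04;  /* the surrogate pair */
  unsigned cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
  printf("U+%04X\n", cp);             /* prints U+1D504 */
  return 0;
}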

(12) By Poor Yorick (pooryorick) on 2020-08-02 20:06:39 in reply to 1

In a nutshell, tclsqlite.c should use Tcl_UtfToExternal on incoming values to convert the value from the Tcl internal utf-8 form to standard utf-8.
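
A rough sketch of what that could look like (a hypothetical helper, not the actual tclsqlite.c code; it uses the Tcl 8.6 encoding API):

/*
** Hypothetical helper: convert a Tcl value from Tcl's internal
** modified UTF-8 to standard UTF-8 before binding it as text.
*/
#include <tcl.h>
#include <sqlite3.h>

static int bindStandardUtf8(Tcl_Interp *interp, sqlite3_stmt *stmt,
                            int idx, Tcl_Obj *obj){
  Tcl_Encoding enc = Tcl_GetEncoding(interp, "utf-8");
  Tcl_DString ds;
  int nIn, rc;
  const char *zIn;
  if( enc==NULL ) return SQLITE_ERROR;
  zIn = Tcl_GetStringFromObj(obj, &nIn);
  /* Converts the internal C0 80 for NUL into a real 0x00 byte (and,
  ** per the Pgtcl experience above, fixes up surrogate pairs too). */
  Tcl_UtfToExternalDString(enc, zIn, nIn, &ds);
  rc = sqlite3_bind_text(stmt, idx, Tcl_DStringValue(&ds),
                         Tcl_DStringLength(&ds), SQLITE_TRANSIENT);
  Tcl_DStringFree(&ds);
  Tcl_FreeEncoding(enc);
  return rc;
}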

(19.1) By Poor Yorick (pooryorick) on 2021-08-02 12:16:47 edited from 19.0 in reply to 12

Until this issue is fixed, data stored in an SQLite database whose encoding is ostensibly utf-8 is not necessarily valid utf-8. Rather, a value stored in the database may be in Tcl's own internal representation. This works mostly fine (except for the sorting issue) as long as the database is only accessed through the Tcl interface, but other systems accessing the database without the Tcl interface will get munged results.

(20.2) By Scott Robison (casaderobison) on 2021-08-02 16:42:10 edited from 20.1 in reply to 19.1

This is because UTF-8 in SQLite is actually "a stream of bytes". If you put "garbage" into a UTF-8 field (such as the common "Modified UTF-8", which encodes NUL as C0 80), SQLite will dutifully store it and return it to you, as long as you ask for it in UTF-8. For that matter, you can store any encoding that does not embed a null in a UTF-8 text field; you are just responsible for knowing what to do with it in the end. Such is the nature of SQLite text fields.

One could write a user-defined function, say "valid_utf8", that verifies a field is truly UTF-8 and use it in check constraints if one really wants to guarantee that no ill-formed UTF-8 is injected. At this point SQLite couldn't (so it seems to me) change how it stores UTF-8 text fields without breaking someone.
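
A sketch of what such a function could look like (hypothetical code; it rejects overlong forms such as C0 80, surrogates, and anything above U+10FFFF):

/*
** Sketch of a hypothetical valid_utf8(X) SQL function: returns 1 if X
** is well-formed UTF-8, else 0.  A SQL NULL argument yields 0 here.
*/
#include <sqlite3.h>

static int isValidUtf8(const unsigned char *z, int n){
  int i = 0;
  while( i<n ){
    unsigned char c = z[i];
    unsigned int cp, min;
    int len, j;
    if( c<0x80 ){ i++; continue; }
    else if( (c&0xE0)==0xC0 ){ len = 2; cp = c&0x1F; min = 0x80; }
    else if( (c&0xF0)==0xE0 ){ len = 3; cp = c&0x0F; min = 0x800; }
    else if( (c&0xF8)==0xF0 ){ len = 4; cp = c&0x07; min = 0x10000; }
    else return 0;                 /* stray continuation or F8..FF lead */
    if( i+len>n ) return 0;        /* truncated sequence */
    for(j=1; j<len; j++){
      if( (z[i+j]&0xC0)!=0x80 ) return 0;  /* bad continuation byte */
      cp = (cp<<6) | (z[i+j]&0x3F);
    }
    if( cp<min ) return 0;                   /* overlong, e.g. C0 80 */
    if( cp>0x10FFFF ) return 0;              /* beyond Unicode range */
    if( cp>=0xD800 && cp<=0xDFFF ) return 0; /* surrogate code point */
    i += len;
  }
  return 1;
}

static void validUtf8Func(sqlite3_context *ctx, int argc,
                          sqlite3_value **argv){
  const unsigned char *z = sqlite3_value_text(argv[0]);
  int n = sqlite3_value_bytes(argv[0]);
  (void)argc;
  sqlite3_result_int(ctx, z!=0 && isValidUtf8(z, n));
}

/* Registration and use (illustrative):
**   sqlite3_create_function(db, "valid_utf8", 1, SQLITE_UTF8, 0,
**                           validUtf8Func, 0, 0);
**   CREATE TABLE t2(v TEXT CHECK(valid_utf8(v)));
*/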

(21) By Keith Medcalf (kmedcalf) on 2021-08-02 20:16:30 in reply to 20.2

One could write a user-defined function, say "valid_utf8", that verifies a field is truly UTF-8 and use it in check constraints

The sequence C0 80 is perfectly valid UTF-8 encoding.
That you may not like that fact is immaterial.

(22) By RandomCoder on 2021-08-02 20:31:23 in reply to 21

I had thought that's no longer true:

Before the Unicode Standard, Version 3.1, the problematic "non-shortest form" byte sequences in UTF-8 were those where BMP characters could be represented in more than one way. These sequences are ill-formed, because they are not allowed by Table 3-7.

The example in the spec specifically calls out "C0" as an invalid first byte in a sequence:

The byte sequence C0 AF is ill-formed, because C0 is not well-formed in the "First Byte" column.

Or, is C0 80 as a replacement for 00 a special case?

(23) By Tim Streater (Clothears) on 2021-08-02 20:47:59 in reply to 21

Errm, no. From the Wikipedia article on UTF-8:

https://en.wikipedia.org/wiki/UTF-8

"The standard specifies that the correct encoding of a code point uses only the minimum number of bytes required to hold the significant bits of the code point. Longer encodings are called overlong and are not valid UTF-8 representations of the code point. This rule maintains a one-to-one correspondence between code points and their valid encodings, so that there is a unique valid encoding for each code point. This ensures that string comparisons and searches are well-defined."

The minimum number of bytes to encode a null is one.

(24) By Keith Medcalf (kmedcalf) on 2021-08-02 21:09:28 in reply to 23

That refers to "encoding", not "decoding".

That two different byte sequences may produce the same result upon decoding by the same decoder, and that the choice of sequence is up to the encoder, means that it is the "encoder" that has a problem.

The fact that a sequence B, when decoded using method X, produces output A (which is not distinguishable from the original input) proves that B is a valid encoding of A. The decoder cannot possibly know otherwise unless it re-encodes A with a different compliant encoder X1 to produce a sequence B1 different from B. That would be verification of the contents of the encoded text rather than merely the syntax of the encoding.

It is not the job of the decoder to validate the contents of the stream it is processing.

(25) By Larry Brasfield (larrybr) on 2021-08-02 21:32:11 in reply to 24

I believe Tim was responding to your assertion, "The sequence C0 80 is perfectly valid UTF-8 encoding."

It does follow from a mechanistic application of the encoding rules for multi-byte sequences. But the standard cited in the Wikipedia article, and the objective of keeping a one-to-one mapping between code points and UTF-8 encodings, are violated and defeated by using 0xC0 0x80 to encode 0x00. The standard clearly calls for avoidance of "overlong" encodings, in service of the unique-mapping objective, and 0xC0 0x80 is clearly in that category.

This point is unrelated to how people might feel about that standard.

(26) By Scott Robison (casaderobison) on 2021-08-02 22:49:30 in reply to 25

+1

(27) By Scott Robison (casaderobison) on 2021-08-02 22:53:16 in reply to 21

No, it is an overlong sequence which is explicitly forbidden by the UTF-8 standard. As are any attempts at encoding any character with a code point above U+10FFFF.

BUT! Those are potentially useful encodings, and I think the SQLite library's policy of allowing arbitrary streams of bytes (that don't include NUL) to be inserted into a TEXT column is useful to many people. To suddenly begin enforcing strict UTF-8 compliance would be a disservice to the community, especially when people who want strict UTF-8 compliance can achieve it in other ways because of just how flexible SQLite is.

(28) By ddevienne on 2021-08-03 06:59:01 in reply to 27

I don't disagree, but at the same time, since SQLite already has code for UTF-8, it would be useful to have an is_utf8() built-in function for those who want to opt in to strict UTF-8 enforcement using a CHECK constraint.

(34) By Scott Robison (casaderobison) on 2021-08-03 14:20:27 in reply to 28

I said much the same in different words above. It would be nice for SQLite to expose that, but anyone who needs that can do it today.

(29) By Poor Yorick (pooryorick) on 2021-08-03 08:40:05 in reply to 27

Where is this policy of accepting an arbitrary sequence of bytes for the "text" storage class stated? Doesn't it contradict SQLite's own documentation, which provides the following definition of the "text" storage class?

Text. The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE)

Isn't the whole point of the "blob" storage class to have a place to hold values that aren't valid for the "text" class? Wouldn't the appropriate interpretation of a value that is not valid in the encoding of the database then be "blob"? Why even have a "text" storage class if that's not what it is? Alternatively, why have a "blob" storage class if "text" accepts arbitrary values?

So SQLite doesn't allow the null character in a text value? If so, that means SQLite is not capable of storing all possible Unicode strings. I had supposed this was one of the big advantages of SQLite over, i.e. PostgreSQL.

(30) By Gunter Hick (gunter_hick) on 2021-08-03 09:32:55 in reply to 29

TEXT means you are supposed to store valid utf-whatever and SQLite will return exactly what you stored, up to and including the first NUL character (in the original meaning of a single byte with value zero). You can also request translation to some other utf encoding based on the premise that what you stored corresponds to the label you provided.

BLOB means you can store anything and SQLite will return exactly what you stored.

If you store something as BLOB and request a cast to TEXT, SQLite will return everything up to and including the first NUL character (see above), or, if there is no NUL character, append one.

You are storing something that is NOT VALID utf-8, but MUTF-8 - probably from Java - in a box labeled utf-8 and then complaining that it does not do what you expect.

If you insist on storing MUTF-8 then use a blob field, although I don't expect that will change the sort order: BLOB fields sort by the byte representation, and since 0x7F (UTF-8 for U+007F) < 0xC0 0x80 < 0xC2 0x80 (UTF-8 for U+0080), the sort order will not change.

Solving the ordering problem will require writing and loading your own collation function.

(32) By Poor Yorick (pooryorick) on 2021-08-03 10:27:23 in reply to 30

I don't insist on anything, and the Tcl script in question is not doing anything other than using the interface distributed along with SQLite to access a database and store values using variable substitutions performed by SQLite. I'm working in a language where every value is a Unicode string, and I am storing these strings in the database. I'd like these stored values to be useful regardless of which SQLite interface is used on which platform to access the database.

Neither the SQLite documentation for the Tcl interface nor the SQLite documentation itself provides any clue that some non-standard, Tcl-internal utf representation is what ends up stored as "text", and there's a reason for that: it's a bug. The issue here is that the SQLite Tcl interface is not doing its job. It is the component that is in the position to translate the Tcl internal representation of a value into standard utf-8. The Tcl C API provides routines to do that.

If text strings in SQLite may not include the null character, that should be clearly documented, and the definition of "text" that is not true should be replaced with one that is.

(33) By anonymous on 2021-08-03 11:59:50 in reply to 32

http://www.sqlite.org/c3ref/bind_blob.html

states:

If any NUL characters occurs at byte offsets less than the value of the fourth parameter [of the sqlite3_bind_textXXX functions] then the resulting string value will contain embedded NULs. The result of expressions involving strings with embedded NULs is undefined.

An old post from 2016, https://www.mail-archive.com/sqlite-users@mailinglists.sqlite.org/msg98258.html, shows that some builtin functions treat embedded nul chars as end of string. Others do not.

I guess it is safe to store and retrieve text with embedded nuls. But not to "calculate" with them.

(35) By Scott Robison (casaderobison) on 2021-08-03 14:24:10 in reply to 29

It may violate the documentation; I don't have chapter and verse that documents otherwise. But given the value the SQLite team places on backward compatibility, I suspect it is unlikely that this "bug" or "feature" (depending on your point of view) will change, as doing so would break untold numbers of applications that rely on the behavior. It is for the same reason that INTEGER PRIMARY KEY NULL handling has never been changed, and that FOREIGN KEY enforcement is off by default...

(36) By Tim Streater (Clothears) on 2021-08-03 17:15:12 in reply to 29

"I had supposed this was one of the big advantages of SQLite over, i.e. PostgreSQL."

Perhaps you meant e.g. instead of i.e.

i.e. - that is
e.g. - for example

(37) By Poor Yorick (pooryorick) on 2021-08-03 18:01:29 in reply to 36

Yes, I did.

(31) By Rowan Worth (sqweek) on 2021-08-03 10:10:00 in reply to 12

With the amount of time spent so far explaining the internal workings of sqlite's utf8 handling (which is not under fire in this year-old thread), I estimate that the tcl sqlite extension could have been patched at least 172 times so that it no longer inserts modified utf8.

The only complication is that changing the serialisation format may constitute a breaking change to the extension's API, or at least a weird, subtle surprise for any tcl users who have existing databases that contain modified utf8 strings with embedded NUL characters, as a round trip from DB (modified utf8) -> tcl -> Tcl_UtfToExternal -> DB (standard utf8) will change the length of such data cells.

Then there's the question of what happens when someone runs an old version of the tcl extension against a DB containing the new serialisation format...