SQLite Forum

null character sorts greater than 0x7f and less than 0x80

(1) By Poor Yorick (pooryorick) on 2020-08-01 08:19:38

The output of the following script is 127 0 128 rather than the expected 0 127 128.

#! /usr/bin/env tclsh
package require sqlite3
sqlite3 db :memory:

db eval {
    create table t (
        v text
    )
}

# use [list ...] so the backslash escapes become the actual characters
foreach val [list \x80 \x7f \x00] {
    # no-op read, apparently to force generation of the string representation
    string range $val 0 end
    #puts [::tcl::unsupported::representation $val]
    db eval {
        insert into t values ($val)
    }
}

set sorted {}
db eval {
    select v from t order by v collate binary asc
} {
    scan $v %c ord
    lappend sorted $ord
}

puts $sorted

The correct output would be 0 127 128, wouldn't it?

-- Yorick

(2) By Richard Hipp (drh) on 2020-08-01 18:40:41 in reply to 1

Apparently TCL is encoding the \x00 character as two bytes: 0xC0 0x80. You can see this if you do:

  puts [db eval {
    SELECT rowid, quote(CAST(v AS BLOB)) FROM t;
  }]
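
For the table built by the OP's script (and assuming that insertion order), the output should look something like the following; the X'C080' entry is the nul row:

  1 X'C280' 2 X'7F' 3 X'C080'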

(3) By Tim Streater (Clothears) on 2020-08-01 19:41:07 in reply to 2

That's non-standard UTF-8, as described here:

https://en.wikipedia.org/wiki/UTF-8

See the section on Overlong Encodings. The section following that one also explains why overlong encodings are not a good idea.

(4) By Richard Damon (RichardDamon) on 2020-08-01 20:34:33 in reply to 3

But also see Section 4.2 (Modified UTF-8) for why it is also a 'standard' variation of UTF-8: it allows a NUL character to be inserted into zero-terminated strings without being taken as the end-of-string code, which is what was attempted here.
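
A minimal sketch of that acceptance, assuming Tcl 8.6's lenient decoder (stricter decoders may reject overlong forms):

# decode the overlong two-byte form; Tcl 8.6 accepts it as U+0000
set ch [encoding convertfrom utf-8 \xC0\x80]
puts [scan $ch %c]        ;# prints 0
puts [string length $ch]  ;# prints 1: a real character, not a terminator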

(5) By Larry Brasfield (LarryBrasfield) on 2020-08-01 21:28:19 in reply to 4

Modified UTF-8, which describes the modification as representing the null byte as "the two-byte overlong encoding 11000000 10000000 (hexadecimal C0 80)", says that Tcl uses that overlong encoding for internal representation of Unicode data but uses strict CESU-8 (which "does not have the special encoding of the NUL character") for external data. If these Wikipedia-claimed facts are true, then I think that the TCL SQLite library should be storing the null byte (however it may be represented internally within the TCL runtime engine) as a simple null byte when UTF-8 is the chosen encoding. For that reason, I think the OP's report indicates a SQLite bug because, in effect, it causes TCL with the SQLite library to deviate from that implied promise.

Of course, this is merely a semantic analysis; it says nothing about the merits of having the SQLite representation of a null byte act like a TCL internal representation.

(6) By Richard Damon (RichardDamon) on 2020-08-01 23:30:00 in reply to 5

I am not sure it is SQLite's job to decide which variation(s) of UTF-8 to allow, or whether it should require applications to use strictly 'standard' UTF-8. CESU-8 is a separate variation (it encodes non-BMP characters as the encodings of the two surrogates instead of as the one single code point). That variation is basically orthogonal to Modified UTF-8, which deals only with the encoding of the code point 0. Maybe you could say that TCL is using "Modified CESU-8".

(7) By Larry Brasfield (LarryBrasfield) on 2020-08-02 00:04:47 and edited on 2020-08-02 00:06:10 in reply to 6

If I have interpreted the cited Wikipedia article assertions correctly, Tcl uses the overlong null character representation internally but the canonical, single-byte representation for external data.

Maybe you could say that TCL is using "Modified CESU-8".

I might say that, if it were true. But it is not.

At least on Windows, Tcl 8.6 uses the overlong null representation in string variables for the \x00 character but writes it out (via puts) using the single-byte representation conforming to unmodified CESU-8. I have demonstrated this with Magicsplat Tcl/Tk for Windows, version 8.6.10, with stdout redirected to a file to avoid console weirdness.

I maintain that, since SQLite serves as an alternative way of storing or retrieving "external data", it should do the same translation between Modified UTF-8 internal data and either plain CESU-8 (which uses the short, canonical null representation) or standard UTF-8 (which also uses the short representation) for external data. [a]

[a. The difference between CESU-8 and UTF-8 lies in how code points beyond the BMP (above U+FFFF) are represented; it is immaterial to this discussion. ]

This would, in my opinion, make SQLite adhere more closely to Tcl's design intention. I suspect that the overlong representation is used because it simplifies Tcl's internal operations. If the representation never escapes the internal representation (except by purposeful, debug-like commands), then it is purely an implementation detail. I think the OP was rightfully surprised.

(9) By Richard Damon (RichardDamon) on 2020-08-02 00:56:10 in reply to 7

If, for internal storage, TCL converts strings with embedded nulls into Modified UTF-8 so that it can handle text with embedded nulls without the workaround otherwise needed to force SQLite to accept an embedded nul in a string, that is TCL's business, and it is TCL's responsibility to convert it back on reading.

If SQLite doesn't make the change going in, it shouldn't make the change coming out.

It doesn't matter that SQLite has a relationship with TCL; it is not dedicated to TCL, so it should not be doing TCL's job for it, or it might break other applications' use of this same encoding trick. Why should SQLite be bent to meet some other language's design intent?

Now, one side effect of using this encoding is that the simple collation sequence that SQLite uses by default will sort the value 0 between 0x7F and 0x80. If TCL thinks this error is significant, it can either not use the overlong encoding that causes the problem or provide a collation that handles it correctly, as sketched below. (Maybe it could be argued that SQLite could provide such a collation if the expense isn't that high.)
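
A minimal sketch of the second option (the collation name codepoint is made up here): the Tcl binding's db collate method registers a script-level collation, and since the binding hands stored text back to Tcl, whose decoder turns the overlong C0 80 back into the \x00 character, a plain code-point comparison sorts the nul first again:

db collate codepoint {string compare}

set fixed {}
db eval {
    select v from t order by v collate codepoint
} {
    scan $v %c ord
    lappend fixed $ord
}
puts $fixed   ;# expected: 0 127 128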

(10) By anonymous on 2020-08-02 06:35:28 in reply to 9

(Maybe it could be argued that SQLite could provide such a collation if the expense isn't that high.)

Perhaps as an extension, which would be included with the Tcl interface (but could also be used independently if wanted). (Probably it need not be built into SQLite.)

(12) By Larry Brasfield (LarryBrasfield) on 2020-08-02 16:13:39 in reply to 9

If, for internal storage, TCL converts strings with embedded nulls into Modified UTF-8 so that it can handle text with embedded nulls without the workaround otherwise needed to force SQLite to accept an embedded nul in a string, that is TCL's business, and it is TCL's responsibility to convert it back on reading.

I agree mostly, but you are only restating what I said earlier except for the "force SQLite" part. That is where things become interesting and less clear.

If SQLite doesn't make the change going in, it shouldn't make the change coming out.

That is either an assertion of the obvious ("round-tripping should preserve data"), or begging the question. As I see it, the issue (that we are trying to get at) is whether SQLite (or any other Tcl library which is implemented such that it gets Tcl string objects in Tcl's internal form) should convert that internal form into the external (standard UTF-8) form on output and convert such external form back into the internal form on input. I think the answer is "clearly yes."

It doesn't matter that SQLite has a relationship with TCL; it is not dedicated to TCL, so it should not be doing TCL's job for it, or it might break other applications' use of this same encoding trick. Why should SQLite be bent to meet some other language's design intent?

This gets to the crux of the matter, I think. I am not saying what SQLite should do generally except insofar as the principles [a] extend to other environments. The question here is what should the Tcl library known as sqlite3 do? For efficiency reasons, Tcl nowadays has internal forms rather than treating everything as strings. And also for efficiency reasons, SQLite for Tcl implements interfaces that operate below the ostensible "everything is a string" level, and so naturally those interfaces traffic in the internal forms. I believe that cross-language database compatibility considerations work to make it desirable for the Tcl SQLite library to perform internal/external form conversions. This has nothing to do with the origin of SQLite as a Tcl library and everything to do with its Tcl library build being a useful and unsurprising Tcl library.

[a. I do think that a SQLite library for a language where strings were represented as length-prefixed UTF-16 should translate those to/from UTF-8 where that is the chosen DB text encoding. ]

For backward compatibility reasons, if this issue is remedied in (the Tcl library form of) SQLite, it will probably require an option of some kind for the UTF-8 collation. Perhaps it should be named "Want_Standard_UTF-8".

(15) By Rowan Worth (sqweek) on 2020-08-03 05:28:55 in reply to 6

I absolutely agree that SQLite's VDBE should not be interfering with or imposing restrictions on the encoding used by client(s).

But the tcl bindings are in a perfect position to encapsulate this tcl implementation detail; arguably they should be doing so. And since the bindings are also part of the SQLite distribution, it's not unreasonable to say this is "SQLite's job".

(11) By anonymous on 2020-08-02 09:21:31 in reply to 5

That's funny, because the second Wikipedia page you cited states explicitly that

It should be used exclusively for internal processing and never for external data exchange.

So if the other source is correct, Tcl's got it the wrong way round.

(14) By Larry Brasfield (LarryBrasfield) on 2020-08-02 23:32:19 in reply to 11

I don't see how that is so. I carefully read both of the Wikipedia articles to which I linked, and also verified that Tcl acts as claimed: it uses the overlong nul form internally and reads or writes the canonical standard form for file (or stream) I/O. Here is some code with which that can be seen:

package require sqlite3
sqlite3 db nullch.sdb
db eval {
    create table if not exists t (
        id integer primary key,
        k integer,
        v text
    )
}

set ofh [open nullch.txt w]
# degree, rubout, nul
foreach val [list ° \x7f \x00] {
    puts $ofh $val
    db eval {
        insert into t (k,v) values (1, $val)
    }
}
close $ofh

set ifh [open nullch.txt]
while {[gets $ifh lin] >= 0} {
    db eval {
        insert into t (k,v) values (2, $lin)
    }
}
close $ifh
db close

Examination of the nullch.txt and nullch.sdb files is left as an exercise.

(16) By anonymous on 2020-08-03 06:52:11 in reply to 14

Sorry, my post was a bit short and unclear. What I meant is that CESU-8 is a non-standard form that may be used for internal representations but should not be used in external communication (according to the Wikipedia page on CESU-8, which also states that HTML expressly disallows CESU-8). So Tcl should convert to a standard encoding for external communication and use whatever it pleases internally.

For me, modified UTF-8 seems much closer to the standard (because it differs only in the non-canonical encoding of 0x00) and I felt that it would be better for Tcl to do things the other way round. But of course it's just as non-standard as the other format.

(17) By Richard Damon (RichardDamon) on 2020-08-03 12:01:55 in reply to 16

Personally, I sort of feel the opposite. CESU-8 exists because some platforms adopted UCS-2 when it was a thing; when Unicode expanded past 16 bits and UCS-2 sort of morphed into UTF-16, applications that were still thinking in UCS-2 would generate CESU-8 for non-BMP characters, seeing them as a funny thing built out of two of the things they thought of as characters. Use of CESU-8 is less likely to cause a security issue for an application that generates it, as such an application will likely be processing internally as UTF-16 and will tend not to do much processing on the stream in its CESU-8/UTF-8 encoded form, where the CESU-8 issue (that code points have multiple spellings) arises. Applications that process as UTF-8 would be advised to either reject CESU-8 or clean it up very early in processing.

Modified UTF-8, on the other hand, comes about from wanting to have embedded nuls in strings that are defined as nul-terminated. The problem here is that any step that converts the format to UTF-32 or normalizes the string will suddenly find the string shortened. Any string that is capable of holding a nul character should not be stored as a nul-terminated string but as a counted string. Deciding within an application that you are going to use the technique is one thing, but using this method on external data that isn't defined to use it is asking for problems.

It should be noted that, in this case, SQLite is designed to be able to handle strings with embedded nuls in them; you just need to provide it with the full length of the string. This means that SQLite doesn't need this sort of trickery if the applications that use it support strings with embedded nuls and use the right API for them, and it also supports the use of Modified UTF-8 for those applications that want to handle the issue that way. SQLite internally processes strings as having a specified length; it merely also allows applications to pass nul-terminated strings and ask SQLite to figure out their length.

(13) By Poor Yorick (pooryorick) on 2020-08-02 20:06:39 in reply to 1

In a nutshell, tclsqlite.c should use Tcl_UtfToExternal on incoming values to convert them from Tcl's internal UTF-8 form to standard UTF-8.
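
At script level, encoding convertto performs essentially the same internal-to-external conversion that Tcl_UtfToExternal exposes at the C level, so the intended effect can be previewed like this (a hedged sketch, assuming Tcl 8.6 semantics):

set val \x00
set bytes [encoding convertto utf-8 $val]  ;# internal C0 80 -> external 00
binary scan $bytes H* hex
puts $hex   ;# expected: 00, not c080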