TEXT values can contain arbitrary binary data

(1) By Roy (roywellington) on 2021-03-15 03:43:57 [link] [source]

It appears possible to craft text values (values where typeof() returns 'text') that contain arbitrary binary data.

First, the encoding:

sqlite> PRAGMA encoding;
encoding
--------
UTF-8

Let's create a dummy table:

sqlite> CREATE TABLE test_table (
   ...>   some_text text NOT NULL,
   ...>   CHECK(typeof(some_text) == 'text')
   ...> );

The CHECK constraint isn't necessary for the behavior I'm about to show, but I put it there to convince you that the table won't store binary data.

You can see that SQLite honors the CHECK constraint:

sqlite> INSERT INTO test_table VALUES (x'ff');
Error: CHECK constraint failed: typeof(some_text) == 'text'

Now for the funny part:

sqlite> INSERT INTO test_table VALUES (x'ff' || 'bork');
sqlite>

That works. But what got inserted? (Since, after all, the blob on the LHS of that || could have been coerced.)

sqlite> SELECT typeof(some_text), some_text FROM test_table;
typeof(some_text)  some_text
-----------------  ---------
text               �bork

That's not a literal replacement character, either; that's my terminal replacing the (malformed, to the terminal) raw bytes that made it all the way to it. If I inspect this value in a higher-level language,

row.get_raw(0) = Text([255, 104, 101, 108, 108, 111, 44, 32, 98, 48, 114, 107])
row.get::<_, String>(0) = Err(FromSqlConversionFailure(0, Text, Utf8Error { valid_up_to: 0, error_len: Some(1) }))

(Note: here I inserted a different string, so the ASCII part differs.) We can see that we do get a Text back (which is what the CLI says, too) but that it isn't UTF-8: it starts with a literal 0xff byte.

First, https://www.sqlite.org/datatype3.html says,

TEXT. The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE).

Which I read to mean that text must be text, specifically, it must be valid Unicode, since it will be stored with one of the three listed encoding.

https://sqlite.org/lang_expr.html states,

The || operator is "concatenate" - it joins together the two strings of its operands.

I think the operative bit here is the "strings of its operands", meaning it coverts its operands to strings. So the question, then, is how does BLOB get converted into a string? §4.2 of Datatypes seems to answer that:

BLOB values are converted into TEXT values by simply interpreting the binary BLOB context as a text string in the current database encoding.

While that sort of answers it, it's both very surprising behavior (text values aren't necessarily text!) and it seems to conflict with the earlier bit from the same page: how is a text value containing a 0xff bytes stored in UTF-8? Is this intentional? (And if it is, I think the part that defines the TEXT type should be more explicit about it not actually representing text.)

(I stumbled on this mostly by accident, trying to coerce a text into a blob, to make an INSERT easier.)

(2) By Gunter Hick (gunter_hick) on 2021-03-15 07:00:45 in reply to 1 [link] [source]

This is a good example of the GIGO (Garbage In, Garbage Out) rule. SQLite is very faithfully reproducing the exact garbage string you gave it.

SQLite expects text values to be encoded in the database encoding. Period. It does not check for valid encoding. It is your responsibility to provide valid data, especially if you expect to convert between encodings later using SQLite.

(5) By Keith Medcalf (kmedcalf) on 2021-03-15 13:58:40 in reply to 2 [link] [source]

SQLite expects text values to be encoded in the database encoding.

This is incorrect.

SQLite3 expects that incoming string values will correspond to the constraints which you the programmer have specified apply to the value so passed as regards to the encoding (UTF-8 or UTF-16 depending on the API call used), and that the value is a properly formatted C String (a bunch of non-NULL codepoints followed by a NULL codepoint terminator).

SQLite3 will then return string values in the format corresponding to the API used by the programmer to retrieve the text string (UTF-8 or UTF-16), as a C String (a bunch on non-NULL codepoints followed by a NULL codepoint).

If the API used to "send" the value to SQLite3 specifies a different encoding than the database encoding, then SQLite3 will translate the "value" for storage.

Similarly, if the API used to "retrieve" the value from SQLite3 specified a different encoding that the database encoding, then SQLite3 will translate the "value" for presentement to the user.

If the API used to "send", the database encoding, and the API used to "retrieve" indicate the same encoding, then SQLite3 does not inspect or molest the data in any way (it has no need to do so).

The only difference between a blob and text is that text has an encoding specified by the user which may be used to provide semantic meaning to the bag-o-bytes in the event that this information is required (such as translating encodings, applying string functions, et al).

If one wishes to ensure that a text field contains only validly encoded text then one may write a user-defined function for use in a check constraint. If you do not, SQLite3 believes and relies on the fact that you are telling the truth and if you are not, then the GIGO principle applies.

(7) By Roy (roywellington) on 2021-03-15 23:49:49 in reply to 5 [link] [source]

If you do not, SQLite3 believes and relies on the fact that you are telling the truth

While I understand that this is a stance a project might take, part of the point of my original post is that I think this is not properly documented, if it is the stance SQLite desires. (Vs. considering storing non-text in text columns a bug.) I think where the docs state,

TEXT. The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE).

This is empirically not true: it simply isn't possible to store 0xff in UTF-8. If SQLite is literally storing 0xff <the rest of my query's string, in UTF-8>, I think that's okay, just that the docs should be more forthcoming that SQLite doesn't validate the data stored in it.

I also dug into, a bit, the docs around the actual storage format; they state,

Value is a string in the text encoding and (N-13)/2 bytes in length. The nul terminator is not stored.

"the text encoding" is the PRAGMA encoding; of the SQLite file, in my case, UTF-8. But, again, empirically, we seem to have managed to store something that isn't UTF-8 there.

Similarly, if the API used to "retrieve" the value from SQLite3 specified a different encoding that the database encoding, then SQLite3 will translate the "value" for presentement to the user.

I want to make sure something else is clear, around this point. I submit queries to SQLite, which are text strings in some encoding. In the example I presented, the query is entirely ASCII, and thus also valid UTF-8. It ends up computing and storing a value that isn't UTF-8.

(The other stance being that this behavior is a bug. I personally think that's the far more useful stance, to a user of the library, as it catches errors and prevents further nonsense from occurring down the road. But I acknowledge that SQLite might disagree with that; but then I think the invariants, or lack thereof, should be clearly documented.)

(3) By J. King (jking) on 2021-03-15 07:01:30 in reply to 1 [link] [source]

While I agree it's a surprising result the documentation for expressions also states in part:

The result of any binary operator is either a numeric value or NULL, except for the || concatenation operator which always evaluates to either NULL or a text value.

Thus concatenating any two non-null values will always yield text even if the value is invalid for the database encoding.

(4) By Ryan Smith (cuz) on 2021-03-15 09:44:21 in reply to 1 [source]

To add to the other posters very correct replies, I'd like to point out that this is true of most (maybe even all) DB engines.

If in MySQL you do the same thing:

CREATE TABLE IF NOT EXISTS test_table (
  some_text text NOT NULL
);

INSERT INTO test_table VALUES (CONCAT(0xff, 'bork'));

SELECT some_text FROM test_table;

You get pretty much the same stuff back: some_text

�bork

(or, if you run it in a thing that can display those characters): some_text

ÿbork

(6) By Roy (roywellington) on 2021-03-15 23:36:37 in reply to 4 [link] [source]

No, it is definitely not true of most databases.

E.g., in PostgreSQL, you'll receive this error:

postgres=# CREATE TABLE foo ( some_text text NOT NULL );
CREATE TABLE

postgres=# INSERT INTO foo VALUES (E'\xff' || 'test');
ERROR:  invalid byte sequence for encoding "UTF8": 0xff

postgres=# INSERT INTO foo VALUES (E'\x61' || 'test');
INSERT 0 1
postgres=# select * from foo;
 some_text 
-----------
 atest
(1 row)

MariaDB, which I'm using for MySQL here, but I believe MySQL proper behaves similarly,

MariaDB [(none)]> CREATE DATABASE mydatabase CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
Query OK, 1 row affected (0.001 sec)

MariaDB [(none)]> USE mydatabase;
Database changed
MariaDB [mydatabase]> CREATE TABLE foo ( some_txt text NOT NULL );
Query OK, 0 rows affected (0.125 sec)

MariaDB [mydatabase]> INSERT INTO foo VALUES (X'ff');
ERROR 1366 (22007): Incorrect string value: '\xFF' for column `mydatabase`.`foo`.`some_txt` at row 1
MariaDB [mydatabase]> INSERT INTO foo VALUES (X'31');
Query OK, 1 row affected (0.098 sec)

MariaDB [mydatabase]> select * from foo;
+----------+
| some_txt |
+----------+
| 1        |
+----------+
1 row in set (0.000 sec)

Now, MySQL has some legacy baggage; last I checked, the default encoding is still latin1, and in that encoding, any byte sequence is a valid string. (There are no malformed sequences of bytes, which isn't true in UTF-8.)

Your exact example, in MariaDB:

MariaDB [mydatabase]> CREATE TABLE IF NOT EXISTS test_table (
    ->   some_text text NOT NULL
    -> );
Query OK, 0 rows affected (0.032 sec)

MariaDB [mydatabase]> INSERT INTO test_table VALUES (CONCAT(0xff, 'bork'));
ERROR 1366 (22007): Incorrect string value: '\xFFbork' for column `mydatabase`.`test_table`.`some_text` at row 1

But again, the database here is configured to be UTF-8. Being configurable to a different encoding is different from the observed SQLite behavior, too: MySQL/MariaDB will encoding/decode appropriately, but it's still something it is very much aware of, in order to do so. (It's important for certain operations, like string comparison.)

(8) By Ryan Smith (cuz) on 2021-03-16 02:11:53 in reply to 6 [link] [source]

I think I missed your original point just a tad. I thought you were confused about why SQLite will store a BLOB as text when you ask it to, and then keep on treating it like text.

I then went on to prove other DBs do it too, but I now realize your confusion is more precisely about the BLOB being not valid UTF8 whilst you have set the pragma encoding to be specifically UTF8.

All I can say (to reiterate what others have said): SQLite doesn't care.

While the SQLite connection can speak UTF8, and can store UTF8 text, it can also still store ANY bytes, it places Zero requirement on a Text column to only store UTF8-happy text. In the end, it's all just bytes.
(I won't go into reasons - this topic is rehashed on the forum ad-infinitum, you can find several long dissertations in just the recent history [search bag-'o-bytes for fun], so I will stop here and simply assert that it is so.)

Put another way, if I ask you to read to me from a Russian book, but translate it to French on the fly so I can understand, it will ONLY work as long as the book contains valid Russian words. If the book suddenly starts containing nonsense garbled letters, or perhaps valid Chinese words, then no amount of Russian-to-French translation will help me, no matter how good of a translator you are. If the Author of the book however strictly ensured that only valid Russian words could be placed in the book, then one could fairly expect a fully comprehensible translation - but SQLite is not such an author, and it believes (unwisely perhaps) that you are the Author-in-chief.