CREATE TABLE STRICT allows malformed utf-8 TEXT

(1) By anonymous on 2024-07-18 16:59:50 [link] [source]

I have a C code snippet here, I expected that sqlite3_bind_text would fail with SQLITE_FORMAT or some other error, but instead it succeeds. The byte array written as "xffxffxff" is not utf-8, the documentation of the type TEXT says that only utf-8 or utf-16 with some endianness is allowed. https://sqlite.org/datatype3.html

Testing a SELECT statement with the sqlite3 shell command shows that indeed, in ./db there's this "xffxffxff" value written (viewed with a hexview tool)

Is this a bug? If no, how do I make the bind raise an error for the malformed utf-8? ```

#include <stdio.h>

#include <sqlite3.h>

int main() { sqlite3* conn; int status = sqlite3_open("./db", &conn);

    if (status != SQLITE_OK)
    {
            printf("open failed\n");
            return 1;
    }

    char* errMsg;

    status = sqlite3_exec(conn, "CREATE TABLE t1(value TEXT) STRICT", NULL, NULL, &errMsg);

    if(status != SQLITE_OK)
    {
            printf("create table failed: %s\n", errMsg);
            return 1;
    }

    sqlite3_stmt* statement;

    const char* errMsgConst;
    status = sqlite3_prepare_v2(conn, "INSERT INTO t1 VALUES(?) ", -1, &statement, &errMsgConst);

    if(status != SQLITE_OK)
    {
            printf("create statement failed: %s\n", errMsgConst);
            return 1;
    }

    status = sqlite3_bind_text(statement, 1, "\xff\xff\xff", 3, SQLITE_STATIC);

    if(status != SQLITE_OK)
    {
            printf("bind failed: %s\n", errMsg);
            return 1;
    }

    status = sqlite3_step(statement);

    if(status != SQLITE_DONE)
    {
            printf("step failed: %s\n", errMsg);
            return 1;
    }



    //leak everything

} ```

(2) By Stephan Beal (stephan) on 2024-07-18 17:09:47 in reply to 1 [link] [source]

Is this a bug? If no, how do I make the bind raise an error for the malformed utf-8?

The API docs say:

If the third parameter to sqlite3_bind_text() is not NULL, then it should be a pointer to well-formed UTF8 text.

It is up to the caller to ensure that it is well-formed.

It also says:

The sixth argument to sqlite3_bind_text64() must be one of SQLITE_UTF8, SQLITE_UTF16, SQLITE_UTF16BE, or SQLITE_UTF16LE to specify the encoding of the text in the third parameter. If the sixth argument to sqlite3_bind_text64() is not one of the allowed values shown above, or if the text encoding is different from the encoding specified by the sixth parameter, then the behavior is undefined.

Noting that bind_text() and bind_text64() are both thin wrappers around the same functionality, and bind_text() implicitly passes SQLITE_UTF8 to the internal implementation, so has the same undefined results if its input is not UTF-8.

(3) By anonymous on 2024-07-18 17:18:22 in reply to 2 [link] [source]

So, TEXT can hold arbitrary data, and users must check the bytes when reading from the database if someone somewhere demands utf-8? the sqlite3 engine is working exactly as intended?

(4) By Stephan Beal (stephan) on 2024-07-18 17:26:59 in reply to 3 [link] [source]

So, TEXT can hold arbitrary data, and users must check the bytes when reading from the database if someone somewhere demands utf-8?

Ideally the programmer would check it, if it's relevant, before stuffing it in the db.

the sqlite3 engine is working exactly as intended?

Yes. The intent is exactly what the docs say: the caller is expected to ensure that when they tell SQLite "this is UTF-8" (by passing text to sqlite3_bind_text()) that it really is UTF-8.

SQLite doesn't generally care what the encoding is until/unless it's asked to use a different encoding later, at which point it will misbehave if the text in question is not in the encoding which the programmer told it the text was in.

(5) By anonymous on 2024-07-18 17:33:52 in reply to 4 [link] [source]

Ok, so if this is how TEXT works, and only if someone asks for encoding then checks will be made, what is the difference between BLOB and TEXT?

From the docs I can see that BLOB has some memory-efficient ways to edit values from a table

sqlite3_blob_bytes sqlite3_blob_close sqlite3_blob_open sqlite3_blob_read sqlite3_blob_reopen sqlite3_blob_write

which the type TEXT doesn't have available, the entire TEXT value needs to be kept/loaded/saved in memory.

Are there any other important differences between TEXT and BLOB?

(7) By Richard Hipp (drh) on 2024-07-18 17:42:01 in reply to 5 [link] [source]

A few differences between TEXT and BLOB:

If you insert text as UTF8 and later ask for UTF16, then SQLite will do the conversion for you. Such conversions do not happen for BLOBs.
The "length()" function returns the number if characters for TEXT, but the number of bytes for a BLOB.
TEXT can be compared using various application-defined comparison functions. The comparison function for BLOB is always memcmp().

There are other differences. Those are just the first three that come to mind.

(15) By anonymous on 2024-07-19 13:22:53 in reply to 7 [link] [source]

I see. Thank you for the explanation. I believe this answers everything I wanted to know.

With manual testing I spotted one inconsistency between the C sqlite API and the shell command sqlite3 when running SELECT queries.

I can't decide if this is a bug or intended behavior.

In short, if a column of type TEXT contains a value which has the char '\0' in the middle, the full value is shown by the C sqlite API, but the value is truncated up to the first '\0' char when the shell command sqlite3 is used.

A.k.a. it seems that the shell command sqlite3 doesn't use something like fwrite(buff, sizeof(char), byteCount, stdout) and instead it uses something like fprintf(stdout, "%s\n", buff);

I have a code snippet here demonstrating the C sqlite API output:

#include <stdio.h>
#include <sqlite3.h>

int main()
{
        sqlite3* conn;
        int status = sqlite3_open("./db", &conn);

        if (status != SQLITE_OK)
        {
                printf("open failed\n");
                return 1;
        }

        char* errMsg;

        status = sqlite3_exec(conn, "CREATE TABLE t1(value TEXT) STRICT", NULL, NULL, &errMsg);

        if(status != SQLITE_OK)
        {
                printf("create table failed: %s\n", errMsg);
                return 1;
        }

        sqlite3_stmt* statement;

        const char* errMsgConst;
        status = sqlite3_prepare_v2(conn, "INSERT INTO t1 VALUES(?) ", -1, &statement, &errMsgConst);

        if(status != SQLITE_OK)
        {
                printf("create statement failed: %s\n", errMsgConst);
                return 1;
        }

        status = sqlite3_bind_text(statement, 1, "\xff\x00\xff", 3, SQLITE_STATIC);

        if(status != SQLITE_OK)
        {
                printf("bind failed: %s\n", errMsg);
                return 1;
        }

        status = sqlite3_step(statement);

        if(status != SQLITE_DONE)
        {
                printf("step failed: %s\n", errMsg);
                return 1;
        }

        status = sqlite3_finalize(statement);

        if(status != SQLITE_OK)
        {
                printf("finalize failed: %s\n", errMsg);
                return 1;
        }

        status = sqlite3_prepare_v2(conn, "SELECT * FROM t1 ", -1, &statement, &errMsgConst);

        if(status != SQLITE_OK)
        {
                printf("create statement failed: %s\n", errMsgConst);
                return 1;
        }

        status = sqlite3_step(statement);

        if(status != SQLITE_ROW)
        {
                printf("step failed: %s\n", errMsg);
                return 1;
        }

        const unsigned char* text = sqlite3_column_text(statement, 0);
        int byteCount = sqlite3_column_bytes(statement, 0);

        for(int i = 0; i < byteCount; ++i)
        {
                printf("%c at index %d, ascii code %d\n", (char)(text[i]), i, (int)(text[i]));
        }



        //leak everything
}

The output is:

� at index 0, ascii code 255
 at index 1, ascii code 0
� at index 2, ascii code 255

When grabbing the file ./db and running

sqlite3 ./db 'SELECT * FROM t1'

the result is

�

So, is this a bug in the shell command sqlite3? Or is this intended behavior?

Tested with:

$ sqlite3 --version
3.40.1 2022-12-28 14:03:47 df5c253c0b3dd24916e4ec7cf77d3db5294cc9fd45ae7b9c5e82ad8197f38a24

(21) By Stephan Beal (stephan) on 2024-07-21 05:18:02 in reply to 15 [link] [source]

So, is this a bug in the shell command sqlite3? Or is this intended behavior?

From the docs:

The result of expressions involving strings with embedded NULs is undefined.

(23.1) Originally by anonymous with edits by Dan Kennedy (dan) on 2024-08-07 14:17:29 from 23.0 in reply to 21 [link] [source]

Okay, this "undefined" term you showed to me written in the docs, I am not sure what it is supposed to mean, to make the shell command sqlite3 reveal the real length for the TEXT type I used casting to BLOB. BLOB type getting printed doesn't show bytes after the first 0x00 byte, but I guess this is normal.

I am posting this example here if anyone ever needs it for decrypting the real length measured in bytes:

SQLite version 3.7.17 2013-05-20 00:56:22
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table t1(
   ...> valText TEXT,
   ...> valBlob BLOB
   ...> );
sqlite> INSERT INTO t1 VALUES ( CAST(X'4241004241' as TEXT), X'4241004241');
sqlite> SELECT valText, length(valText), typeof(valText), length(hex(valText))/2, valBlob, length(valBlob), typeof(valBlob), length(hex(valBlob))/2 FROM t1;
valText     length(valText)  typeof(valText)  length(hex(valText))/2  valBlob     length(valBlob)  typeof(valBlob)  length(hex(valBlob))/2
----------  ---------------  ---------------  ----------------------  ----------  ---------------  ---------------  ----------------------
BA          2                text             5                       BA          5                blob             5
sqlite>

(24) By Stephan Beal (stephan) on 2024-08-07 14:44:09 in reply to 23.1 [link] [source]

Okay, this "undefined" term you showed to me written in the docs, I am not sure what it is supposed to mean,

https://en.wikipedia.org/wiki/Undefined_behavior

SQLite version 3.7.17 2013-05-20 00:56:22

Note that much has changed in the 11 years since that version was current:

sqlite> select octet_length(CAST(X'4241004241' as TEXT)), octet_length(X'4241004241');
5|5

(8) By Gunter Hick (gunter_hick) on 2024-07-18 20:06:42 in reply to 5 [link] [source]

TEXT is exactly what one qould expect: It is a string of characters terminated by a NUL character. You can get away with using an encoding other than UTF until you ask for a conversion between UTF encodings.

BLOB is also excatly what one would expect: It is a sequence of bytes with no implied structure or meaning and a defined length.

If you (intentionally or not) cast a BLOB to TEXT, it will be truncated at the first NUL character; or, if there is no NUL character inside, there will be one added to the end.

(16) By anonymous on 2024-07-19 13:26:18 in reply to 8 [link] [source]

"TEXT is exactly what one qould expect: It is a string of characters terminated by a NUL character."

With testing we can see that this is not true. Legal utf-8 can have '\0' stuck in the middle in an sqlite3 TEXT value, any programming language that works with utf-8 will accept this too. I learned about this from the man page of utf-8. Refer to comment (15) I wrote in this thread.

(17) By Stephan Beal (stephan) on 2024-07-19 13:31:48 in reply to 16 [link] [source]

Legal utf-8 can have '\0' stuck in the middle in an sqlite3 TEXT value

Noting that the sqlite docs say:

The result of expressions involving strings with embedded NULs is undefined.

(6) By Richard Hipp (drh) on 2024-07-18 17:36:28 in reply to 4 [link] [source]

Just to amplify what Stephan says above:

If you call sqlite3_bind_text() or similar and tell SQLite that the text you are providing is UTF8, then SQLite believes you. If what you told it is not true, and bad things happen later, then that is on you.

Note that you can lie to SQLite about the encoding of text all you want to without harming the database - in the sense that you won't cause a memory error or a crash or anything like that. But you might get back unexpected answers to subsequent queries. If this is a problem for you, then don't lie to the database when you are inserting the text to begin with.

(9) By Anton Dyachenko (ahtoh_) on 2024-07-18 20:25:28 in reply to 6 [link] [source]

It seems to me that some logical consistency is missing here. Sqlite current behaviour is fine when no STRICT keyword is used, but the expectation from strict is more then current behaviour. It is expected not just never convert automatically but also ensure pre conditions in this case ensuring that data is actually valid UTF-8 string.

In C++ std library there's just no facilities to make this check and there are many code that tries to avoid other dependencies as much as possible but sqlite for sure already have this check internally, so at least exposing it to public API would be great but also implicit check in case of strict tables is even better.

Many sqlite users do not work directly with sqlite but instead using wrapper libraries. Therefore in many cases they can't inject that check into the wrapper call. So doing this check in user code means either adding it explicitly before each bind (which requires extraordinarily discipline) or writing their own wrapper on top of the wrapper which is even rarer then discipline.

As an example of how this causes me problems regularly are paths. On windows std::filesystem::path::string returns result in local multibyte encoding not in utf-8 but on other platforms in utf-8. Devs do not expect such differences especially if their working environment unix like, C++20 makes this even worsen as u8string is not convertible to const char* anymore. So I have to fix such issues regularly and if sqlite could always ensure data is valid UTF-8 Text for strict tables it would make my life better as problems noticed earlier.

So in short it is not devs intention lying to sqlite it is lack of possibility to check data and/or genuine mistreated results/weak type safety from other API.

(10) By Richard Hipp (drh) on 2024-07-18 20:31:24 in reply to 9 [link] [source]

sqlite for sure already have this check internally

Actually, it does not. In order for SQLite to start validating UTF8 inputs, we'd have to write a UTF8 validator function.

(12) By Anton Dyachenko (ahtoh_) on 2024-07-19 05:17:01 in reply to 10 [link] [source]

I must not understand then how sqlite converts between utf 8 <->16. In order to read codepoints the convertion has to decode the string. How is it possible to decode without validation and be memory safe? I can understand that validation is embedded inside convertion and doesn't exists as a standalone function but that shouldn't be very hard to separate validation and convertion.

(13) By Nuno Cruces (ncruces) on 2024-07-19 08:01:29 in reply to 12 [link] [source]

UTF-16 conversions can be disabled at compile time.

Also, you're either proposing making SQLite slower for everyone else to satisfy your request, or (realistically, because of strict backwards compatibility promises) another non-default option that needs to be tested.

(19) By Anton Dyachenko (ahtoh_) on 2024-07-19 22:29:14 in reply to 13 [link] [source]

Utf validation by definition can't break compatibility because docs said if you provided invalid text then behaviour is undefined. Undefined behaviour includes returning an error in case a future version decides to validate the input. So I understand that people may be upset with this change and sqlite team may want to do it as a pragma but technically speaking it is a compatible change.

The argument about speed of this extra check is contradicts the main narrative when someone asks why sqlite might be slow in some cases. The first and main performance bottleneck is not CPU but IO. Also utf validation is a solved problem with very fast implementation.

(20) By anonymous on 2024-07-20 01:23:42 in reply to 19 [link] [source]

Since SQLite is a library, you could make your own wrapper for SQLite that does all the checking, scubbing, etc that you want. simply call your function instead of the wrapped function.

That said, I sometimes reminisce of a simpler time and long for the days of 7 bit ASCII (I do not miss BAUDOT or paper tape readers one bit).

Seriously, I think there is slim to zero chance the SQLite library will ever include what you propose simply due to its size, not to mention a whole application space where it is superfluous (and sometimes detrimental).

(22.1) By Nuno Cruces (ncruces) on 2024-07-22 13:51:33 edited from 22.0 in reply to 19 [source]

That's, unfortunately, not how it works for a library in such wide usage as SQLite.

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

Hyrum Wright, 2012

It's on you to convince developers that your proposed change is better for everyone (that benefits will vastly outweigh any breakages).

And many users of SQLite are different from you. How many times should an embed, low resource, system validate that a string is UTF-8 before storing it?

Also in terms of speed, any arguments on SIMD instantly fall flat (where are assembly intrinsics used?), even SWAR would be a stretch, as it'd need to work across 32/64-bit and little/big-endian.

But that's not the issue, the UTF conversion code can be #ifdefed away: because it's optional. Should the validator too?

And really, if this is an afternoon's work, why don't you fix it on your end? Then you can use SIMD and be fast.

(14) By Stephan Beal (stephan) on 2024-07-19 08:56:56 in reply to 12 [link] [source]

How is it possible to decode without validation and be memory safe?

Recoding is of course validated, but the results are, in the general case, undefined if the inputs are not of the expected encoding. Note that, as Richard clarified up-thread, "undefined" in this case does not mean memory faults or database file corruption. It means that the resulting text will necessarily not be a well-formed conversion of the ill-formed input. Garbage in, garbage out.

(18) By Anton Dyachenko (ahtoh_) on 2024-07-19 22:22:01 in reply to 14 [link] [source]

Full validation of utf is pretty much trivial also you can easily find a paper in the internet about simd very fast implementation which is based on state machine approach without simd and also very fast. Basically it is a matter of hours to choose what you want to implement and do it and of course tests likely a day(s). So the work itself is small and a solved problem.

Utf is the standard for all texts with us for a long time and even outlive sqlite 2050 support promise, so having valid text in database is a major quality of a db. I don't really understand why you insisting it doesn't not worth a little effort to ensure text is valid. Also ensuring this outside sqlite is really not a simple problem in general case by many reasons I already provided.

(11) By Tim Streater (Clothears) on 2024-07-18 20:51:44 in reply to 9 [link] [source]

Sometimes it's not the devs lying anyway. If you handle email, as I do, you may receive a chunk of text as an email part, that the sender is insisting is UTF-8. Eventually I learnt to at least do a minimal check on this. What I actually do is just check that the byte sequence follows the UTF-8 encoding rules; there's a decent enough article in Wikipedia about that. And I replace faulty byte sequences with the replacement character. Only then do I store into an SQLite database.