SQLite3 is far too tolerant of broken text data

(1) By anonymous on 2022-12-02 02:12:09 [source]

The zip archive available at https://www.stacken.kth.se/~blgl/sqlite3-too-tolerant.zip contains two example databases with interesting properties.

  • pragma integrity_check reports no problems for either.

  • The shell's .dump command produces identical output for both.

  • The shell's .sha3sum command produces identical output for both.

  • This query produces different output for each:

    select group_concat(c||x'00','') from t;
    

  • The differences persist across VACUUMs.
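
For readers without the archive at hand, the listed properties can be checked from the shell (a sketch; the database file names inside the zip are assumed):

    sqlite3 db1.sqlite "pragma integrity_check;"    # prints: ok (both DBs)
    sqlite3 db1.sqlite .dump > d1.sql
    sqlite3 db2.sqlite .dump > d2.sql
    cmp d1.sql d2.sql                               # no difference reported
    sqlite3 db1.sqlite "select group_concat(c||x'00','') from t;"
    sqlite3 db2.sqlite "select group_concat(c||x'00','') from t;"  # differs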

(2.1) By Larry Brasfield (larrybr) on 2022-12-02 02:55:49 edited from 2.0 in reply to 1 [link] [source]

(After studying the two DBs, I see that the user error is even more elementary than I thought at first glance. Hence an edit.)

https://sqlite.org/lang_aggfunc.html#group_concat

The behavior which is the subject of your kindly shared opinion is already known and mentioned at the linked doc.1 I am a bit too busy to attempt any education on why this behavior will continue, as I suspect it is obvious to you. However, I must say that serious users of SQLite are well advised to understand its advertised major deviations from what might be expected by those experienced with other DBMSs. Otherwise, they may find themselves properly counted among certain members of the poor.


  1. ^ In particular, the ordering that goes into the inputs of a group_concat(...) matters. Lo, unordered queries can come out in a different order, apparently after a VACUUM.
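
As an aside, the usual idiom to pin down the concatenation order is a sorted subquery, as sketched below (SQLite does not formally promise that the outer aggregate consumes the subquery rows in order, though in practice it does; versions 3.44 and later also accept an ORDER BY inside the aggregate itself):

    select group_concat(c||x'00','')
      from (select c from t order by rowid);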

(3) By anonymous on 2022-12-02 02:52:32 in reply to 2.0 [link] [source]

I'm confused. Section 6 of the Quirks document that you linked is "Aggregate Queries Can Contain Non-Aggregate Result Columns That Are Not In The GROUP BY Clause" which is entirely unrelated to this matter. Did you mean to link something else?

(4.1) By Larry Brasfield (larrybr) on 2022-12-02 03:08:53 edited from 4.0 in reply to 3 [link] [source]

See my corrected post 2.1 -- my original suffered from a misread coupled with not imagining the real cause of the complained-of query result variation.

I am curious to know whether the complainant finds that other DBMSs produce consistently ordered query results where no "ORDER BY" clause is used. In my experience, that is rarely the case for very long.

The complainant might find this helpful, diagnostic pragma to be a redeeming feature of SQLite.

Further, the complainant might be less surprised and/or judgmental if Quirk #12 were examined and its implications understood.1


  1. ^ I suspect the effect of NULs was understood in order to produce the trickery seen in those query results.

(5) By Rowan Worth (sqweek) on 2022-12-02 03:12:57 in reply to 2.1 [link] [source]

I feel like you've missed the mark on this one. Each DB is 2 pages long, and the payload contents are very clearly different:

showdb db1.sqlite 2
Pagesize: 512
Available pages: 1..2
Page 2:   (offsets 0x200..0x3ff)
 000: 0d 00 00 00 1a 01 7e 00 01 fb 01 f6 01 f1 01 ec ......~.........
 010: 01 e7 01 e2 01 dd 01 d8 01 d3 01 ce 01 c9 01 c4 ................
 020: 01 bf 01 ba 01 b5 01 b0 01 ab 01 a6 01 a1 01 9c ................
 030: 01 97 01 92 01 8d 01 88 01 83 01 7e 00 00 00 00 ...........~....
 040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 0a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 0b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 0c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 0d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 0e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 0f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 03 1a ................
 180: 02 0f 73 03 19 02 0f 6e 03 18 02 0f 61 03 17 02 ..s....n....a...
 190: 0f 67 03 16 02 0f 69 03 15 02 0f 6e 03 14 02 0f .g....i....n....
 1a0: 61 03 13 02 0f 6e 03 12 02 0f 65 03 11 02 0f 68 a....n....e....h
 1b0: 03 10 02 0f 73 03 0f 02 0f 20 03 0e 02 0f 63 03 ....s.... ....c.
 1c0: 0d 02 0f 69 03 0c 02 0f 68 03 0b 02 0f 70 03 0a ...i....h....p..
 1d0: 02 0f 61 03 09 02 0f 72 03 08 02 0f 67 03 07 02 ..a....r....g...
 1e0: 0f 6f 03 06 02 0f 6e 03 05 02 0f 61 03 04 02 0f .o....n....a....
 1f0: 67 03 03 02 0f 65 03 02 02 0f 74 03 01 02 0f 73 g....e....t....s
showdb db2.sqlite 2
Pagesize: 512
Available pages: 1..2
Page 2:   (offsets 0x200..0x3ff)
 000: 0d 00 00 00 1a 01 7e 00 01 fb 01 f6 01 f1 01 ec ......~.........
 010: 01 e7 01 e2 01 dd 01 d8 01 d3 01 ce 01 c9 01 c4 ................
 020: 01 bf 01 ba 01 b5 01 b0 01 ab 01 a6 01 a1 01 9c ................
 030: 01 97 01 92 01 8d 01 88 01 83 01 7e 00 00 00 00 ...........~....
 040: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 080: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 090: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 0a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 0b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 0c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 0d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 0e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 0f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 100: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 110: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 120: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 130: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 140: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 150: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 160: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
 170: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 03 1a ................
 180: 02 0f 65 03 19 02 0f 67 03 18 02 0f 61 03 17 02 ..e....g....a...
 190: 0f 73 03 16 02 0f 73 03 15 02 0f 65 03 14 02 0f .s....s....e....
 1a0: 6d 03 13 02 0f 20 03 12 02 0f 74 03 11 02 0f 65 m.... ....t....e
 1b0: 03 10 02 0f 72 03 0f 02 0f 63 03 0e 02 0f 65 03 ....r....c....e.
 1c0: 0d 02 0f 73 03 0c 02 0f 20 03 0b 02 0f 72 03 0a ...s.... ....r..
 1d0: 02 0f 65 03 09 02 0f 68 03 08 02 0f 74 03 07 02 ..e....h....t...
 1e0: 0f 6f 03 06 02 0f 6e 03 05 02 0f 61 03 04 02 0f .o....n....a....
 1f0: 20 03 03 02 0f 74 03 02 02 0f 65 03 01 02 0f 79  ....t....e....y

So how do they result in an identical dump/checksum?

Part of the answer is that some of this information is "invisible" to SQLite -- i.e. the .dump does not preserve any of the characters in the payload.
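
As later posts pin down, each cell stores a single byte of supposed UTF-16 text, which the decoder drops; only a blob cast exposes it. A sketch against the thread's schema t(c):

    -- the text view comes back empty, while the blob cast shows the
    -- lone payload byte each row actually stores
    select c, length(c), hex(cast(c as blob)) from t;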

My question is: can such a database be created through normal SQL operations, or was it handcrafted?

(6) By anonymous on 2022-12-02 03:23:49 in reply to 5 [link] [source]

My question is can such a database be created through normal SQL operations, or was it handcrafted?

No special crafting and you don't even have to go below the SQL level. To add an exclamation to the hidden text, run this:

insert into t values(x'21');

(7) By anonymous on 2022-12-02 03:31:10 in reply to 6 [link] [source]

Sorry, brainfart. A bare x'21' would be stored as a blob (column affinity never converts blobs to text), so the cast is needed. Should be:

insert into t values(cast(x'21' as text));

(8) By Rowan Worth (sqweek) on 2022-12-02 03:54:51 in reply to 7 [link] [source]

Ah, there you go. I note this doesn't cause any surprises on a fresh database, which led me to the full steps to reproduce (yes my version is old, deal with it :P):

$ sqlite3
SQLite version 3.7.17 2013-05-20 00:56:22
sqlite> pragma encoding='utf-16';
sqlite> CREATE TABLE t(c text not null);
sqlite> insert into t values(cast(x'21' as text));
sqlite> select * from t;

sqlite> .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE t(c text not null);
INSERT INTO "t" VALUES('');
COMMIT;

sqlite> select group_concat(c||x'00') from t;
!
sqlite> select hex(c) from t;
21

i.e. it relies on storing incomplete UTF-16 characters, and the relevant SQLite documentation is here:

https://sqlite.org/invalidutf.html#best_effort_to_preserve_text

However I think this represents an exception to this rule:

Generally speaking, if you insert invalid UTF into SQLite, you will get the exact same byte sequence back out, as long as you do not ask SQLite to transform the text in any way.

Especially as the data is not captured in a dump.

(9) By anonymous on 2022-12-02 04:33:29 in reply to 8 [link] [source]

Especially as the data is not captured in a dump.

The .dump command uses UTF-8 only and can't see UTF-16 irregularities.

Anyway, my argument is that supposed UTF-16 text with an odd byte count has a higher degree of wrongness than mere incorrect byte values.
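
That particular degree of wrongness is at least detectable in plain SQL today (a sketch, assuming a UTF-16 database and the thread's table t(c)):

    -- an odd stored byte count can never be valid UTF-16
    select rowid, hex(cast(c as blob)) from t
     where length(cast(c as blob)) % 2 = 1;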

(10) By Larry Brasfield (larrybr) on 2022-12-02 12:04:02 in reply to 9 [link] [source]

the data is not captured in a dump.

I'm not sure that belongs in the defect column. Running sqldiff against the two DBs results in DML to set all fields to '', consistent with the dump output.

Anyway, my argument is that supposed UTF-16 text with an odd byte count has a higher degree of wrongness than mere incorrect byte values.

I more or less agree. Rather than settling for a simplistic GIGO principle or maxim, we should insist on some constant, or acceptable range, of GO/GI.

The difficult part of this will be devising meaningful measures of GI and GO.

(11) By Larry Brasfield (larrybr) on 2022-12-02 13:19:06 in reply to 9 [link] [source]

This setup and the 3 queries shown:

    pragma encoding='UTF-16le';
    attach 'db1.sqlite' as one;
    attach 'db2.sqlite' as two;
    select count(*) from one.t a, two.t b where a.rowid=b.rowid;
    select count(*) from one.t a, two.t b where a.rowid=b.rowid
      and cast(a.c as blob)=cast(b.c as blob);
    select hex(cast(a.c as blob)) from one.t a, two.t b where a.rowid=b.rowid
      and cast(a.c as blob)=cast(b.c as blob);

yield results of 26, then 4, then the four rows 61, 6E, 6F, 61 (in that order.)

Clearly, these are not the same, just as post #1 says.

I do not sympathize much with the critique, given that I see most of it as yet another demo of GIGO in action.

However, I do think that the shell's .sha3sum command should have an option to make its reported digest sensitive to the whole field content of the tables it is asked to hash. I will discuss adding such an option with the other project devs.

(12) By ddevienne on 2022-12-02 13:29:02 in reply to 11 [link] [source]

should have an option to make its reported digest sensitive to the whole field content

When exactly should it be the default to NOT hash the whole field?
Your logic seems backwards to me here, Larry. Of course the whole field should be hashed.

There's also the issue of pragma integrity_check and invalid code-points:
namely, that it doesn't report anything in this case.

Or maybe we should have a new pragma check_text that verifies encoding, and possibly embedded NULLs? (although those, embedded NULLs, may merit their own pragma?)
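
Pending any such pragma, a rough embedded-NUL check can be phrased in plain SQL, relying on the documented behavior that length() on a text value counts characters only up to the first NUL (a sketch for a UTF-16 database with the thread's table t(c); characters outside the BMP would trigger false positives):

    -- more stored bytes than two per counted character suggests either
    -- an embedded NUL or a truncated code unit
    select rowid from t
     where length(cast(c as blob)) > 2*length(c);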

(13) By Larry Brasfield (larrybr) on 2022-12-02 13:37:48 in reply to 12 [link] [source]

When exactly should it be the default to NOT hash the whole field?
Your logic seems backwards to me here, Larry. Of course the whole field should be hashed.

While I tend to agree, as a general proposition, I am not sure this would be consistent with the usual backward compatibility objectives.

A lawyerly argument can be made1 to the effect that, for straightforward uses of the DB, the .sha3sum digest is useful and valid in a sense (perhaps a strained one). For that reason, I am concerned that some existing users of the command could be poorly served by changing the default in a release. And I hesitate to say they will get what's coming to them.


  1. ^ I formulated and considered that argument, but did not like it much.

(14) By ddevienne on 2022-12-02 13:53:50 in reply to 13 [link] [source]

We are not talking about the SQLite core here.

The BC rules are relaxed for non-Core stuff IMHO.
Especially when it comes to fixing what can justifiably be considered a bug.
The fact that the bug is a corner case is all the more reason to fix it, too.

(15) By Larry Brasfield (larrybr) on 2022-12-02 13:54:20 in reply to 12 [link] [source]

(I'm responding separately to what threatens to become an off-topic sub-thread.)

There's also the issue of pragma integrity_check and invalid code-points.

The present docs are clear on what pragma integrity_check results mean, and that is not contradicted by these oddly created DBs.

The present docs are also clear on encoding validity checking. It is wholly the user's responsibility.

(16) By ddevienne on 2022-12-02 14:10:13 in reply to 15 [link] [source]

The present docs are clear on what pragma integrity_check results mean

That's not in dispute.

encoding validity checking ... is wholly the user's responsibility

That's where SQLite could be more helpful, OTOH.

SQLite could (should?) have additional opt-in check pragma(s),
so as to not change the behavior of existing check pragmas.
It already has the necessary support code, around encodings.

Just saying. Whether this happens or not is beyond my control.
And I'm well aware of the dev and maintenance cost, etc...
I just still think it's worth it, long term. My $0.02.

(17) By Larry Brasfield (larrybr) on 2022-12-02 14:36:28 in reply to 14 [link] [source]

(Swapping shyster and advocate hats, and not saying which:)

From the doc on Binding Values ...: "If the third parameter to sqlite3_bind_text16() is not NULL, then it should be a pointer to well-formed UTF16 text."

I would say that when this precondition for well-defined results of API usage is violated, then all bets are off for the results, including what happens with tools.

With that stance, having .sha3sum always do whole-field digest calculation would be fine, with no "do it the usually fine way" option. And if those who start getting different results dislike it and complain, they can be told that the tool was never meant to hide DB differences.

BTW, I will happily argue both sides of a question, and often take the "other" side when it seems an issue can be better developed that way.