IS NULL optimization on NOT NULL constraints breaks on corrupted databases

(1) By Even Rouault (rouault) on 2021-11-19 21:25:31 [link]

Create a simulated corrupted foo.db with
```
create table t(id integer primary key not null,v integer);
insert into t values(1,NULL);
PRAGMA writable_schema=ON;
update sqlite_master set sql='create table t(id integer primary key not null,v integer not null);' where name='t';
```
with sqlite 3.35 or later,
```
./sqlite3 foo.db "select id, v from t where v is not null"
```
returns
```
1|
```
The expected result would have for it to not return that row, as in previous versions.

This causes a regression in the GDAL open source project which trusts the "v is not null" and dereferences values in that field without testing them against NULL.

git bisects on the git mirror points to the following commit:

```
commit 8ddf686267c58ee0e519cfd9024f177e606be0ab
Author: dan <Dan Kennedy>
Date:   Fri Feb 26 20:14:32 2021 +0000

    Attempt to optimize "x IS NULL" and "x IS NOT NULL" expressions when x is a column with a NOT NULL constraint.
    
    FossilOrigin-Name: 5ecd842555009ce27ee6390325ac5c2504143474b12b730933f0833b3dad788a
```

(2) By Richard Hipp (drh) on 2021-11-19 21:42:04 in reply to 1 [link]

> This causes a regression in the GDAL open source project

Please help me to understand why the GDAL project is concerned that you
are getting an incorrect answer from a query against an admittedly corrupt
database file?  Why is this not just a case of [GIGO][1]?

[1]: https://en.wikipedia.org/wiki/Garbage_in,_garbage_out

(3) By Keith Medcalf (kmedcalf) on 2021-11-19 21:56:56 in reply to 1 [link]

What do you expect to happen when **you** corrupt the database?


```
sqlite> pragma integrity_check;
┌───────────────────┐
│  integrity_check  │
├───────────────────┤
│ NULL value in t.v │
└───────────────────┘
```

The closest analogy is that **you** put an advertizement in the newspaper to sell your green car.  However, before the advertizement was printed, **you** painted the car blue.  However, the advertisement, when it appeared, was still tying to sell a green car.

**You** are not entitled to claim that the advertizement is now incorrect.

If **you** deliberately and with aforethought decide to make the database *inconsistent* and *incoherent*, and cannot be bothered to ensure the integrity of the database following you *coniptions*, they why would you place the blame for **your** failure on anyone else?

This reminds me of the old "Doctor, Doctor, in hurts when I do this" the patient complains as he pokes himself in the eye.  "Well do not do that then" replies the Doctor.

I believe the appropriate observation is that you been **"hoisted by your own petard"**.

(4) By Even Rouault (rouault) on 2021-11-19 22:46:20 in reply to 2 [link]

The issue here was that GDAL segfaulted when opening that database because of the null dereference of a corrupted record. And GDAL cannot know in advance that the database is corrupted (the issue was discovered by OSSFuzz actually). Running integrity checking on database opening could be very slow on databases of several tens of gigabytes.
I've implemented a workaround on application side, but I just wanted to raise the issue as I was surprised by this change of behavior. Having SQLite return an error on that request would also have been a reasonable behavior, but here it behaves silently as if everything was correct.

(5) By Richard Hipp (drh) on 2021-11-20 02:48:11 in reply to 4 [link]

> The issue here was that GDAL segfaulted when opening that database because of the null dereference of a corrupted record. 

Did SQLite segfault?  If so that would be a bug that we need to look into.

Or, did GDAL do an sqlite3_column_text() and get back a NULL pointer, then
dereference the NULL pointer, thus causing a segfault.  That would be a bug
in GDAL.  The sqlite3_column_text() might return NULL, even for a NOT NULL
column, for example following an OOM error.  That's part of the specification
of how sqlite3_column_text() works.  If you were not checking for a NULL return
from sqlite3_column_text() and it always worked before, that is because you
were lucky.  It would be to same if you failed to check for a NULL return
from malloc().

(6.2) By Even Rouault (rouault) on 2021-11-20 11:02:41 edited from 6.1 in reply to 5 [link]

> did GDAL do an sqlite3_column_text() and get back a NULL pointer, then dereference the NULL pointer, thus causing a segfault. That would be a bug in GDAL. The sqlite3_column_text() might return NULL, even for a NOT NULL column, for example following an OOM error

Fair enough. Here, it was sqlite3_get_table() that was used. Can it return a success error code and return NULL in one of the pazResult[] items, in situtations where this would normally not happen (like an OOM error as you mentioned) ?

(7) By Richard Hipp (drh) on 2021-11-20 11:35:28 in reply to 6.2 [link]

No, sqlite3_get_table() will not invoke the callback after hitting an OOM.
So that particular failure mode will not happen with sqlite3_get_table().
If one of the pazResult[] items passed into sqlite3_get_table() is NULL, that
means the value really is NULL.  Or, at least that is what it appears to me,
assuming I didn't overlook something.

Even so, applications should not rely on a NOT NULL constraint in the schema
to ensure that a pazResult[] value is never NULL.  An attacker might change the
schema of the database, without the application knowing it, to remove the
NOT NULL constraint, for example.  It does not require a corrupt database
to cause the problems in GDAL - just a database with a maliciously modified
schema.

(8) By Scott Robison (casaderobison) on 2021-11-20 11:41:39 in reply to 6.2 [link]

To add to what drh wrote, from the documentation for sqlite3_get_table() (both of which he probably also wrote):

> NULL values result in NULL pointers. All other values are in their UTF-8 zero-terminated string representation as returned by sqlite3_column_text().

So the documented interface says it is possible for the table to include NULL values and it does not allow wiggle room for "unless the schema" exceptions.

(9) By Even Rouault (rouault) on 2021-11-20 13:22:48 in reply to 7 [link]

> Even so, applications should not rely on a NOT NULL constraint in the schema to ensure that a pazResult[] value is never NULL. An attacker might change the schema of the database, without the application knowing it, to remove the NOT NULL constraint, for example. It does not require a corrupt database to cause the problems in GDAL - just a database with a maliciously modified schema.

Right, that's why my query included a "WHERE ... IS NOT NULL" clause, hoping that it would not return NULL pasResult[] values in the column that corresponds to the IS NOT NULL where clause, even if the database was corrupted. But with the new optimization, this WHERE clause is ignored when the table schema has a NOT NULL constraint.

Thanks for the feedback. No need to spend more energy on this as this is something that can be dealt on application side.

(10) By Richard Hipp (drh) on 2021-11-20 13:23:24 in reply to 1 [link]

The [documentation has been updated][1] to make it clearer and to state more
directly that constraints are checked during changes to the database but not
during queries from the database.

[1]: https://www.sqlite.org/docsrc/info/23603e03cdaffd96

(11) By MBL (UserMBL) on 2021-11-20 13:29:15 in reply to 1

>
>~~~
>PRAGMA writable_schema = boolean;
>
>When this pragma is on, and the SQLITE_DBCONFIG_DEFENSIVE flag is off, then the sqlite_schema table can be changed using ordinary UPDATE, INSERT, and DELETE statements.
>
>Warning: misuse of this pragma can easily result in a corrupt database file.
>~~~
>

see [pragma writable_schema](https://sqlite.org/pragma.html#pragma_writable_schema) in documentation.

Inserting data and then invalidating the constraints by this pragma is clearly a misuse from my point of view.
You should have cleaned the data before adaptation of the schema for this change. The schema change is not doing that.

(12) By Even Rouault (rouault) on 2021-11-20 14:45:11 in reply to 11 [link]

I know that changing the schema this way resulted in a corrupted database: I just used it as a simple way of creating a corrupted database for reproducibility (my real use case was a untrusted database generating through fuzzing)

The point I wanted to underline in this thread is that a query with "WHERE v IS NOT NULL" can now return rows with v being NULL on a corrupted database, when the table definition has a NOT NULL constraint on v. Which isn't obvious, and a change of behavior with respect to older SQLite versions

(13) By Larry Brasfield (larrybr) on 2021-11-20 17:00:23 in reply to 12 [link]

> \[Regarding effects of changing sqlite_schema table via UPDATE\] ...<br>
Which isn't obvious, and a change of behavior with respect to older SQLite versions

The effect of making your own changes to the schema table is undocumented and undefined, and always has been. Changes in undocumented behavior are sometimes intentionally avoided as the project evolves, but not always and not in any way that users should rely upon.

Your effort to inform about this particular change of behavior is appreciated. You may consider it a warning about reliance upon undocumented behavior.

(14.1) By Scott Robison (casaderobison) on 2021-11-21 00:07:30 edited from 14.0 in reply to 13 [link]

It seems to me that multiple people in this thread have been fixated on the reproduction example he gave vs the spirit of the report. The original report did see a legitimately corrupted database that did not use the undocumented and undefined behavior of rewriting the schema, but writing the schema was used to provide a simple test case. We always appreciate reproduction steps.

As a result, there was a dependence on NOT NULL on a column that by definition could not be NULL, but due to the bug was now NULL.

This results in a paradox. The schema says "this column can't be NULL" yet "something is NULL". So what is the query planner to do? The QP can do anything at all at that point because by some definition, anything it does will be wrong.

I think what SQLite is doing is perfectly acceptable and the real problem is the "legitimately corrupted database" they were dealing with. I can see an argument for "engine should always sanity check data" but then the query planner can never optimize anything because it can no longer make assumptions about validity. "NOT NULL" seems like a simple case, but one can imagine very complex CHECK constraints that would be even harder to verify at SELECT time for every query.

The most sensible thing is "INSERT is responsible for ensuring constraints are enforced" and that the rest of the engine can safely assume the data meets that criteria. If something is deliberately or accidentally corrupted, I don't think it should be SELECT/UPDATE/DELETE's job to enforce that.

But the question expressing surprise is not unreasonable, and the writable schema was a red herring.

(15) By Larry Brasfield (larrybr) on 2021-11-21 02:33:01 in reply to 14.1 [link]

You make a good point regarding the spirit versus ostensible thrust of the OP's original post. I have to admit being unsure what its point really was<sup>a</sup>, and latching onto its most obvious aspect.

However, taking "revised schema only" as a stand-in for "data altered to no longer be consistent with schema", my point about the discrepancy remains: A corrupt DB, whether made corrupt by simple, repeatable or mysterious means, is not going to have defined behavior or even behavior that should be expected to remain the same across SQLite releases.

There have been many recent changes to how the library detects and responds to DB corruption. They help avoid (further) data loss and ease the task of avoiding/reducing undefined behavior in the C/C++ UB sense. This kind of evolution simply cannot be confined to perpetuate the behavior (or misbehavior) of past releases under such conditions.

Your point about sanity checks versus performance is well taken.

----
a. During my perplexity on the point, I've been tempted to parody it, along lines of" "I crossed out 'Apples' on this bag of fruit and wrote 'Oranges', then pulled an apple out." But it was too crude and not quite on point. And likely to seem unkind.

(16) By Keith Medcalf (kmedcalf) on 2021-11-21 03:17:51 in reply to 15 [link]

>  During my perplexity on the point, I've been tempted to parody it, along lines of" "I crossed out 'Apples' on this bag of fruit and wrote 'Oranges', then pulled an apple out." But it was too crude and not quite on point. And likely to seem unkind.

Actually, that is a very good analogue, in more ways than one!

The 'bag of fruit' represents a whole database.  
The 'label' represents the database schema.  
The 'apple' represents the database contents.

So if you receive a 'bag of fruit' that is labelled 'oranges', and pull out an 'apple' you can conclude that the 'bag of fruit' is corrupt.  Clearly when receiving a 'bag of fruit' from a 'source of ill-repute' if might be advantageous to check that the contents of the bag match the label.

Luckily in SQLite3 there is command(s) to automate this inspection to ensure that the contents of the bag match the label -- `pragma integrity_check` and `pragma foreign_key_check` -- if both of these pass then the label (schema) and contents (data) are consistent and not corrupt.