batch atomic and synchronous=off is sketchy

(1.2) By Roy Hashimoto (rhashimoto) on 2023-10-27 23:57:17 edited from 1.1 [link] [source]

There seem to be some bad interactions with a VFS that advertises SQLITE_IOCAP_BATCH_ATOMIC and PRAGMA synchronous=OFF on 3.43.1. It looks like this can eventually result in SQLite reporting "database disk image is malformed" (SQLITE_CORRUPT).

I haven't been able to track down the cause of my reproducible case of SQLITE_CORRUPT, but I do have a line on some odd behavior that may be related. The strange behavior is that when synchronous=OFF, batch atomic write is not used but no rollback journal is (usually) used either. Here's a sample log of VFS method calls for a single statement write transaction:

INSERT INTO t1 VALUES(1, 19147, 'nineteen thousand one hundred forty-seven');
xLock /demo 1
xAccess /demo-journal 0
xRead /demo 16 24
xAccess /demo-wal 0
xFileSize /demo
xLock /demo 2
xDeviceCharacteristics
xFileControl /demo 20
xLock /demo 4
xDeviceCharacteristics
xWrite /demo 4096 0
xWrite /demo 4096 4096
xFileControl /demo 21
xFileControl /demo 22
xUnlock /demo 1
xDeviceCharacteristics
xUnlock /demo 0

Note that there are no xFileControl batch atomic codes, and neither is a rollback journal used.

This happens because SQLite disables batch atomic mode when synchronous mode is off with this logic. But the same logic is not used in jrnlBufferSize(), and that is used to set nSpill which makes SQLite always use a memory journal. As I read the code, anyway.

Is this the intended behavior? It does pretty much guarantee corruption if there is a crash in the middle of writing, whereas if a rollback journal were used there is at least a chance it made it onto storage even without a sync. I understand that you get what's coming to you with synchronous=OFF but my guess is this isn't purposeful.

Crashing isn't the source of the corruption in my issue, but I'm thinking it likely has a similar cause within SQLite itself, and maybe even the same cause. I don't really care if this gets fixed - I'm in the "no one should ever do this" camp - but I did want to report it for posterity, and maybe someone should check to see what happens with journaling in other cases that disable batch atomic mode, like when a super-journal is used.

(2.1) By Roy Hashimoto (rhashimoto) on 2023-10-29 23:26:42 edited from 2.0 in reply to 1.2 [source]

I have a test case where I believe SQLite is vulnerable to corruption with full synchronization. It requires a batch atomic write VFS, and this particular test uses FTS5.

Here's the top-level Linux code:

#include <unistd.h>
#include <stdio.h>
#include <sqlite3.h>

extern int vfstrace_register(
   const char *zTraceName,           /* Name of the newly constructed VFS */
   const char *zOldVfsName,          /* Name of the underlying VFS */
   int (*xOut)(const char*,void*),   /* Output routine.  ex: fputs */
   void *pOutArg,                    /* 2nd argument to xOut.  ex: stderr */
   int makeDefault                   /* True to make the new VFS the default */
);

#define EXEC(SQL) do { \
  printf("%s\n", SQL); \
  rc = sqlite3_exec(db, SQL, callback, 0, &zErrMsg); \
  if( rc!=SQLITE_OK ){ \
    fprintf(stderr, "SQL error: %s\n", zErrMsg); \
    sqlite3_free(zErrMsg); \
  } \
} while(0)

static int callback(void *NotUsed, int argc, char **argv, char **azColName){
  int i;
  for(i=0; i<argc; i++){
    printf("%s = %s\n", azColName[i], argv[i] ? argv[i] : "NULL");
  }
  printf("\n");
  return 0;
}

int main(int argc, char **argv){
  sqlite3 *db;
  char *zErrMsg = 0;
  int rc;

  sqlite3_initialize();
  vfstrace_register("trace", NULL, fputs, stdout, 1);

  unlink(argv[1]);
  rc = sqlite3_open(argv[1], &db);
  if( rc ){
    fprintf(stderr, "Can't open database: %s\n", sqlite3_errmsg(db));
    sqlite3_close(db);
    return(1);
  }

  EXEC("PRAGMA page_size = 4096;");
  EXEC("DROP TABLE IF EXISTS test_fts;");
  EXEC("CREATE VIRTUAL TABLE IF NOT EXISTS test_fts USING fts5(fieldA, fieldB, fieldC, fieldD, fieldE, id UNINDEXED, prefix=2, detail=none);");
  for (int i = 0; i < 877; ++i) {
    char sql[1024];
    sprintf(sql, "INSERT INTO test_fts (fieldA, fieldB, fieldC, fieldD, fieldE, id) VALUES ('test', 'test', 'test', 'test', 'test', '%d');", i);
    EXEC(sql);
  }
  sqlite3_close(db);
  return 0;
}

The VFS is available here. This is just a lightly hacked test_vfstrace.c (the example logging shim VFS) with these changes:

The return value of xDeviceCharacteristics is augmented with SQLITE_IOCAP_BATCH_ATOMIC.
The return value of xFileControl is changed to SQLITE_OK for SQLITE_FCNTL_{BEGIN,COMMIT,ROLLBACK}_ATOMIC_WRITE.

These hacks falsely tell SQLite that writes in a batch group are atomic, which should be harmless as long as no crash or rollback occurs.

Build like this:

gcc -o batest -I. -DSQLITE_ENABLE_BATCH_ATOMIC_WRITE -DSQLITE_ENABLE_FTS5 batest.c batest_vfstrace.c sqlite3.c -lm

And run like this:

./batest batest.db > batest.log

The log for the last INSERT transaction looks like this (line numbers added):

 22861	INSERT INTO test_fts (fieldA, fieldB, fieldC, fieldD, fieldE, id) VALUES ('test', 'test', 'test', 'test', 'test', '876');
 22862	trace.xLock(abc.db,SHARED) -> SQLITE_OK
 22863	trace.xAccess("/home/dev/wa-sqlite/deps/sqlite-amalgamation-3430200/abc.db-journal",0) -> SQLITE_OK, out=0
 22864	trace.xRead(abc.db,n=16,ofst=24) -> SQLITE_OK
 22865	trace.xAccess("/home/dev/wa-sqlite/deps/sqlite-amalgamation-3430200/abc.db-wal",0) -> SQLITE_OK, out=0
 22866	trace.xFileSize(abc.db) -> SQLITE_OK, size=69632
 22867	trace.xLock(abc.db,RESERVED) -> SQLITE_OK
 22868	trace.xDeviceCharacteristics(abc.db) -> 0x00005000
 22869	trace.xFileControl(abc.db,20) -> SQLITE_OK
 22870	trace.xDeviceCharacteristics(abc.db) -> 0x00005000
 22871	trace.xLock(abc.db,EXCLUSIVE) -> SQLITE_OK
 22872	trace.xDeviceCharacteristics(abc.db) -> 0x00005000
 22873	trace.xDeviceCharacteristics(abc.db) -> 0x00005000
 22874	trace.xFileControl(abc.db,BEGIN_ATOMIC_WRITE) -> overridden
 22875	trace.xFileControl(abc.db,SIZE_HINT,77824) -> SQLITE_OK
 22876	trace.xWrite(abc.db,n=4096,ofst=0) -> SQLITE_OK
 22877	trace.xWrite(abc.db,n=4096,ofst=4096) -> SQLITE_OK
 22878	trace.xWrite(abc.db,n=4096,ofst=8192) -> SQLITE_OK
 22879	trace.xWrite(abc.db,n=4096,ofst=57344) -> SQLITE_OK
 22880	trace.xWrite(abc.db,n=4096,ofst=65536) -> SQLITE_OK
 22881	trace.xWrite(abc.db,n=4096,ofst=69632) -> SQLITE_OK
 22882	trace.xFileControl(abc.db,COMMIT_ATOMIC_WRITE) -> overridden
 22883	trace.xFileSize(abc.db) -> SQLITE_OK, size=73728
 22884	trace.xFileControl(abc.db,SIZE_HINT,77824) -> SQLITE_OK
 22885	trace.xWrite(abc.db,n=4096,ofst=73728) -> SQLITE_OK
 22886	trace.xFileControl(abc.db,21) -> 12
 22887	trace.xSync(abc.db,FULL) -> 0
 22888	trace.xFileControl(abc.db,22) -> 12
 22889	trace.xUnlock(abc.db,SHARED) -> SQLITE_OK
 22890	trace.xDeviceCharacteristics(abc.db) -> 0x00005000
 22891	trace.xUnlock(abc.db,NONE) -> SQLITE_OK
 22892	trace.xFileControl(abc.db,20) -> SQLITE_OK
 22893	trace.xDeviceCharacteristics(abc.db) -> 0x00005000
 22894	trace.xUnlock(abc.db,NONE) -> SQLITE_OK
 22895	trace.xClose(abc.db) -> SQLITE_OK

The problem is at line 22885, which is a write outside of the batch atomic group on lines 22874-22882. If there is a crash between committing the group and the problem write then the database file will be inconsistent - at the least the page count in the header will be wrong (because that is an appending write). It's a small window but that looks like a potential vulnerability to me.

Addendum: Note that if you stick EXEC("PRAGMA synchronous=OFF;"); into the test program then you can use it to reproduce the lack of a rollback journal described in the top post.

(4) By Richard Hipp (drh) on 2023-10-30 13:05:42 in reply to 2.1 [link] [source]

Issue should now be resolved on trunk. Simplified demonstration of the problem:

#include <unistd.h>
#include <stdio.h>
#include <sqlite3.h>

extern int vfstrace_register(const char*,const char*,int(*)(const char*,FILE*),void*,int);

#define EXEC(SQL) do { \
  printf("%s\n", SQL); \
  rc = sqlite3_exec(db, SQL, 0, 0, &zErrMsg); \
  if( rc!=SQLITE_OK ){ \
    fprintf(stderr, "SQL error: %s\n", zErrMsg); \
    sqlite3_free(zErrMsg); \
  } \
} while(0)

int main(int argc, char **argv){
  sqlite3 *db;
  char *zErrMsg = 0;
  int rc;
  sqlite3_initialize();
  vfstrace_register("trace", NULL, fputs, stdout, 1);
  unlink(argv[1]);
  rc = sqlite3_open(argv[1], &db);
  if( rc ){
    fprintf(stderr, "Can't open database: %s\n", sqlite3_errmsg(db));
    return(1);
  }
  EXEC("CREATE TABLE t1(x);");
  EXEC("BEGIN;");
  EXEC("INSERT INTO t1 VALUES(randomblob(100000));");
  EXEC("DELETE FROM t1 WHERE rowid=last_insert_rowid();");
  EXEC("COMMIT;");
  sqlite3_close(db);
  return 0;
}

(3) By Roy Hashimoto (rhashimoto) on 2023-10-30 12:55:18 in reply to 1.2 [link] [source]

Crashing isn't the source of the corruption in my issue, but I'm thinking it likely has a similar cause within SQLite itself, and maybe even the same cause.

The non-crash SQLITE_CORRUPT case I mention at the beginning of the top post was due to a bug in my batch atomic VFS. It is related to the unexpected SQLite library behavior with batch atomic under discussion only because that happened to expose my bug.