WAL mode: No VFS writes on first sqlite3_step of statement with RETURNING

(1.2) By Cole Miller (colemiller) on 2023-02-14 20:44:45 edited from 1.1 [source]

Hi -- I'm finding myself surprised by the behavior of the first call to sqlite3_step on a prepared statement like INSERT INTO test (n) VALUES (123) RETURNING * when the database is in WAL mode. Here's a reproducer script:

#!/bin/sh

# No arguments. Run in a directory that contains sqlite3.c, sqlite3.h, and vfstrace.c.

set -e

rm -f prog.c prog foo.sqlite3* *.log

cat >prog.c <<EOF
#include "sqlite3.h"

#include <assert.h>
#include <stdio.h>

int vfstrace_register(
	const char *zTraceName,
	const char *zOldVfsName,
	int (*xOut)(const char*,void*),
	void *pOutArg,
	int makeDefault
);

int main(void) {
	sqlite3 *db;
	sqlite3_stmt *stmt;
	int rv;

	rv = vfstrace_register("tracing", "unix", (void *)fputs, stderr, 1);
	assert(rv == 0);
	rv = sqlite3_open("foo.sqlite3", &db);
	assert(rv == 0);
	rv = sqlite3_exec(db, "PRAGMA journal_mode=WAL", NULL, NULL, NULL);
	assert(rv == 0);
	rv = sqlite3_exec(db, "CREATE TABLE test (n INT)", NULL, NULL, NULL);
	rv = sqlite3_prepare_v2(db, "INSERT INTO test (n) VALUES (123) RETURNING *", -1, &stmt, NULL);
	assert(rv == 0);
	do {
		fprintf(stderr, "###\n### sqlite3_step\n###\n");
		rv = sqlite3_step(stmt);
	} while (rv == SQLITE_ROW);
	assert(rv == SQLITE_DONE);

	return 0;
}
EOF

cc -o prog prog.c sqlite3.c vfstrace.c
./prog 2>with-returning.log

rm -f prog foo.sqlite3*

sed -i -e 's/ RETURNING \*//g' prog.c

cc -o prog prog.c sqlite3.c vfstrace.c
./prog 2>without-returning.log

Running against SQLite v3.40.1, I get this trace in with-returning.log (trimming the stuff that happens before any sqlite3_step):

###
### sqlite3_step
###
tracing.xShmBarrier(foo.sqlite3)
tracing.xShmLock(foo.sqlite3,ofst=4,n=1,LOCK|EXCLUSIVE) -> SQLITE_OK
tracing.xShmLock(foo.sqlite3,ofst=4,n=1,UNLOCK|EXCLUSIVE) -> SQLITE_OK
tracing.xShmLock(foo.sqlite3,ofst=4,n=1,LOCK|SHARED) -> SQLITE_OK
tracing.xShmBarrier(foo.sqlite3)
tracing.xShmLock(foo.sqlite3,ofst=0,n=1,LOCK|EXCLUSIVE) -> SQLITE_OK
###
### sqlite3_step
###
tracing.xWrite(foo.sqlite3-wal,n=24,ofst=8272) -> SQLITE_OK
tracing.xWrite(foo.sqlite3-wal,n=4096,ofst=8296) -> SQLITE_OK
tracing.xSync(foo.sqlite3-wal,FULL) -> 0
tracing.xShmBarrier(foo.sqlite3)
tracing.xShmLock(foo.sqlite3,ofst=0,n=1,UNLOCK|EXCLUSIVE) -> SQLITE_OK
tracing.xFileControl(foo.sqlite3,22) -> 12
tracing.xShmLock(foo.sqlite3,ofst=4,n=1,UNLOCK|SHARED) -> SQLITE_OK

And this trace in without-returning.log (ditto):

###
### sqlite3_step
###
tracing.xShmBarrier(foo.sqlite3)
tracing.xShmLock(foo.sqlite3,ofst=4,n=1,LOCK|EXCLUSIVE) -> SQLITE_OK
tracing.xShmLock(foo.sqlite3,ofst=4,n=1,UNLOCK|EXCLUSIVE) -> SQLITE_OK
tracing.xShmLock(foo.sqlite3,ofst=4,n=1,LOCK|SHARED) -> SQLITE_OK
tracing.xShmBarrier(foo.sqlite3)
tracing.xShmLock(foo.sqlite3,ofst=0,n=1,LOCK|EXCLUSIVE) -> SQLITE_OK
tracing.xWrite(foo.sqlite3-wal,n=24,ofst=8272) -> SQLITE_OK
tracing.xWrite(foo.sqlite3-wal,n=4096,ofst=8296) -> SQLITE_OK
tracing.xSync(foo.sqlite3-wal,FULL) -> 0
tracing.xShmBarrier(foo.sqlite3)
tracing.xShmLock(foo.sqlite3,ofst=0,n=1,UNLOCK|EXCLUSIVE) -> SQLITE_OK
tracing.xFileControl(foo.sqlite3,22) -> 12
tracing.xShmLock(foo.sqlite3,ofst=4,n=1,UNLOCK|SHARED) -> SQLITE_OK

It seems that when the RETURNING clause is present, writing to the WAL is deferred until the second call to sqlite3_step. I was surprised by this because the documentation states:

When a DELETE, INSERT, or UPDATE statement with a RETURNING clause is run, all of the database changes occur during the first invocation of sqlite3_step().

Is this expected behavior in WAL mode?

(2) By Keith Medcalf (kmedcalf) on 2023-02-14 21:01:02 in reply to 1.2 [link] [source]

Richard can probably give a better answer or correct if this is wrong.

However, you appear to be talking about writes to the filesystem. Writes to the filesystem only occur at commit time, not willy-nilly (unless cache is exhausted and pages are spilled).

This means that although all the changes are "made" during the first step, they are not "written to disk" until commit time (this is the end of the statement which made the changes, or when a COMMIT statement executes, assuming that autocommit is turned off) -- just before SQLITE_DONE is returned.

A statement is a "generator" which produces rows. If the statement does "updates" to the underlying tables, those "updates" are done during the very first step. Without RETURNING, the statement completes without returning any rows and the autocommit machinery writes the changes to storage just before the "generator" exits and returns SQLITE_DONE.

With RETURNING, the "updates" are still done in the first step, however, you now have to "returning" rows. So each returning row is returned with SQLITE_ROW. When the statement is stepped after the last row is retrieved, the "generator" is finished "generating" and prepares to exit. This causes the autocommit machinery to execute which writes the changes to storage. Then the statement/generator/vm terminates and returns SQLITE_DONE.

(4) By Cole Miller (colemiller) on 2023-02-14 21:23:32 in reply to 2 [link] [source]

Thanks, that makes sense.

(6) By Cole Miller (colemiller) on 2023-02-14 22:06:15 in reply to 2 [link] [source]

A follow-up question: is it possible to force a page cache flush after the first call to sqlite3_step (without using the experimental application-defined cache feature)? I've found that sqlite3_db_cacheflush won't do the trick, presumably because the page in question is "used" by the statement being stepped.

(7) By Keith Medcalf (kmedcalf) on 2023-02-14 23:28:45 in reply to 6 [link] [source]

What are you trying to accomplish by flushing the cache?

(9) By Cole Miller (colemiller) on 2023-02-15 17:10:39 in reply to 7 [link] [source]

We have a custom in-memory VFS that records writes to the WAL and replicates the written pages over the network. The application that manages the SQLite database gets requests to execute SQL statements from clients, and when statement execution yields some rows, we want to send them back to the client in question. Historically, we haven't supported executing statements like DELETE FROM ... RETURNING * that both modify the database and return rows; I'm working on adding support for such statements. Since we don't want to return any rows to the client until all concomitant database changes have been replicated, it would be useful to arrange that all those changes happen on the first call to sqlite3_step, so that we don't have to buffer up all the yielded rows while waiting for SQLITE_DONE. (It wouldn't be the end of the world if it turns out we do have to buffer the rows, especially since SQLite apparently does such buffering internally for RETURNING -- but if there's a sane and convenient way to flush the cache after the first sqlite3_step, I'd like to explore that as well.)

(8) By Keith Medcalf (kmedcalf) on 2023-02-15 02:27:47 in reply to 2 [link] [source]

Note that even if you reset the statement (with sqlite3_reset/sqlite3_finalize) after the first step you will only abort the generation of the "returning" generator. The changes have already been made and will be committed when the transaction commits (either implicitly or explicity).

(3) By Richard Hipp (drh) on 2023-02-14 21:06:10 in reply to 1.1 [link] [source]

Section 2.2 has been added to the documentation in an effort to clarify what is going on.

(5) By Cole Miller (colemiller) on 2023-02-14 21:24:00 in reply to 3 [link] [source]

I appreciate the clarification!