Besides the page, what is SQLite flushing?

(1) By Timhe95 on 2022-06-29 17:12:44 [source]

I am trouble shooting a performance problem of write-intensive workload, so first, I explore what is SQLite writing down to the device

benchmark tooldb_bench_sqlite3 --page_size=32768
test: fillseqsync

strace:

pwrite64(4, "\0\0\0\2\0\0\0\2\210\375\4\336\243\310\320\236s\205\30\253\267Z\302\4", 24, 1278920) = 24
pwrite64(4, "\n\0\0\0!pg\0w\5xp|\261v\214|8w\367v\23u!z\315qYr\304rK"..., 32768, 1278944) = 32768
fdatasync(4)                            = 0
pwrite64(4, "\0\0\0\2\0\0\0\2\210\375\4\336\243\310\320\236\211^\203%h\274\327\0", 24, 1311712) = 24
pwrite64(4, "\n\0\0\0\"o\356\0w\5xp|\261v\214|8w\367v\23u!z\315qYr\304rK"..., 32768, 1311736) = 32768
fdatasync(4)                            = 0
pwrite64(4, "\0\0\0\2\0\0\0\2\210\375\4\336\243\310\320\236@\325e\266\211\253\242/", 24, 1344504) = 24
pwrite64(4, "\n\0\0\0#ou\0w\5xp|\261v\214|8w\367v\23u!z\315qYr\304ou"..., 32768, 1344528) = 32768
fdatasync(4)                            = 0

I can understand the pwrite64(.., 32768, ...) is writing the data page, but what is the pwrite64(.., 24, ...) actually writing? WAL? why WAL is not arranged in page?

(2) By Simon Slavin (slavin) on 2022-06-29 22:49:17 in reply to 1 [link] [source]

WAL cannot be in page. WAL is where your data is written before the real database file is updated. It is needed to manage multi-user and multi-program access, and to ensure that if your computer crashes no data is lost. For more information see

https://sqlite.org/wal.html

(3.1) By Timhe95 on 2022-06-30 02:47:59 edited from 3.0 in reply to 2 [link] [source]

Thanks for the answer. I am still confused that it prevent data loss in case of crash, why the fdatasync after the data page has been written? I suppose with WAL, the order should be:

pwrite64(4,       WAL,    24, 1278920) = 24
fdatasync(4)                           = 0
pwrite64(4, data page, 32768, 1278944) = 32768

as the comment said:

On a checkpoint, the WAL is first VFS.xSync-ed, then valid content of the WAL is transferred into the database, then the database is VFS.xSync-ed. The VFS.xSync operations serve as write barriers - all writes launched before the xSync must complete before any write that launches after the xSync begins.

And why not to persist WAL in another file (currently both fd=4).

(4) By Rowan Worth (sqweek) on 2022-06-30 04:49:43 in reply to 3.1 [link] [source]

I don't think your straces here are looking at a CHECKPOINT. I'm not a WAL expert by any means but I'd say that fd=4 is your -wal file and that each transaction is appending a 24-byte header followed by the contents of modified pages. Later these will be merged to the main DB (probably fd=3?) via a CHECKPOINT.

I suspect there's no point syncing the WAL header without the associated data because in the event of a crash/power outage you need to be able to have the whole transaction persisted to disk or roll back to the previous transaction.

(5) By Timhe95 on 2022-06-30 09:19:17 in reply to 4 [link] [source]

That really makes sense. Since at each fdatasync, there are two writes (one is in page_size; another is in 24B and will be padding to 4KB by OS, observed by blktrace), but my device (WD-SN850) block size (min-write-unit) is 32KB, where un-aligned write would be slow, just like this one.

I believe many new devices have larger block size, so maybe it is possible that SQLite can split WAL log into a separate directory. e.g., changing from data_store_directory/0001.wal to data_store_directory/wal/0001.wal, and add something like a PRAGMA to control this behavior. In this way, I can make a symbol link of wal to another device. I think this may provide a more flexible env for better performance.

Just like what has been done in MongoDB: directoryForIndexes

Maybe in some versions in the future, would SQLite could take that into consideration?

Thanks!

(6) By Rowan Worth (sqweek) on 2022-06-30 09:41:19 in reply to 5 [link] [source]

In my experience fdatasync is usually the bottleneck -- lots of small transactions are the worst for this so best to batch changes together where possible.

You might find PRAGMA journal_mode=rollback works better in your case as all writes would be aligned in that case. But this changes concurrency semantics; readers will sometimes encounter SQLITE_BUSY when a change is pending and have to retry once the commit has completed. The writer also has to wait for readers to clear so a single long transaction can cause havoc in this mode.

(7) By Timhe95 on 2022-06-30 12:08:06 in reply to 6 [link] [source]

Sure! my benchmark tells me fsync/fdatasync is the root cause. Batching is very encouraging improving direction, so in case WAL can not fit in page and we want to keep the benefits that the read concurrency WAL provides, maybe there may be some intermediate solution. We do have a optane 905p that bear small write but with only limited capacity. So I may like to have a workaround like symbol link.