SQLite Forum

[Feature Request] make SQLite support asynchronous io


(1) By Timhe95 on 2021-05-01 13:44:57 [link] [source]

I am not sure this is the right place to ask for a feature request.

With PRAGMA synchronous=FULL, I can make sure the data is persisted to disk on every operation. This is expected to slow down execution. I suppose SQLite implements this behavior using write(2) + fsync(2).


However, the latency of fsync(2) depends on the underlying disk. For example, on some of my NVMe SSDs (e.g., Samsung 980 Pro, WD SN850), this latency is long: experiment result with fio
This result was obtained with:

fio --filename={/dev/nvme0n1, /dev/nvme1n1, /dev/sda} --runtime=15 --size=400G --direct=1 --buffered=0 --rw=randwrite --bs=4k --numjobs=16 --iodepth=1 --group_reporting --name=write --ioengine=sync --fsync={1, 0} --sync={0, 1}

As shown, these two NVMe SSDs have longer latency than the SATA SSD (Samsung 860 EVO). One possible reason is that NVMe SSDs accelerate IO when the IO depth is large (e.g., 32), but with synchronous IO the IO depth degenerates to exactly 1.


Given this observation, I find that with PRAGMA synchronous=FULL, some workloads (fillrandom in db_bench_sqlite3) run slower on an NVMe SSD (WD SN850) than on a SATA SSD (Intel S4510): experiment result with db_bench_sqlite3. Focus on the light-blue bar vs. the dark-green bar when synchronous=FULL/EXTRA. This behavior is even more severe on the Samsung 980 Pro (not drawn in the figure).

Possible Solution

When I change the fio engine from synchronous IO to libaio, something changes: Results. As shown, when using O_SYNC with write_aio(), the NVMe SSDs become faster than the SATA SSD, as expected (I do not understand why write_aio() + fsync() is still problematic). So one possible solution for this problem is to make SQLite3 support asynchronous IO like libaio (io_setup(2), io_destroy(2), io_submit(2), io_cancel(2), io_getevents(2)) or something newer like io_uring.


Asynchronous IO is more friendly to the new NVMe storage devices. Why not make SQLite3 support it? ;)

(2) By Clemens Ladisch (cladisch) on 2021-05-02 15:04:59 in reply to 1 [link] [source]

I do not understand why fsync would be faster when issued through libaio, or with what you would replace it. Can you show how you would implement this in a VFS?

(3) By Timhe95 on 2021-05-03 02:44:34 in reply to 2 [link] [source]

Thanks for the reply.

It seems that this behavior (libaio + fsync) is not well defined on Linux:

[1] https://github.com/axboe/fio/issues/1179
[2] https://github.com/axboe/fio/blob/fio-3.25/engines/libaio.c#L256-L268

The point I want to make actually isn't to use libaio + fsync. Rather, it is to make SQLite support asynchronous IO. libaio is just one of the AIO engines; io_uring is another good one.

The reason is that new storage media like NVMe SSDs provide a highly parallel IO structure. Only asynchronous IO can exploit this parallelism; synchronous IO serializes the IOs.

AFAIK, [this libaio function](https://man7.org/linux/man-pages/man2/io_submit.2.html): int io_submit(aio_context_t ctx_id, long nr, struct iocb **iocbpp); In the iocb argument there is a field aio_rw_flags, one of whose values is RWF_SYNC (the write completes according to the requirements of synchronized I/O file integrity). I think one possible libaio-based solution is to leverage this flag.

Further, I also did some fio tests on new NVMe devices using the io_uring engine. It is very fast and can largely exploit the internal parallelism of NVMe. So would it be possible to make SQLite support this?


(4) By Clemens Ladisch (cladisch) on 2021-05-03 14:27:16 in reply to 3 [source]

But RWF_SYNC applies only to individual write operations.

Cached writes are already handled asynchronously by the kernel. SQLite does not care about individual writes, only that all of them have finished, and fsync() is the correct API for that.

Using an asynchronous API like libaio makes sense only when you have multiple parallel (read) requests, and when you can handle that the results become available in arbitrary order. This is not how the SQLite VFS works.