Using raw block device for read-only immutable database
(1) By Ciprian Dorin Craciun (ciprian.craciun) on 2024-09-18 11:50:13 [source]
First of all, there seem to have been a few similar questions in the past, but their use-cases are a bit more complex and perhaps incompatible with block devices.
So, here is my use-case (which I do believe is quite compatible):
- I have an SQLite database file (in consistent state, i.e. without any pending WAL or journal files) that represents a read-only dataset (as in immutable, that never changes, never appends); (if the database needs to change, a new file is created;)
- I want to expose this database file inside a micro-VM (like for example QEMU microVM profile, AWS's Firecracker, etc.), so that read-only queries can be executed against it;
- I want to expose the database file as a block device (i.e. `/dev/sdX` or `/dev/vdX`) inside the micro-VM;
- (due to performance and security reasons, I don't want to create a file-system image containing only the database file, nor can I export the host file-system;)
I understand from previous posts that one can implement a custom VFS plugin that handles this (for example based on `test-onefile`), which would be quite simple because there are no additional temporary files, no locks, no writes, no file enlargement, etc. However, that would mean injecting code into the micro-VM file-system, and I would like to exclude that if possible.
However, at first glance, given that `/dev/sdX` and `/mnt/.../file.db` behave quite similarly with regard to read I/O, I would expect that just using something like `sqlite3 'file:/dev/loop0?mode=ro&immutable=1&vfs=unix-none' '.tables'` should work.
Unfortunately, it doesn't. It doesn't error, it doesn't give any warnings, it just fails to find any tables.
I've run `strace` on the above command; the trace is quite similar to what happens on the backing file, but all SQLite does is:
```
openat(AT_FDCWD, "/dev/loop0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC) = 3
fstat(3, {st_dev=makedev(0, 0x6), st_ino=450, st_mode=S_IFBLK|0664, st_nlink=1, st_uid=0, st_gid=492, st_blksize=4096, st_blocks=0, st_rdev=makedev(0x7, 0), st_atime=1726659507, st_atime_nsec=538183401, st_mtime=1726656353, st_mtime_nsec=179655969, st_ctime=1726656353, st_ctime_nsec=179655969}) = 0
fstat(3, ...) = 0
pread64(3, "SQLite format 3\0\20\0\1\1\0@ \0\0\0\3\0\0\1\4"..., 100, 0) = 100
brk(0x55a77b2b7000) = 0x55a77b2b7000
## Writes nothing and exits.
close(3) = 0
exit_group(0) = ?
```
As opposed to the following for the proper file (which also backed `/dev/loop0` in the previous run):
```
openat(AT_FDCWD, "/mnt/.../file.db", O_RDONLY|O_NOFOLLOW|O_CLOEXEC) = 3
fstat(3, {st_dev=makedev(0, 0x4e), st_ino=102677, st_mode=S_IFREG|0600, st_nlink=1, st_uid=10101, st_gid=10101, st_blksize=4096, st_blocks=2080, st_size=1064960, st_atime=1726655895, st_atime_nsec=269164315, st_mtime=1726655882, st_mtime_nsec=675907110, st_ctime=1726655882, st_ctime_nsec=675907110}) = 0
fstat(3, ...) = 0
pread64(3, "SQLite format 3\0\20\0\1\1\0@ \0\0\0\3\0\0\1\4"..., 100, 0) = 100
brk(0x5608b45ef000) = 0x5608b45ef000
## The following is missing from the previous snippet!
pread64(3, "SQLite format 3\0\20\0\1\1\0@ \0\0\0\3\0\0\1\4"..., 4096, 0) = 4096
fstat(1, {st_dev=makedev(0, 0x19), st_ino=12, st_mode=S_IFCHR|0620, st_nlink=1, st_uid=10101, st_gid=5, st_blksize=1024, st_blocks=0, st_rdev=makedev(0x88, 0x9), st_atime=1726659560, st_atime_nsec=0, st_mtime=1726659560, st_mtime_nsec=0, st_ctime=1726654333, st_ctime_nsec=314568584}) = 0
## Writes found tables and exits.
write(1, "t\n", 2) = 2
close(3) = 0
exit_group(0) = ?
```
The only difference seems to be that, in the case of the block device, `fstat` returns `st_size=0` (and `st_blocks=0`); still, SQLite tries to read the first 100 bytes and succeeds, but then it just gives up.
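A quick way to confirm this outside of strace is a tiny throwaway probe (my own sketch, not SQLite code) comparing what `fstat` reports against the size obtained by seeking to the end:

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Print fstat's st_size next to the lseek(SEEK_END) size for a path.
   For a regular file the two agree; for a block device such as
   /dev/loop0, st_size is 0 while lseek still reports the real size. */
long long probe_sizes(const char *path) {
  int fd = open(path, O_RDONLY);
  if (fd < 0)
    return -1;
  struct stat st;
  fstat(fd, &st);
  off_t end = lseek(fd, 0, SEEK_END);
  printf("%s: st_size=%lld, lseek(SEEK_END)=%lld\n",
         path, (long long)st.st_size, (long long)end);
  close(fd);
  return (long long)end;
}
```

Running this against `/dev/loop0` and the backing file makes the `st_size=0` discrepancy immediately visible.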
Any ideas on how I could convince the standard SQLite implementation to just use the raw block device?
Thanks.
(2) By Bo Lindbergh (_blgl_) on 2024-09-18 13:12:22 in reply to 1 [link] [source]
Not possible. The standard implementation relies on `st_size`.
(3) By Ciprian Dorin Craciun (ciprian.craciun) on 2024-09-18 14:10:16 in reply to 2 [link] [source]
> The standard implementation relies on `st_size`.
Grepping for `st_size`, it seems to be found only in `os_unix.c` (and `os_kv.c`).
Grepping (in `os_unix.c`) for `S_ISREG`, it is found only once, and `S_IFREG` is not found at all.
Thus, based on these observations, if one patches `os_unix.c` to take `S_ISBLK` into account and correctly set `st_size` in the case of block devices, would this work?
Would such a patch / feature be accepted into upstream SQLite?
(4) By Bo Lindbergh (_blgl_) on 2024-09-18 14:55:41 in reply to 3 [link] [source]
How would this code know the correct value of `st_size`?
(5) By Ciprian Dorin Craciun (ciprian.craciun) on 2024-09-18 15:05:14 in reply to 1 [link] [source]
Based on the `st_size` feedback, I've made the following experiment based on the current SQLite release (3.46.1):
- I've patched only `os_unix.c`;
- I've replaced the `fstat` syscall (in the syscall table) with the following `x_fstat` implementation, which, in case the descriptor is a block device, tries an `ioctl` to find the size in blocks and sets `st_size` accordingly;
- I've not replaced the `stat` syscall, because wherever it is used, the `st_size` field does not seem to be accessed; (perhaps I've missed something;)
```c
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKGETSIZE */

int x_fstat(int f, struct stat *s) {
  int r = fstat(f, s);
  if (r != 0 || !S_ISBLK(s->st_mode))
    return r;
  /* Block device: st_size is 0, so derive it from the device size. */
  size_t bs;  /* BLKGETSIZE reports the size in 512-byte sectors */
  r = ioctl(f, BLKGETSIZE, &bs);
  if (r != 0)
    return r;
  s->st_size = bs * 512;
  return 0;
}
```
I've made a very simple test (one select query) to see if this works over a loop block device (i.e. `/dev/loop0`), and it seems to work without a problem.
I think this works only on Linux, and I don't think it's quite the right implementation, because perhaps one needs to use `BLKGETSIZE64` and check whether it overflows `size_t`, etc.
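For reference, a sketch of that safer variant (my own assumption about what "right" would look like, only exercised here via the regular-file fallback): `BLKGETSIZE64` reports the device size directly in bytes as a 64-bit value, so the sector multiplication and its potential overflow disappear; the only remaining check is that the value fits in `off_t`.

```c
#include <stdint.h>
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#ifdef __linux__
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKGETSIZE64 */
#endif

/* Like fstat(), but fills in st_size for Linux block devices. */
int x_fstat64(int f, struct stat *s) {
  int r = fstat(f, s);
  if (r != 0 || !S_ISBLK(s->st_mode))
    return r;
#ifdef __linux__
  uint64_t bytes;  /* BLKGETSIZE64 yields the size in bytes */
  r = ioctl(f, BLKGETSIZE64, &bytes);
  if (r != 0)
    return r;
  if (bytes > (uint64_t)INT64_MAX) {  /* would not fit in off_t */
    errno = EOVERFLOW;
    return -1;
  }
  s->st_size = (off_t)bytes;
#endif
  return 0;
}
```

For regular files this degenerates to a plain `fstat`, so the overhead exists only on the block-device path.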
Could someone tell me how to run the part of the SQLite test suite that pertains only to read-only databases? Although, given that one has to first prepare the database file, then `losetup` it, perhaps it's more complex than the current code-base supports?
BTW, I've also tried to `INSERT` something into it, by using `pragma journal_mode = memory;`, and it seems to work. But this is out of scope for my current question.
(6) By Bo Lindbergh (_blgl_) on 2024-09-18 15:44:23 in reply to 5 [link] [source]
> I think this works only on Linux
Which is why it won't get into any official SQLite version.
(8) By Ciprian Dorin Craciun (ciprian.craciun) on 2024-09-18 16:05:59 in reply to 6 [link] [source]
What OSs should such a patch support to be considered for upstream?
I'm sure most BSD systems support this too. Could OSX and Windows be optional?
(I bet not all features from `os_unix.c` are supported on all targeted systems.)
(11) By Ciprian Dorin Craciun (ciprian.craciun) on 2024-09-18 17:00:56 in reply to 8 [link] [source]
I don't want to debate this too much (namely, whether a feature should be supported only on some OSs), however I've done a quick search for Linux in the `src` folder and found it mentioned mainly in `os_unix.c`, where it seems to enable the following extra features:
- enabling the use of `pread` and `pwrite` in quite a few places for performance reasons;
- enabling the use of `mremap` in a single place as an optimization;
- enabling some very interesting `ioctl` calls for F2FS file-systems on Linux;
Thus perhaps, if such a feature is considered useful enough, another exception might be made?
Alternatively, one could use the POSIX `lseek` function to get the block device size as such:
- either `size = lseek(descriptor, 0, SEEK_END)`, caching `size` for the rest of that file-descriptor's lifetime;
- or `here = lseek(descriptor, 0, SEEK_CUR)` to get the current offset, then `SEEK_END` as above, and finally `lseek(descriptor, here, SEEK_SET)` to seek back; (which has quite an overhead;)
These would be used only if the file descriptor is a block device, thus no extra overhead besides an `if (S_ISBLK(s->st_mode)) { ... }`.
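The second, offset-preserving variant can be sketched as follows (my own illustration, not code from the SQLite tree):

```c
#include <sys/types.h>
#include <unistd.h>

/* Determine the size of whatever fd refers to (regular file or block
   device) without disturbing the current file offset: remember the
   offset, seek to the end, then seek back. Returns -1 on error. */
off_t fd_size_preserving_offset(int fd) {
  off_t here = lseek(fd, 0, SEEK_CUR);  /* remember current offset */
  if (here < 0)
    return -1;
  off_t size = lseek(fd, 0, SEEK_END);  /* end offset == size */
  if (size < 0)
    return -1;
  if (lseek(fd, here, SEEK_SET) < 0)    /* restore the offset */
    return -1;
  return size;
}
```

When `pread`/`pwrite` are in use (as SQLite does on Linux), the current offset is irrelevant anyway, so the simpler cached-`SEEK_END` variant would suffice.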
(7) By ian wild (ianwild) on 2024-09-18 16:01:02 in reply to 1 [link] [source]
You could put your database file on a `squashfs`.

> due to performance and security reasons, I don't want to create a file-system image containing only the database file

I can't see any security problems that `squashfs` introduces over using a raw device and, if my utterly unscientific experiments are any indication, there are no performance problems either.
(9) By Ciprian Dorin Craciun (ciprian.craciun) on 2024-09-18 16:21:39 in reply to 7 [link] [source]
> [...] if my utterly unscientific experiments are any indication, there are no performance problems either.
On the performance front, although I don't have numbers right now, there must be a performance penalty:
- for starters, if the SquashFS image is compressed, just the decompression has a performance cost;
- then, the mere presence of file-system logic (SquashFS in this case) carries some performance penalty compared to plain `read` calls on the raw block device;
However, in my particular use-case the performance isn't hindered by the SquashFS (or any other file-system) overhead, but by the actual workflow (which might not be that unique):
- for starters, in my initial post I've said that the database is immutable / read-only; that is true for the lifetime of the micro-VM;
- however, the actual workflow is something like this: write something to the database, close, start micro-VM, execute read-only queries, loop;
- thus, re-creating a file-system image (even FAT) on each loop implies lots of I/O and other complications;
- (I could loop-mount the image file in the host-OS and change it there, umount and start the micro-VM, but that means I'm forced to use a particular file-system, and other limitations;)
Thus, for such a use-case, being able to export with low overhead a host file into a micro-VM is essential. And fortunately, all VM hypervisors support presenting any arbitrary file as a block device inside the guest VM. (And in particular, Firecracker doesn't support other mechanisms for sharing files with the host.)
Also, I could use something like NFS to expose the file inside the guest VM, but that implies exposing too much of the host to a potential malicious guest. (This is the "security" angle.)
(10) By Bo Lindbergh (_blgl_) on 2024-09-18 16:51:17 in reply to 1 [link] [source]
> However, that would mean injecting code into the micro-VM file-system
How do you execute the query-running program itself without injecting it?
(12) By Ciprian Dorin Craciun (ciprian.craciun) on 2024-09-18 17:21:04 in reply to 10 [link] [source]
> How do you execute the query-running program itself without injecting it?
The customer provides a file-system image which contains the binaries required to execute the queries, together with the supporting libraries. In practical terms, this implies that the customer takes an off-the-shelf Linux distribution, puts his binaries in there, and bundles that up. (Or uses Docker / Podman to prepare a container image.)
What I try to avoid is having to replace the customer's `/usr/lib/libsqlite3.so`, because then I would have to prepare pre-built versions of my patched SQLite for each and every potential Linux distribution and version. That is because `libsqlite3.so` depends on `glibc`, which is not ABI compatible even between versions, let alone distributions. (Or perhaps the user bases his work on Alpine, and now we also have musl in the mix.)
Also, the customer might provide a statically linked binary (which thus embeds SQLite) where I don't even have the option to replace the library with a patched one.
However, strictly back to my particular use-case: for the moment I plan to support only a few interpreted languages (say Python), thus I can easily replace `libsqlite3.so` with my own.
Unfortunately, this limits the flexibility.
Finally, I didn't want to get into the weeds with my particular use-case; I just believe there are quite a few valid use-cases where, because VMs have the ability to expose files as block devices, a lot of opportunities open up for sharing read-only / immutable datasets between VMs.
Or, and this is another use-case I had in mind for a pet-project: in the case of embedded devices, one could have the SQLite database written directly to the SD card (i.e. `/dev/mmcblk0p1`), used to distribute assets or datasets that can be easily swapped by exchanging SD cards. The advantage here is that one can completely eliminate any file-system code (if one doesn't need persistence), thus reducing memory consumption and cycles.
For example, in the case of the Raspberry Pi, one could have the kernel and a mini Linux distribution (as an initramfs) on the first partition, and use the second partition as an SQLite database for the assets. The Raspberry Pi bootloader is the one that reads the kernel and initramfs from the first FAT partition, and from that point onward the kernel doesn't even need a file-system (backed by a block device) if the application is able to read the assets via SQLite from the second partition.
(13) By Code Hz (codehz) on 2024-09-18 17:51:29 in reply to 12 [link] [source]
> Also, the customer might provide a statically linked binary (which thus embeds SQLite) where I don't even have the option to replace the library with a patched one.
This also means that even if the sqlite3 mainline has fixed this problem, customers may still use an older version of sqlite3...
I suggest exploring some other solutions, for example, using virtio-fs instead of virtio-blk/traditional block device. Although virtio-fs may not be as fast as a block device, it is more compatible (and the performance difference is not significant in the read-only case)
(14) By Ciprian Dorin Craciun (ciprian.craciun) on 2024-09-18 18:17:33 in reply to 13 [link] [source]
> I suggest exploring some other solutions, for example, using virtio-fs instead of virtio-blk/traditional block device.
Unfortunately, Firecracker (the VM hypervisor I intend to use due to its security focus and extremely low overhead) doesn't currently support VirtIO-based file-systems, and they don't intend to support it either as per:
- https://github.com/firecracker-microvm/firecracker/issues/1180#issuecomment-671920165
- https://github.com/firecracker-microvm/firecracker/pull/1351 -- "We still think the attack surface implications are large. Even in read-only mode, it's still a lot of code and a lot of new bits the guest can throw at the Firecracker emulation layer." (with which I agree myself;)
Thus, the only options remaining are:
- expose a file-system image that contains the DB in the VM as a block device; (as noted in another comment of mine this is very sub-optimal;)
- expose the database parent folder via something like NFS, which as with the VirtIO-based file-system has serious security implications (especially for multi-tenant scenarios); (in case of NFS, by letting the guest VM interact directly with the kernel of the host;)
- directly expose the database file as block device in the VM; (the subject of this topic; low overhead, as secure as it can be;)
- alternatively, using SSHFS over VSOCK (but without the actual SSH protocol overhead) to expose a FUSE-based file-system in the guest VM; but this is too complex and very inefficient;
- (if FUSE would be an alternative, write a simple FUSE driver that just exposes a block device as a simple file;)
- (have something that wraps SQLite queries as requests over the network, but at this point I can just drop SQLite and use Postgres or something similar;) (and I don't want to let the guest VM to touch the network;)
(15) By Bo Lindbergh (_blgl_) on 2024-09-18 19:09:32 in reply to 14 [link] [source]
Consider an image of an ISO 9660 filesystem containing a single file. The file data starts at a specific offset and extends to the end of the filesystem.
Inside a guest, you can mount a device based on the image and read from the database with minimal filesystem overhead.
On the host, there should be no problems writing a shim VFS for SQLite that operates on the image and exposes the virtual file inside. The image grows and shrinks along with the file; when flushing, the virtual directory entry is updated with the new file size and modification date and the volume header is updated with the new volume size.
In sum, as long as the database size remains under 4 GB, you can have zero-copy file-to-filesystem-image conversion.
(16) By Ciprian Dorin Craciun (ciprian.craciun) on 2024-09-18 19:35:43 in reply to 15 [link] [source]
This is an interesting approach, however as you've said the 4 GiB database size limit is problematic, especially when one has to share large datasets. (I don't know which file-system supports contiguous file-extents in the raw image, but perhaps EROFS might be an option?)
As with regard to the SQLite VFS shim, I think it's perhaps easier (if the host backing file-system supports it, as in the case of Btrfs and (?) bcachefs) to use shared extents, and before each loop just write the file-system header, "paste" the shared extent, and write any final trailers.
Though, at this point (when needing to develop low-level C code), and given that in my case I can control the running Linux kernel, perhaps I can write a simple kernel-native file-system that just exposes the raw block device as a folder containing a single file. (Kind of the reverse of `/dev/loop0`.)
Although, it would be wonderful if SQLite would just allow one to use the block device as a database, especially since the change "seems" trivial. :)
(17) By Stephan Beal (stephan) on 2024-09-18 20:01:05 in reply to 16 [link] [source]
> Although, it would be wonderful if SQLite would just allow one to use the block device as a database, especially since the change "seems" trivial. :)
Though your demonstration does indeed make it seem so, an as-yet-unmentioned aspect of it is testing. It would be impossible to get the required 100% branch test coverage without having a raw-device db in place. The long-term maintenance burden of that, e.g. ensuring availability of such a device on dev machines, would fall on the sqlite team.
(18) By Ciprian Dorin Craciun (ciprian.craciun) on 2024-09-19 04:20:14 in reply to 17 [link] [source]
> It would be impossible to get the required 100% branch test coverage without having a raw-device db in place.
Indeed, the biggest oversight in my patch was testing, because it had none, being just a proof of concept; that's why I initially put "seems" (in quotes). :)
However, I was searching the source code for how one tests the atomic write support in the case of F2FS, and although there are a few places where related features are tested (gated by the feature support detection), I didn't actually find how the test is initiated (as in, how the F2FS mount is supplied).
Thus, assuming there is an already existing environment with an F2FS mount point where the tests are run, a similar setup could be done in this case by having an SQLite db file, `losetup /dev/loop0 ./file.db`, and then perhaps running some of the nolock tests on that database.
(19) By Nuno Cruces (ncruces) on 2024-09-19 06:24:41 in reply to 18 [link] [source]
F2FS may seem obscure, but it's used in millions of Android phones, each with hundreds of SQLite DBs, where this feature makes a significant performance impact.
Given the scale, it wouldn't surprise me if there was sponsorship involved, but even if there wasn't, it is definitely a feature worth supporting.
(20) By Ciprian Dorin Craciun (ciprian.craciun) on 2024-09-19 06:33:47 in reply to 19 [link] [source]
I didn't want to downplay the F2FS support, sorry if it was misinterpreted. I know it's a modern file-system made specifically for flash, and, as you've said, I appreciate that SQLite has support for it, especially since flash-like storage seems to be the future.
I just mentioned it because testing it definitely requires a dedicated environment, similar to what testing the raw block device feature would require.
(21) By Stephan Beal (stephan) on 2024-09-19 08:07:08 in reply to 18 [link] [source]
> I didn't actually find how the test is actually initiated (as in the F2FS being supplied).
The 100% coverage guarantee is for a default configuration, which doesn't include F2FS. It would, however, include block device support, unless a new feature flag were added just for that purpose (which seems superfluous, as it doesn't require any special APIs).
(22) By Nuno Cruces (ncruces) on 2024-09-19 11:19:18 in reply to 1 [link] [source]
Interestingly (to me, probably useless to you), I decided to test this, and it (apparently) just works with my SQLite Go bindings and VFS.
(23) By Ciprian Dorin Craciun (ciprian.craciun) on 2024-09-19 12:10:29 in reply to 22 [link] [source]
That is great to hear! Though I'm wondering: how does it just work in your case?
Reading the README in your repository, it seems you are using a Go-based WASM/WASI runtime, and searching the SQLite repository for `WASI`, I don't think it uses `mmap` (the major feature that seems to be disabled for this target); thus perhaps in this case the engine doesn't even try to look at `st_size`?
(24.2) By Nuno Cruces (ncruces) on 2024-09-19 15:12:19 edited from 24.1 in reply to 23 [link] [source]
It's a custom Wasm build; the VFS is a from-scratch reimplementation in pure Go, no WASI involved.
Shared memory WAL is supported on a few unixes, using `mmap`.
This works, apparently, because I use `lseek` to check the file size, which I can do because I use `pread`/`pwrite` for reading/writing.