SQLite Forum

Feedback on writing a SQLite VFS for a distributed object store

(1) By Patrick Donnelly (pdonnell) on 2021-05-12 02:29:30

Hello,

I've recently completed work on a new SQLite VFS [1,2] for the Ceph distributed storage system. I thought I'd share some feedback on my experience writing the VFS:

- The read API is unfortunately synchronous: every read requires us to wait for the result from an object storage device (OSD). It would be better if SQLite could initiate all of the reads it will need to execute a query (or as many as fit in the page cache) before consuming the results. This would also avoid the inefficiency of bolting readahead onto the VFS layer to compensate.

- Loading a VFS from a statically linked (or compile-time-linked dynamic) library requires awkward API gymnastics [3]. Surely this could be simplified?

- Some VFS members do not seem to be required, but this is not documented. For example, xRandomness and xGetLastError appear not to be required; on the other hand, xCurrentTime is required (SIGSEGV if not present). I don't see a reason for SQLite to require every VFS to supply xCurrentTime when a SQLite-chosen default would be acceptable.

- The xFullPathname documentation could note that it is called once for each attach of a database. Its purpose seems to be canonicalization of the database file name?

- Temporary tables lack any "context" about the database they are associated with. The "temp_store_directory" pragma could have served this purpose, but it is deprecated. For a distributed storage solution, we need some context about where to store transient tables (namely, which storage pool). Right now, we force the user to put temporary tables in memory [4].

- The xSectorSize appears to have no effect on the size of reads/writes performed on the database. We have been advising use of the page_size pragma instead [5].

- Journal persistence (PERSIST) is a great optimization for this VFS. It would be useful if we could hint that this should be the default, without requiring the pragma on every db connection.

- It would be fantastic to run some of SQLite's tests in our own upstream CI, but switching to an alternate VFS appears to be a non-trivial change for all the open source tests I examined.

Except for these issues, the new VFS works pretty well. Thank you for your work on SQLite!

[1] https://ceph.io/community/new-in-pacific-sql-on-ceph/
[2] https://docs.ceph.com/en/pacific/rados/api/libcephsqlite/
[3] https://github.com/ceph/ceph/blob/36ba4d71d04493c7742d5f6cf38e0d17dcb5a9e1/src/include/libcephsqlite.h#L33-L56
[4] https://docs.ceph.com/en/pacific/rados/api/libcephsqlite/#temporary-tables
[5] https://docs.ceph.com/en/pacific/rados/api/libcephsqlite/#page-size

(2) By David Jones (vman59) on 2021-05-14 21:20:54 in reply to 1

For my custom VFS, I just patched the sqlite3_os_init function to do some additional setup. I also patched sqlite3Pragma() so the VFS can return a table value in response to the pragma file control.

(3) By mlin (dnamlin) on 2021-05-14 23:47:32 in reply to 1

Nice work!

> The read API is unfortunately synchronous: every read requires us to wait for the read result from an object storage device (OSD). It would be better if SQLite could initiate every or as-many-as-can-fit-in-page-cache reads it requires to execute a query before using the results.

This sounds like a tall order to me in terms of the replumbing of SQLite that would be needed; but one complementary idea I've idly wondered about (for sqlite_web_vfs): could we gain something by aggressively prefetching all of the "interior" b-tree pages, supposing we kept a list of their page numbers stashed somewhere?

I've been meaning to run some experiments to convince myself quantitatively whether that's a good idea, i.e. whether they're a small enough proportion of the database file, and whether having them all upfront would save enough serial roundtrips. LMK if you'd like to join forces on that.

It would also be neat if there were a way to make the pager try to store interior pages contiguously, which (correct me if I'm wrong) I don't think is a goal right now, even for vacuum.

(4) By Patrick Donnelly (pdonnell) on 2021-05-16 18:19:17 in reply to 3

> This sounds like a tall order to me

Yes, it may be too difficult, but there are other things SQLite could do that would not be as invasive. For example, it could issue priming reads for a range of pages it will need in the future, and let the VFS asynchronously fetch/cache those pages for a future blocking read. That would neatly avoid any issues with asynchronous completions, but I don't know how hard it would be to adapt SQLite to that model. All I do know is that trying to implement some kind of readahead in the VFS would be unlikely to yield the benefits we would like.
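To make the shape of that model concrete, here is a toy sketch: prime_pages() stands in for the hypothetical hint from SQLite, and osd_fetch() fakes the OSD round trip. All of the names are invented for illustration; none of this is an existing SQLite or Ceph interface.

```c
#include <string.h>

#define PAGE_SIZE   4096
#define CACHE_SLOTS 8

/* Toy VFS-side prefetch cache. prime_pages() would be driven by a
 * (hypothetical) hint from SQLite; vfs_read() then serves the later
 * blocking read from cache instead of a network round trip. */
typedef struct {
    long pgno[CACHE_SLOTS];             /* cached page numbers, -1 = empty */
    char data[CACHE_SLOTS][PAGE_SIZE];
} PrefetchCache;

static void cache_init(PrefetchCache *c) {
    for (int i = 0; i < CACHE_SLOTS; i++) c->pgno[i] = -1;
}

/* Stand-in for an (asynchronous) OSD fetch: fills the page with a
 * value derived from the page number so tests can check it. */
static void osd_fetch(long pgno, char *out) {
    memset(out, (int)(pgno & 0xff), PAGE_SIZE);
}

/* Hint: fetch pages [first, first+n) ahead of time. */
static void prime_pages(PrefetchCache *c, long first, int n) {
    for (int i = 0; i < n && i < CACHE_SLOTS; i++) {
        c->pgno[i] = first + i;
        osd_fetch(first + i, c->data[i]);
    }
}

/* Blocking read: served from cache when primed, otherwise fetched. */
static void vfs_read(PrefetchCache *c, long pgno, char *out, int *hit) {
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->pgno[i] == pgno) {
            memcpy(out, c->data[i], PAGE_SIZE);
            *hit = 1;
            return;
        }
    }
    osd_fetch(pgno, out);   /* cache miss: pay the round trip now */
    *hit = 0;
}
```

In a real VFS the prime_pages() loop would dispatch non-blocking reads and the cache would need locking and eviction, but the control flow is the same: hints fill the cache, blocking reads drain it.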

I'm not knowledgeable enough about SQLite internals to help with this project (sorry). This will probably be as deep as I go. :)

(5) By Keith Medcalf (kmedcalf) on 2021-05-16 18:52:31 in reply to 4

In modern Operating Systems (that is, anything developed since 1970) this has been the job of the Operating System.

While it may appear to the lay observer that duplicating this capability in each and every application programme might reduce unemployment, most programmes severely lack the capability to do it well, and attempts to "do optimization at the wrong level" have always resulted in disaster.

(7) By Kees Nuyt (knu) on 2021-05-16 22:42:43 in reply to 5

> In modern Operating Systems (that is, anything developed since 1970) this has been the job of the Operating System.

As a former systems programmer, I totally agree with you. And I support your reasoning.

(8) By mlin (dnamlin) on 2021-05-16 23:02:31 in reply to 5

We might rephrase OP's idea as asking whether the btree pager could usefully send posix_fadvise(..., POSIX_FADV_WILLNEED, ...) hints through the VFS. I doubt it, for the reasons I've elaborated, but it is a reasonable concept that is in fact implemented by a lot of databases.
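On a plain-file VFS, such a hint could be as simple as forwarding the page range to the kernel's readahead machinery. A minimal sketch (hint_pages() is an invented helper; the offset math assumes SQLite's 1-based page numbering):

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <unistd.h>

/* Invented helper: tell the kernel we will soon need npages pages
 * starting at first_pgno, so it can start readahead now. */
static int hint_pages(int fd, long first_pgno, int npages, int page_size) {
    off_t off = (off_t)(first_pgno - 1) * page_size;  /* pages are 1-based */
    off_t len = (off_t)npages * page_size;
    return posix_fadvise(fd, off, len, POSIX_FADV_WILLNEED);
}
```

The call is purely advisory: it returns immediately and later pread() calls on the same range may then hit the page cache.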

(6) By mlin (dnamlin) on 2021-05-16 22:34:38 in reply to 4

> it could do priming reads for a range of pages it will need in the future and let the VFS asynchronously fetch/cache those reads for a future blocking read.

This is kind of what I was getting at with my suggestion of prefetching interior b-tree pages: having them upfront should greatly reduce the need for serial chains of network roundtrips to navigate the database file.

The problem (AFAIK; I'd be delighted to be corrected) is that those pages are scattered arbitrarily throughout the database file, without any tendency to contiguity, even after vacuum. So, short of changing the b-tree and pager implementation to promote that, we'd have to maintain somewhere a consistent list of the desired page numbers, and preferably even a full copy of their content, to be fetched upfront.

> All I do know is that trying to implement some kind of readahead in the VFS would be unlikely to yield the benefits we would like.

My experience with sqlite_web_vfs and sqlite_zstd_vfs is that readahead does work great for the cases it should obviously help, namely sequential scans of large tables in vacuumed database files. But the "time to first result row" can be pretty bad, due to the aforementioned need to serially read numerous scattered pages in order to navigate to the part of the database file that will be read sequentially.

  • https://github.com/mlin/sqlite_web_vfs
  • https://github.com/mlin/sqlite_zstd_vfs