LPC-2019: SQLite on Linux

This is a briefing on SQLite intended for Linux kernel hackers, and especially those working on Linux filesystems.

Written in a terse format for rapid skimming
Contact drh@sqlite.org with questions or more detail

1.0 SQLite Is Not Like Other Database Engines

SQLite is not a separate process or thread. SQLite is a subroutine. SQLite is embedded in the application and uses the same heap and stack.
- Because no IPC is involved, SQLite has low latency. Reading or writing many small blobs using SQLite is faster than reading/writing those blobs from separate files on disk. [1] [2]
An SQLite database is a single ordinary file on disk.
- An additional transient journal file may appear from time to time to help implement transactions that are atomic across crashes and power failures.
- Database sizes can range from 512 bytes up to 140,735,340,740,610 bytes. Most databases are a few megabytes to a few gigabytes in size, though terabyte-sized SQLite databases are known to be used in production.
- The database file format is stable, well-defined, well-documented, and cross-platform. SQLite databases are commonly used as storage containers for sending structured content across the internet.
Other database engines are usually found in the datacenter. SQLite is more commonly seen at the edge of the network.
- Because the network edge is so broad, there are a vast number SQLite databases in active use - probably over one trillion (1e12).
- A typical Android phone has hundreds of SQLite databases and does more than 5 gigabytes of database I/O per day.
Further information

2.0 Filesystem Properties That SQLite Would Like To Know About

SQLite works unsupervised, in whatever environment it is handed. SQLite does not get to choose a particular filesystem type or specific mount options. There is no configuration file available to tell SQLite about the capabilities of system on which it is running. SQLite needs to work well on a USB memory stick with a DOS filesystem up to an enterprise-class SSD with battery back-up and the latest filesystem, and everything in between.

To help it work most efficiently, SQLite needs to know attributes of the environment in which it is running. And because it lacks a configuration file, SQLite needs to discover these attributes on its own, at run-time. Some attributes of the system that SQLite would like to know about include:

Sector size.
Are sector writes atomic? And if so, for what size sectors?
If a power loss occurs at about the same time that a file is being extended with new data, will the file be guaranteed to contain valid data after reboot, or might the extended area of the file contain all zeros or all ones or arbitrary content? In other words, is the file data always committed to disk ahead of the file size?
If a power loss occurs at about the same time as a file truncation, is it possible that the truncated area of the file will contain arbitrary data after reboot? In other words, is the file size guaranteed to be committed to disk before the data sections are released?
Is the file immutable. In other words, is it read-only even for root-privileged processes?
Can a mmap() of the file be used for shared memory by all processes that have access to the file. (This is false for network filesystems. So the question is approximately the same as "is the file on a network filesystem?")
If a write occurs on one or two bytes of a file at about the same time as a power loss, are other bytes of the file guaranteed to be unchanged after reboot? Or might some other bytes within the same sector have been modified as well?
When you create a new file, write to it, and fdatasync() successfully, is it also necessary to open and fsync() the containing directory in or to ensure that the file will still be there following reboot from a power loss?
Has a file been unlinked or renamed since it was opened? (SQLite accomplishes this now by remembering the device and inode numbers obtained from fstat() and comparing them against the results of subsequent stat() calls against the original filename. Is there a more efficient way to do this?)
Has a particular file been created since the most recent reboot?
Is it possible (or helpful) to tell the filesystem that the content of a particular file does not need to survive reboot?
Is it possible (or helpful) to tell the filesystem that a particular file can be unlinked upon reboot?
Is it possible (or helpful) to tell the filesystem about parts of the database file that are currently unused and that the filesystem can zero-out without harming the database?

3.0 Transactions

Like all relational databases, SQLite needs to implement transactions that are ACID (Atomic, Consistent, Isolated, and Durable) even if the application crashes (for example by SIGKILL) or if there is an unexpected power loss. This is achieved in one of three ways:

A rollback journal
A write-ahead log
Using atomic-write capabilities of the underlying filesystem.

The rollback journal approach is the slowest, but also the most likely to work on any filesystem. Rollback is therefore the default. The write-ahead log (WAL mode) is faster, but only works on systems for which either (1) only a single process accesses the database at a time or (2) a mmap()-ed file can be used as shared memory.

The third option is the fastest but is currently only supported for the F2FS filesystem on Linux.

Each database file is either in rollback-mode or in WAL-mode. All processes access the database file according to its defined mode. Changing the mode of a database requires exclusive access to the database. The F2FS atomic write capability (option 3) is an extension of rollback mode (option 1).

3.1 Transactions using a rollback journal

An SQLite database is a sequence of one or more "pages". All pages in the same database file are the same size. But for different databases, the page size can be any power of two between 512 and 65536. A single change to the database typically involves modifications to multiple pages. This is done atomically as follows:

Creating a rollback journal file in the same directory as the database and having the same name except with the "-journal" extension added.
The original content of any database page that is to be modified is written into the rollback journal.
Call fdatasync() on the rollback journal.
Open the directory that contains the rollback journal and fdatasync() that directory, to ensure that the filename will exist following a power loss
Acquire an exclusive lock on the database file
Barrier - All of the above must complete before any of the following.
Make changes to the database file.
Call fdatasync() on the database file. If the database file decreases in size, also call ftruncate().
Barrier - All of the above must complete before any of the following.
Commit the transaction by doing one of:
1. Unlink the rollback journal
2. Truncate the rollback journal to zero bytes in size
3. Overwrite the first 8 bytes of the rollback journal with zeros
Drop the exclusive lock

Notes:

On step (1), the rollback journal and the database file are kept in the same directory in order to ensure that they are on the same volume, and thus do not become separated from one another following a reboot.
At step (10), SQLite implicitly assumes that each of the three commit operations (unlink, truncate, or header-overwrite) are atomic.
All writes to the database file are an integer number of pages and are page-aligned.
Writes to the rollback journal are appends or linear overwrites, except for overwriting the header at step 10c.

3.2 Recovery Of A Rollback Journal Following A Crash

When initially opening the database file, check for the presence of a well-formed rollback journal. If not found → Done.
Play back the rollback journal into the database file.
Call fdatasync() on the database file.
Barrier - All of the above must complete before any of the following.
Disable the rollback journal using one of the following:
1. Unlink the rollback journal
2. Truncate the rollback journal to zero bytes in size
3. Overwrite the first 8 bytes of the rollback journal with zeros

3.3 Transactions using write-ahead log

Append modified pages to the write-ahead log file (the "WAL file"). The WAL file is a file in the same directory as the database and with the same name as the database but with "-wal" appended.
Append a "commit mark"
Call fdatasync() on the WAL file to commit the transaction

Notes:

Step 3 can be omitted with the consequence that transactions are no longer durable. In other words, a transaction that was reported as having committed might rollback following a power failure. Most applications are cool with losing the last few transactions on a power loss as long as the database is still well-formed and consistent following reboot.
"Append" in the above might mean "linear overwrite".
A separate "-shm" file is mmap()-ed by all processes that want to access the database. The "-shm" file contains a hash table used to quickly locate pages previously written into the "-wal" file.
The last process to close its connection to the database will automatically run a checkpoint, then unlink the WAL file and the "-shm" file, assuming the last process shuts down cleanly. If the last process accessing the database simply exit()s without invoking sqlite3_close() or if the process crashes or if there is a power loss, the "-wal" and "-shm" files are left lingering on disk. The left-over "-wal" and "-shm" files will be cleaned up by the next process to open the database.

3.4 Checkpointing A Write-Ahead Log

Call fdatasync() on the "-wal" file.
Wait for all concurrent readers to stop using pages in the main database that have changes in the WAL file.
Barrier - All of the writes to the WAL file must complete before any subsequent writes to the database file.
Write the most recent change for every page in the WAL file back into the database file.
- Pages are sorted so that they are written in increasing order. → Is this helpful on Linux? Or can we just as well skip the sort and write the pages in any arbitrary order?
Call fdatasync() on the database file.
Barrier - All of the above must complete before any of the following.
Truncate the WAL file, or otherwise cause the WAL file to start over again at the beginning.
- We have seen that overwriting an existing file is faster than truncating and appending. Is that always the case?

Notes:

The default action is that the first process that runs a commit that causes the WAL file to grow beyond 1000 pages runs a checkpoint. However, this behavior can be changed by the application. The "auto-checkpoint" threshold can be raised or lowered. Or a system can dedicate a separate thread or process to run periodic checkpoints.
Checkpoints can block on step 2. Or, then can return early, having only done a subset of their work. Non-blocking is the default.

3.5 Recovery Of A Write-Ahead Log Following A Crash

Rebuild the "-shm" file by scanning the "-wal" file.

3.6 Transactions using the atomic-write capabilities of F2FS

F2FS is a log-structured filesystem for Linux that has (limited) atomic write capabilities. SQLite is able to use the atomic write capabilities of F2FS to bypass the rollback journal. Anecdotal reports are that an Android phone that is reformatted to use F2FS is noticeably faster. F2FS make an old tired phone feel like a perky new phone.

The atomic write capability in F2FS is implemented by three ioctl()s:

F2FS_IOC_START_ATOMIC_WRITE (hereafter "F2FS-BEGIN") - begins a new transaction.
F2FS_IOC_COMMIT_ATOMIC_WRITE (hereafter "F2FS-COMMIT") - commits a block of changes atomically.
F2FS_IOC_ABORT_VOLATILE_WRITE (hereafter "F2FS-ROLLBACK") - rollback all changes for the transaction and restore the file descriptor to the state it was in prior to F2FS-BEGIN.

The F2FS atomic write capability is only useful for databases that would otherwise be in rollback mode. The atomic write capability is not helpful for WAL-mode databases.

Transactions using F2FS atomic write proceed as follows:

Start a new transaction by invoking F2FS-BEGIN. If that ioctl() fails, fallback to using a rollback journal.
Write changes directly into the database file. If any write fails, SQLite will abort the transaction by invoking F2FS-ROLLBACK and restart the transaction using a rollback journal.
If the application requests a transaction abort by issuing an SQL "ROLLBACK" statement, then SQLite aborts the atomic write by invoking the F2FS-ROLLBACK ioctl().
Commit the transaction by invoking the F2FS-COMMIT ioctl(). If that ioctl() fails, SQLite will retry the transaction using a traditional rollback journal.

Notes:

Writes issued on a file descriptor for which a transaction has been started using F2FS-BEGIN must never be visible on any other file descriptor prior to F2FS-COMMIT.
- If there is a power loss or other system crash prior to F2FS-COMMIT, then none of the uncommitted changes should appear in the file after reboot.
- If the file descriptor on which the F2FS-BEGIN was issued closes before F2FS-COMMIT, then all changes on that file descriptor since the last F2FS-BEGIN are discarded.
The kernel can fail any operation after an F2F2-BEGIN and SQLite will recover gracefully by cancelling the transaction (using F2FS-ROLLBACK) and restarting the transaction using a traditional rollback journal.
The application can abort a transaction at any time. SQLite will invoke F2FS-ROLLBACK and expect the system to forget any changes from writes to that file descriptor back through the most recent F2FS-BEGIN.
The result of write() calls must be visible to subsequent read() calls on the same file descriptor that issued F2FS-BEGIN. But those writes must be invisible to read()s from separate file descriptors until after F2FS-COMMIT.
Following F2FS-ROLLBACK, all read() calls must return the same data that they would have returned if none of the write() calls back to the most recent F2FS-BEGIN had never occurred.

4.0 Potential Enhancements

Performance of SQLite on Linux is already very fast. However, it might be improved further with new Kernel interfaces.

4.1 fbarrier()

SQLite uses fsync() or fdatasync() to ensure I/O operations are complete. However, if an application is willing to forego Durability (The "D" in "ACID") following a power loss then many of these fdatasync() calls could be replaced by a hypothetical fbarrier(). This would be useful in each spot marked above by "Barrier".

In order to be useful, the hypothetical fbarrier() operation must:

Sequence operations on at least two file descriptors. For example, writes to the rollback journal must be complete before any writes to the database begin.
Apply to all operations on the files in question, including write(), ftruncate(), and unlink().

4.2 Designate unused portions of a file

When an application deletes content from an SQLite database, SQLite does not normally overwrite the deleted content, but merely remembers that the space is available for reuse. This avoids unnecessary writes. (Exception: Sometimes an application actually wants content to be overwritten to avoid forensic traces, and SQLite supports this upon request. For example, Firefox puts SQLite into the mode where it overwrites deleted content with zeros when you clear your search history.)

SQLite could tell the filesystem about regions of the file that are unused and do not need to be preserved.

Read()s from regions designated as unused may return all-zeros, or all-ones, or arbitrary bits, and SQLite will not care.
The first write() to an unused region will repeal the "unused" status of that region.