*** DRAFT ***
WAL-mode File Format

This document describes low-level details on how WAL mode is implemented on unix and windows.

The separate file format description provides details on the structure of a database file and of the write-head log file used in WAL mode. But details of the locking protocol and of the format of the WAL-index are deliberately omitted since those details are left to discretion of individual VFS implementations. This document fills in those missing details for the unix and windows VFSes.

For completeness, some of the higher level formatting information contains in the file format document and elsewhere is replicated here, when it pertains to WAL mode processing.

1. Files On Disk

When in active use, the state of a WAL mode database is described by three separate files:

  1. The main database file with an arbitrary name "X".
  2. The write-ahead log file, usually named "X-wal".
  3. The wal-index file, usually named "X-shm".

1.1. The Main Database File

The format of the main database file is as described in the file format document. The file format version numbers at offsets 18 and 19 into the main database must both be 2 to indicate that the database is in WAL mode. The main database may have an arbitrary name allowed by the underlying filesystem. No special file suffixes are required, though ".db", ".sqlite", and ".sqlite3" seem to be popular choices.

1.2. The Write-Ahead-Log or "-wal" File

The write-ahead log or "wal" file is a roll-forward journal that records transactions that have been committed but not yet applied to the main database. Details on the format of the wal file are describe in the WAL format subsection of the main file format document. The wal file is named by appending the four characters "-wal" to the end of the name of the main database file. Except on 8+3 filesystems, such names are not allowed, and in that case the file suffix is changed to ".WAL". But as 8+3 filesystems are increasingly rare, that exceptional case can usually be ignored.

1.3. The Wal-Index or "-shm" file

The wal-index file or "shm" file is not actually used as a file. Rather, individual database clients mmap the shm file and use it as shared memory for coordinating access to the database and as a cache for quickly locating frame within the wal file. The name of the shm file is the main database file name with the four characters "-shm" appended. Or, for 8+3 filesystems, the shm file is the main database file with the suffix changed to ".SHM".

The shm does not contain any database content and is not required to recover the database following a crash. For that reason, the first client to connect to a quiescent database will normally truncate the shm file if it exists. Since the content of the shm file does not need to be preserved across a crash, the shm file is never fsync()-ed to disk. In fact, if there were a mechanism by which SQLite could tell the operating system to never persist the shm file to disk but always hold it in cache memory, SQLite would use that mechanism to avoid any unnecessary disk I/O associated with the shm file. However, no such mechanism exists in standard posix.

Because the shm is only used to coordinate access between concurrent clients, the shm file is omitted if exclusive locking mode is set, as an optimization. When exclusive locking mode is set, SQLite uses heap memory in place of the memory-mapped shm file.

1.4. File Lifecycles

When a WAL mode database is in active use, all three of the above files usually exist. Except, the Wal-Index file is omitted if exclusive locking mode is set.

If the last client using the database shuts down cleanly by calling sqlite3_close(), then a checkpoint is run automatically in order to transfer all information from the wal file over into the main database, and both the shm file and the wal file are unlinked. Thus, when the database is not in use by any client, it is usually the case that only the main database file exists on disk. However, if the last client did not call sqlite3_close() before it shut down, or if the last client to disconnect was a read-only client, then the final cleanup operation does not occur and the shm and wal files may still exist on disk even when the database is not in use.

1.5. Variations

When PRAGMA locking_mode=EXCLUSIVE (exclusive locking mode) is set, only a single client is allowed to have the database open at one time. Since only a single client can use the database, the shm file is omitted. The single client uses a buffer in heap memory as a substitute for the memory-mapped shm file.

If a read/write client invokes sqlite3_file_control(SQLITE_FCNTL_PERSIST_WAL) prior to shutdown, then at shutdown a checkpoint is still run, but the shm file and wal file are not deleted. This allows subsequent read-only clients to connect to and read the database.

2. The WAL-Index File Format

The WAL-index or "shm" file is used to coordinate access to the database by multiple clients, and as a cache to help clients quickly locate frames within the wal file.

Because the shm file is not involved in recovery, the shm file does not need to be machine byte-order independent. Hence, numeric values in the shm file are written in the native byte order of the host computer, rather than being converted into a specific cross-platform byte order as is done with the main database file and the wal file.

The shm file consists of one or more hash tables, where each hash table is 32768 bytes in size. Except, a 136-byte header is carved out of the front of the very first hash table, so the first hash table is only 32632 bytes in size. The total size of the shm file is always a multiple of 32768. In most cases, the total size of the shm file is exactly 32768 bytes. The shm file only needs to grow beyond a single hash table if when the wal file grows very large (more than 4079 frames). Since the default automatic checkpoint threshold is 1000, WAL files rare reach the 4079 threshold needed to make the shm file grow.

2.1. The WAL-Index Header

The first 136 bytes of the shm file are a header. The shm header has three main divisions as follows:

WAL-Index Header Divisions
BytesDescription
0..47First copy of the WAL Index Information
48..95Second copy of the WAL Index Information
96..135Checkpoint Information and Locks

Individual fields of the shm header, except for the salt values copied from the WAL header, are unsigned integers in the native byte-order of the host machine. The salt values are exact copies from the WAL header and are in whatever byte order is used by the WAL file. The size of integers may be 8, 16, 32, or 64 bits. A detailed breakout of the individual fields of the shm header follows:

WAL-Index Header Details
BytesNameMeaning
0..3iVersion The WAL-index format version number. Always 3007000.
4..7  Unused padding space. Must be zero.
8..11iChange Unsigned integer counter, incremented with each transaction
12isInit The "isInit" flag. 1 when the shm file has been initialized.
13bigEndCksum True if the WAL file uses big-ending checksums. 0 if the WAL uses little-endian checksums.
14..15szPage The database page size in bytes, or 1 if the page size is 65536.
16..19mxFrame Number of valid and committed frames in the WAL file.
20..23nPage Size of the database file in pages.
24..31aFrameCksum Checksum of the last frame in the WAL file.
32..39aSalt The two salt value copied from the WAL file header. These values are in the byte-order of the WAL file, which might be different from the native byte-order of the machine.
40..47aCksum A checksum over bytes 0 through 39 of this header.
48..95  A copy of bytes 0 through 47 of this header.
96..99nBackfill Number of WAL frames that have already been backfilled into the database by prior checkpoints
100..119read-mark[0..4] Five "read marks". Each read mark is a 32-bit unsigned integer (4 bytes).
120..127  Unused space set aside for 8 file locks.
128..132nBackfillAttempted Number of WAL frames that have attempted to be backfilled but which might not have been backfilled successfully.
132..136  Unused space reserved for further expansion.

2.1.1. The mxFrame field

The 32-bit unsigned integer at offset 16 (and repeated at offset 64) is the number of valid frames in the WAL. Because WAL frame are numbered starting with 1, mxFrame is also the index of the last valid commit frame in the WAL. A commit frame is a frame that has a non-zero "size of database" value in bytes 4 through 7 of the frame header, and that indicates the end of a transaction.

When mxFrame field is zero, it indicates that the WAL is empty and that all content should be obtained directly from the database file.

When mxFrame is equal to nBackfill, that indicates that all content in the WAL has been written back into the database. In that case, all content can be read directly from the database. Furthermore, the next writer is free to reset the WAL if no other connections hold locks on WAL_READ_LOCK(N) for N>0.

The mxFrame value is always greater than or equal to both nBackfill and nBackfillAttempted.

2.1.2. The nBackfill field

The 32-bit unsigned integer at offset 128 in the WAL-index header is called the "nBackfill". this field holds the number of frames in the WAL file which have been copied back into the main database.

The nBackfill number is never greater than mxFrame. When nBackfill equals mxFrame, that means that the WAL content has been completely written back into the database and it is ok to reset the WAL if there are no locks held on any of WAL_READ_LOCK(N) for N>0.

The nBackfill can only be increased while holding the WAL_CKPT_LOCK. However, nBackfill is changed to zero during a WAL reset, and this happens while holding the WAL_WRITE_LOCK.

2.1.3. WAL Locks

Eight bytes of space are set aside in the header to support file locking using the xShmLock() method in the sqlite3_io_methods object. These eight bytes are never read nor written by SQLite since some VFSes (ex: Windows) might implement locks using mandatory file locks.

These are the eight locks supported:

WAL-Index Locks Controlled By xShmLock()
NameOffset
xShmLockFile
WAL_WRITE_LOCK 0 120
WAL_CKPT_LOCK 1 121
WAL_RECOVER_LOCK 2 122
WAL_READ_LOCK(0) 3 123
WAL_READ_LOCK(1) 4 124
WAL_READ_LOCK(2) 5 125
WAL_READ_LOCK(3) 6 126
WAL_READ_LOCK(4) 7 127

TBD: More information about the header

2.2. WAL-Index Hash Tables

The hash tables in the shm file are designed to answer the following question quickly:

FindFrame(P,M): Given a page number P and a maximum WAL frame index M, return the largest WAL frame index for page P that does not exceed M, or return NULL if there are no frames for page P that do not exceed M.

Let the datatypes "u8", "u16", and "u32" mean unsigned integers of length 8, 16, and 32 bits, respectively. Then, the first 32768-byte unit of the shm file is organized as follows:

u8 aWalIndexHeader[136];
u32 aPgno[4062];
u16 aHash[8192];

The second and all subsequent 32768-byte units of the shm file are like this:

u32 aPgno[4096];
u16 aHash[8192];

Collectively, the aPgno entries record the database page number stored in all frames of the WAL file. The aPgno[0] entry on the first hash table records the database page number stored in the very first frame in the WAL file. The aPgno[i] entry from the first hash table is the database page number for the i-th frame in the WAL file. The aPgno[k] entry for the second hash table is the database page number for the (k+4062)-th frame in the WAL file. The aPgno[k] entry for the n-th 32768-byte hash table in the shm file (for n>1) holds the database page number stored in the (k+4062+4096*(n-2))-th frame of the WAL file.

Here is a slightly different way to describe the aPgno values: If you think of all aPgno values as a contiguous array, then the database page number stored in the i-th frame of the WAL file is stored in aPgno[i]. Of course, aPgno is not a contiguous array. The first 4062 entries are on the first 32768-byte unit of the shm file and subsequent values are in 4096 entry chunks in later units of the shm file.

One way to compute FindFrame(P,M) would be to scan the aPgno array starting with the M-th entry and working backwards towards the beginning and return J where aPgno[J]==P. Such an algorithm would work, and it would be faster than searching the whole WAL file for the latest frame with page number P. But the search can be made much faster still by using the aHash structure.

A database page number P is mapped into a hash value using the following hash function:

h = (P * 383)%8192

This function maps every page number into an integer between 0 and 8191 inclusive. The aHash field of each 32768-byte shm file unit maps P values into indexes of the aPgno field of the same unit as follows:

  1. Compute the hash value: h = P * 383
  2. Let X be the largest set of consecutive integers {h, h+1, h+2, ..., h+N} such that for every j in X, aPgno[j%8192]!=0. The X set will be empty if aPgno[h%8192]==0. The X set is easily computed by starting with the value h%8192, and adding h%8192 to X and incrementing h until encountering the first aPgno[h%8192] entry that is zero.
  3. The set X contains the index in aPgno of every entry in the current 32768-byte unit of the shm file that might possible be a solution to the FindFrame(P,M) function. Each of these entries must be checked separately to ensure that the aPgno value is P and that the frame number does not exceed M. The largest frame number that passes those two tests is the answer.

Each entry in the aPgno array has a single corresponding entry in the aHash array. There are more available slots in aHash than there are in aPgno. The unused slots in aHash are filled with zero. And since there are guaranteed to be unused slots in aHash, that means the loop that computes X is guaranteed to terminate. The expected size of X is less than 2. The worst case is that X will be the same as the number of entries in aPgno, in which case the algorithm runs at about the same speed as a linear scan of aPgno. But that worst case performance is exceedingly rare. Usually, the size of X will be small and the use of the aHash array allows one to compute FindFrame(P,M) much faster.

Here is an alternative way of describing the hash look-up algorithm: Start with h = (P * 383)%8192 and look at aHash[h] and subsequent entries, wrapping around to zero when h reaches 8192, until finding an entry with aHash[h]==0. All aPgno entries having a page number of P will have an index that is one of the aHash[h] values thusly computed. But not all the computed aHash[h] values will meet the matching criteria, so you must check them independently. The speed advantage comes about because normally this set of h values is very small.

Note that each 32768-byte unit of the shm file has its own aHash and aPgno arrays. The aHash array for a single unit is only helpful in finding aPgno entries in that same unit. The overall FindFrame(P,M) function needs to do hash lookups beginning with the latest unit and working backwards to the oldest unit until it finds an answer.

2.3. Locking Matrix

Access is coordinated in WAL mode using both the legacy DELETE-mode locks controlled by the xLock and xUnlock methods of the sqlite3_io_methods object and the WAL locks controlled by the xShmLock method of the sqlite3_io_methods object.

Conceptually, there is just a single DELETE-mode lock. The DELETE-mode lock for a single database connection can be in exactly one of the following states:

  1. SQLITE_LOCK_NONE (unlocked)
  2. SQLITE_LOCK_SHARED (reading)
  3. SQLITE_LOCK_RESERVED (reading, waiting to write)
  4. SQLITE_LOCK_PENDING (new readers blocked, waiting to write)
  5. SQLITE_LOCK_EXCLUSIVE (writing)

The DELETE-mode locks are stored on the lock-byte page of the main database file. Only SQLITE_LOCK_SHARED and SQLITE_LOCK_EXCLUSIVE are factors for WAL-mode databases. The other locking states are used in rollback-mode, but not in WAL-mode.

The WAL-mode locks are described above.

2.3.1. How the various locks are used

The following rules show how each of the locks is used.

2.3.2. Operations that require locks and which locks those operations use

3. Recovery

Recovery is the process of rebuilding the WAL-index so that it is synchronized with the WAL.

Recovery is run by the first thread to connect to a WAL-mode database. Recovery restores the WAL-index so that it accurately describes the WAL file. If there is no WAL file present when the first thread connects to the database, there is nothing to recover, but the recovery process still runs to initialize the WAL-index.

If the WAL-index is implemented as a memory-mapped file and that file is read-only to the first thread to connect, then that thread creates an private heap-memory ersazt WAL-index and runs the recovery routine to populate that private WAL-index. The same data results, but it is held privately rather that being written into the public shared memory area.

Recovery works by doing a single pass over the WAL, from beginning to end. The checksums are verified on each frame of the WAL as it is read. The scan stops at the end of the file or at the first invalid checksum. The mxFrame field is set to the index of the last valid commit frame in WAL. Since WAL frame numbers are indexed starting with 1, mxFrame is also the number of valid frames in the WAL. A "commit frame" is a frame that has a non-zero value in bytes 4 through 7 of the frame header. Since the recovery procedure has no way of knowing how many frames of the WAL might have previously been copied back into the database, it initializes the nBackfill value to zero.

During recovery of the global shared-memory WAL-index, exclusive locks are held on WAL_WRITE_LOCK, WAL_CKPT_LOCK, WAL_RECOVER_LOCK, and WAL_READ_LOCK(1) through WAL_READ_LOCK(4). In other words, all locks associated with the WAL-index except for WAL_READ_LOCK(0) are held exclusively. This prevents any other thread from writing the database and from reading any transactions that are held in the WAL, until the recovery is complete.

This page last modified on 2022-01-08 05:02:57 UTC

*** DRAFT ***