The "server-process-edition" branch contains two modifications to stock SQLite that work together to provide concurrent read/write transactions using pessimistic page-level-locking. The system runs in two modes:
- Single-process mode, where all clients must be within the same address space, and
- Multi-process mode, where clients may be distributed between multiple OS processes.
The system is designed to be most efficient when used with "PRAGMA synchronous=OFF", although it does not require this.
Up to 16 simultaneous read/write transactions controlled by page-level-locking are possible. Additionally, in single-process mode there may be any number of read-only transactions started using the "BEGIN READONLY" command. Read-only transactions do not block read-write transactions, and read-write transactions do not block read-only transactions. Read-only transactions access a consistent snapshot of the database - writes committed by other clients after the transaction has started are never visible to read-only transactions. In multi-process mode, the "BEGIN READONLY" command is equivalent to a stock "BEGIN".
The two features on this branch are:
- An alternative layout for the database free-page list. This is intended to reduce contention between writers when allocating new database pages, either from the free-list or by extending the database file.
- The "server-mode" extension, which provides the read/write page-level-locking concurrency and (in single-process mode) the read-only MVCC concurrency mentioned above.
The alternative free-list format is very similar to the current format. It differs in the following respects:
The "total number of free pages" field in the db header is not maintained. It is always set to zero.
Instead of pointing to the first free-list trunk page, the free-list pointer in the db header points to a page known as the "free-list node".
The free-list node page contains N pointers to free-lists stored in the legacy format (i.e. a linked list of trunk pages each containing pointers to many free leaf pages).
This effectively means that a database has N free-lists instead of just one. To allocate a free page, a writer only needs to lock one such free-list, and so up to N transactions that allocate new pages may proceed concurrently.
Allocating pages from the end of the db file still locks out all other read/write transactions (because it involves writing to page 1, which every transaction needs to read). To minimize the frequency with which this occurs, when a page must be allocated from the end of the database file, the file is extended by 2048 pages. These are distributed equally between 16 free-lists (children of the free-list node page). Additionally, the first trunk page in each free list is never reused. Doing so would require writing to the free-list node page - effectively an exclusive lock on the entire page-allocation system.
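As a rough illustration of the idea, a writer might scan the N free-lists starting at the one implied by its own client id, locking only the single list it allocates from. The names and structures below are hypothetical sketches, not the actual btree-layer code:

```c
/* Hypothetical sketch only: with N independent free-lists, a writer
** need lock just one list to allocate a page, so up to N allocations
** can proceed concurrently. */
#define N_FREELIST 16

typedef struct FreeList {
  unsigned int *aLeaf;    /* Free leaf page numbers in this list */
  int nLeaf;              /* Number of entries in aLeaf[] */
  int bLocked;            /* True while some writer holds this list */
} FreeList;

/* Start at the list implied by the caller's client-id so that
** concurrent writers tend to pick different lists, and skip any
** list that is already locked by another writer. */
static unsigned int allocFreePage(FreeList a[N_FREELIST], int iClient){
  int i;
  for(i=0; i<N_FREELIST; i++){
    FreeList *p = &a[(iClient+i) % N_FREELIST];
    if( p->bLocked || p->nLeaf==0 ) continue;
    p->bLocked = 1;                /* Lock this one free-list only */
    return p->aLeaf[--p->nLeaf];   /* (lock released at txn end) */
  }
  return 0;   /* All lists locked or empty: extend file by 2048 pages */
}
```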
The current format used for the free-list can be modified or queried using a new pragma:
```
PRAGMA [database.]freelist_format;
PRAGMA [database.]freelist_format = 1|2;
```
At present, the free-list format may only be modified when the free-list is completely empty. Because the implementation ensures that a free-list using the alternative format is never completely emptied, this effectively precludes changing the format from 2 (alternative) back to 1 (legacy).
For databases that use the "alternative" free-list format, the read and write versions in the database header (byte offsets 18 and 19) are set to 3 for rollback mode or 4 for wal mode (instead of 1 and 2 respectively).
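For example, a minimal standalone check of these two header bytes might look like the following sketch (the value interpretation is an assumption based on the description above):

```c
/* Sketch: classifying a database file by its header read/write version
** bytes (offsets 18 and 19). Values 1/2 are the standard rollback/wal
** versions; 3/4 indicate the alternative free-list format. */
#include <stdio.h>

int main(int argc, char **argv){
  unsigned char hdr[20];
  FILE *f;
  if( argc<2 || (f = fopen(argv[1], "rb"))==0 ) return 1;
  if( fread(hdr, 1, sizeof(hdr), f)!=sizeof(hdr) ) return 1;
  printf("write-version=%d read-version=%d (%s, %s)\n",
      hdr[18], hdr[19],
      hdr[18]>=3 ? "alternative free-list" : "legacy free-list",
      (hdr[18]==2 || hdr[18]==4) ? "wal" : "rollback");
  fclose(f);
  return 0;
}
```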
A database client automatically enters "server mode" if there exists a directory named "<database>-journal" in the file system alongside the database file "<database>". There is currently no provision for creating this directory, although it could be safely done for a database in rollback mode using something like:
```
PRAGMA journal_mode = off;
BEGIN EXCLUSIVE;
  <create directory>
END;
```
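A minimal C sketch of that recipe, assuming a database named "test.db" and omitting error handling:

```c
/* Sketch: safely creating the "<database>-journal" directory for an
** existing rollback-mode database, following the recipe above. */
#include <sqlite3.h>
#include <sys/stat.h>

int enter_server_mode(void){
  sqlite3 *db;
  if( sqlite3_open("test.db", &db)!=SQLITE_OK ) return 1;
  sqlite3_exec(db, "PRAGMA journal_mode = off", 0, 0, 0);
  sqlite3_exec(db, "BEGIN EXCLUSIVE", 0, 0, 0);  /* lock out others */
  mkdir("test.db-journal", 0755);                /* <create directory> */
  sqlite3_exec(db, "END", 0, 0, 0);
  sqlite3_close(db);
  return 0;
}
```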
As well as signalling new clients that they should enter server-mode, creating a directory named "<database>-journal" has the helpful side-effect of preventing legacy clients from accessing the database file at all.
If the VFS is one that takes an exclusive lock on the db file (to guarantee that no other process accesses the db file), then the system automatically enters single-process mode. Otherwise, it enters multi-process mode.
In both single and multi-process modes, page-level-locking is managed by allocating a fixed-size array of "locking slots". Each locking slot is 32-bits in size. By default, the array contains 262144 (2^18) slots. Pages are assigned to locking slots using the formula (pgno % 262144) - so pages 1, 262145, 524289 etc. share a single locking slot.
In single-process mode, the array of locking slots is allocated on the process heap and access is protected by a mutex. In multi-process mode, it is created by memory-mapping a file on disk (similar to the *-shm file in SQLite wal mode) and access is performed using atomic CAS primitives exclusively.
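In outline (names here are illustrative), the slot lookup is just a modulo operation:

```c
/* Sketch: mapping a page number to its locking slot. With 2^18 slots,
** pages congruent modulo 262144 (e.g. 1, 262145, 524289) share a slot. */
typedef unsigned int u32;

#define N_LOCKING_SLOT (1<<18)    /* 262144 slots, 32 bits each */

static u32 *aSlot;                /* heap-allocated in single-process
                                  ** mode; memory-mapped in multi-process
                                  ** mode */

static u32 *slotForPage(u32 pgno){
  return &aSlot[pgno % N_LOCKING_SLOT];
}
```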
Each time a read/write transaction is opened, the client assumes a client id between 0 and 15 for the duration of the transaction. Client ids are unique at any point in time - concurrently executing transactions must use different client ids. So there may exist a maximum of 16 concurrent read/write transactions at any one time.
Read/write transactions in server-mode are similar to regular SQLite transactions in rollback mode. The most significant differences are that:

- Instead of using journal file <database>-journal, server-mode clients use <database>-journal/<client-id>-journal (see the sketch below). If there are multiple concurrent transactions, each uses a separate journal file.
- No database-wide lock is taken. Instead, individual read and write locks are taken on the pages accessed by the transaction.
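For instance, a helper composing the per-client journal path might look like this (the helper name is hypothetical):

```c
/* Sketch: composing the "<database>-journal/<client-id>-journal" path
** used for a server-mode writer's rollback journal. */
#include <stdio.h>

static void journalPath(char *zBuf, size_t nBuf, const char *zDb, int iClient){
  snprintf(zBuf, nBuf, "%s-journal/%d-journal", zDb, iClient);
}
/* e.g. journalPath(zBuf, sizeof(zBuf), "test.db", 3)
**      yields "test.db-journal/3-journal" */
```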
Each locking slot is 32-bits in size. A locking slot may simultaneously support a single write-lock, up to 16 read-locks from read/write clients, and (in single-process mode) up to 1024 read-locks from "BEGIN READONLY" clients. Locking slot bits are used as follows (see also the sketch after this list):

- The least-significant 16 bits are used for read-locks taken by read/write clients. To take a read-lock, bit <client-id> of the locking slot is set.
- The next 5 bits are used for the write-lock. If no write-lock is held on the slot, then this 5-bit integer is set to 0. Otherwise, it is set to (C + 1), where C is the <client-id> of the client holding the write-lock.
- The next 10 bits contain the total number of read-locks held by "BEGIN READONLY" clients on the locking slot. See the section below for a description of how these are used.
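A minimal sketch of this layout in C, with illustrative names only:

```c
/* Sketch of the 32-bit locking slot layout described above. */
typedef unsigned int u32;

#define READ_MASK    0x0000FFFF               /* bits 0..15: read-locks  */
#define WRITE_SHIFT  16
#define WRITE_MASK   (0x1F << WRITE_SHIFT)    /* bits 16..20: writer+1   */
#define RO_SHIFT     21
#define RO_MASK      (0x3FF << RO_SHIFT)      /* bits 21..30: ro count   */

static int slotWriter(u32 v){        /* -1 if unlocked, else client-id */
  return (int)((v & WRITE_MASK) >> WRITE_SHIFT) - 1;
}
static u32 slotTakeReadLock(u32 v, int iClient){
  return v | (1u << iClient);
}
static u32 slotTakeWriteLock(u32 v, int iClient){
  return v | ((u32)(iClient+1) << WRITE_SHIFT);  /* field assumed 0 */
}
static int slotRoReaders(u32 v){     /* "BEGIN READONLY" readers */
  return (int)((v & RO_MASK) >> RO_SHIFT);
}
```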
Currently, if a client requests a lock that cannot be granted due to a conflicting lock, SQLITE_BUSY is returned to the caller and either the entire transaction or statement transaction must be rolled back. See Problems and Issues below for more details.
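In application code this currently means wrapping transactions in a retry loop along these lines. This is a simplified sketch; a real application would also handle statement-level rollbacks and add a backoff or retry limit:

```c
/* Sketch: retrying a transaction that fails with SQLITE_BUSY due to a
** page-lock conflict. */
#include <sqlite3.h>

static int execTxnWithRetry(sqlite3 *db, const char *zSql){
  int rc;
  while( (rc = sqlite3_exec(db, zSql, 0, 0, 0))==SQLITE_BUSY ){
    sqlite3_exec(db, "ROLLBACK", 0, 0, 0);  /* discard the failed txn */
  }
  return rc;
}
```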
Single-process mode is simpler than multi-process mode because it does not have to deal with runtime client failure - it is assumed that if one client fails mid-transaction, the entire process crashes. As a result, the only time hot-journal rollback is required in single-process mode is as part of startup. The first client to connect to a database in single-process mode attempts to open and roll back all 16 potential hot journal files.
But, in order to support non-blocking "BEGIN READONLY" transactions, it is also in some ways more complicated than multi-process mode. "BEGIN READONLY" support works as follows:
In single-process mode, writers never spill the cache mid-transaction. Data is only written to the database as part of committing a transaction.
As well as writing the contents of overwritten pages out to the journal file, a writer in single-process mode also accumulates, in main memory, a list of buffers containing the original data for each page overwritten by the current transaction.
When a transaction is ready to be committed, a writer obtains a transaction-id. Transaction-ids are assigned to writers using a monotonically increasing function. The writer then adds all of its "old data" buffers to a hash table accessible to all database clients. Associated with each hash table entry is the newly assigned transaction-id. It then waits (spin-locks) for all "BEGIN READONLY" read-locks to clear on all pages that will be written out by the transaction. Following this, it commits the transaction as normal (writes out the dirty pages and zeroes the journal file header).
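The commit sequence might be sketched as follows. All names here are hypothetical stand-ins for the real pager and server-mode code:

```c
/* Sketch of the single-process commit protocol: obtain a tid, publish
** the old page images, wait out BEGIN READONLY readers, then commit. */
typedef unsigned int u32;

typedef struct OldPage {
  u32 pgno;                    /* page number */
  u32 iTid;                    /* tid of the transaction overwriting it */
  unsigned char *aData;        /* original page image */
} OldPage;

static u32 nextTid = 1;                        /* monotonic tid source */
static void hashInsert(OldPage *p){ (void)p; } /* shared hash table */
static int nRoLocks(u32 pgno){ (void)pgno; return 0; } /* 10-bit count */

static void commitTxn(OldPage *aOld, int nOld){
  int i;
  u32 iTid = nextTid++;                /* 1. obtain a transaction-id  */
  for(i=0; i<nOld; i++){
    aOld[i].iTid = iTid;               /* 2. publish old images,      */
    hashInsert(&aOld[i]);              /*    tagged with the new tid  */
  }
  for(i=0; i<nOld; i++){
    while( nRoLocks(aOld[i].pgno) ){}  /* 3. spin until no BEGIN      */
  }                                    /*    READONLY read-locks left */
  /* 4. write dirty pages to the db and zero the journal header */
}
```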
Clients executing "BEGIN READONLY" transactions are not assigned a <client-id>. Instead, they are assigned a transaction-id that is either (a) the oldest transaction-id belonging to a writer that has not yet finished committing, or (b) if there are currently no writers committing, the value that will be assigned to the next committer.
When a "BEGIN READONLY" transaction reads a page, it first checks the aforementioned hash table for a suitable entry. A suitable entry is one with the right page-number and a transaction-id greater than or equal to that of the "BEGIN READONLY" transaction (i.e. one that had not finished committing when the BEGIN READONLY transaction started). If such an entry can be found, the client uses the associated data instead of reading from the db file. Or, if no such entry is found, the client:
The mutex used to protect access to the array of locking slots and the shared hash table is relinquished for step 2 above.
After each transaction is committed in single-process mode, the client searches the hash table for entries that can be discarded. An entry can be discarded if its transaction-id is older than any still in use (either by BEGIN READONLY transactions or by committers).
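A sketch of the discard test, using hypothetical structures:

```c
/* Sketch: discarding hash-table entries after a commit. An entry may
** be removed once its tid is older than every tid still in use. */
typedef unsigned int u32;

typedef struct Entry { u32 iTid; struct Entry *pNext; } Entry;

/* iOldest is the minimum tid in use by any BEGIN READONLY transaction
** or committing writer. */
static Entry *discardOld(Entry *pList, u32 iOldest){
  Entry **pp = &pList;
  while( *pp ){
    if( (*pp)->iTid < iOldest ){
      *pp = (*pp)->pNext;        /* unlink (freeing omitted) */
    }else{
      pp = &(*pp)->pNext;
    }
  }
  return pList;
}
```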
Multi-process mode differs from single-process mode in two important ways:
- Individual clients may fail mid-transaction, and the system must recover from this.
- Partly as a consequence of the above, there are no convenient primitives like mutexes or malloc() with which to build complicated data structures like the hash-table used in single-process mode. As a result, there is no support for "BEGIN READONLY" transactions in multi-process mode.
Unlike single-process mode clients, which may be assigned a different client-id for each transaction, clients in multi-process mode are assigned a client-id when they connect to the database and do not relinquish it until they disconnect. As such, a database in multi-process server-mode supports at most 16 concurrent client connections.
As well as the array of locking slots, the shared-memory mapping used by clients in multi-process mode contains 16 "client slots". When a client connects, it takes a posix WRITE lock on the client slot that corresponds to its client id. This lock is not released until the client disconnects. Additionally, whenever a client starts a transaction, it sets the value in its client locking slot to 1, and clears it again after the transaction is concluded.
This assists with handling client failure mid-transaction in two ways:
- If client A cannot obtain a lock due to a conflicting lock held by client B, it can check whether or not client B has failed by attempting a WRITE lock on client B's client slot (as sketched below). If the attempt succeeds, then client B must have failed, and client A may roll back client B's hot journal (if any), clear any locks still held by client B, and then retry its own lock.
- When a client first connects and locks its client slot, it can check whether or not the previous user of the slot failed mid-transaction (since if it did, the slot value will still be non-zero). If it did, the new owner of the client slot can release any locks and roll back any hot journal before proceeding.
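The failure probe might look like the following sketch. The byte offsets within the shared-memory file are an assumption for illustration:

```c
/* Sketch: probing a peer's client slot with a non-blocking posix
** write-lock, per the description above. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Return 1 if the owner of client slot iClient has died (the OS has
** released its lock), or 0 if it still holds the lock. fd is a file
** descriptor open on the shared-memory file. */
static int clientHasFailed(int fd, int iClient){
  struct flock lk;
  memset(&lk, 0, sizeof(lk));
  lk.l_type = F_WRLCK;
  lk.l_whence = SEEK_SET;
  lk.l_start = iClient;        /* assume one lock byte per client slot */
  lk.l_len = 1;
  if( fcntl(fd, F_SETLK, &lk)==0 ){
    lk.l_type = F_UNLCK;       /* lock obtained: owner must have died */
    fcntl(fd, F_SETLK, &lk);
    return 1;
  }
  return 0;
}
```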
The server-mode extension requires that the VFS support various special file-control commands. Currently support is limited to the "unix" VFS.
One such file-control is used by SQLite to query the VFS as to whether the connection should use single-process server-mode, multi-process server-mode, or continue in legacy mode.
SQLite invokes this file-control as part of the procedure for detecting a hot journal (after it has established that there is a file-system entry named <database>-journal and that no other process holds a RESERVED lock). If the <database>-journal directory is present in the file-system and the current VFS takes an exclusive lock on the database file (i.e. is "unix-excl"), then this file-control indicates that the connection should use single-process server-mode. Or, if the directory exists but the VFS does not take an exclusive lock on the database file, that the connection should use multi-process server-mode. Or, if there is no directory of the required name, that the connection should use legacy mode.
A second file-control returns a 128-bit value that uniquely identifies an open file on disk from the VFS. This is used to ensure that all connections to the same database from within a process use the same shared state, even if they connect to the db using different file-system paths.
Several problems and issues remain:

- Writer starvation might be the biggest issue. How can it be prevented?
- Blocking locks of some sort would likely improve things. The issue here is deadlock detection.
- The limit of 16 concurrent clients in multi-process mode could be raised to 27 (since the locking-slot bits used for BEGIN READONLY locks in single-process mode can be reassigned to support more read/write client read-locks).
The performance test described below uses a single table with the following schema:
```sql
CREATE TABLE t1(a INTEGER PRIMARY KEY, b BLOB(16), c BLOB(16), d BLOB(400));
CREATE INDEX i1 ON t1(b);
CREATE INDEX i2 ON t1(c);
```
The database initially contains 5,000,000 rows. Values for column "a" are between 1 and 5,000,000, inclusive. Other columns are populated with randomly generated blob values, each 16, 16, and 400 bytes in size, respectively.
Read/write transactions used by the test take the following form. Each such transaction modifies approximately 25 pages (5 in the main table and 10 in each index), not accounting for tree rebalancing operations.
```sql
BEGIN;
  REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
  REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
  REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
  REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
  REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
COMMIT;
```
Read-only transactions are as follows:
```sql
BEGIN READONLY;
  SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
  SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
  SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
  SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
  SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
END;
```
The performance test features one or more clients executing read/write transactions as fast as possible, and zero or more clients executing read-only transactions, also as fast as possible. All tests use the "unix-excl" VFS and all clients execute in a separate thread within the same process. The database is located on a tmpfs file-system.
In the table below "rw:" refers to the number of read-write clients, and "ro:" the number of read-only clients used by the test. The TPS values in brackets are the number of read-only transactions per second. All other values are read-write transactions per second.
The collision rate (percentage of attempted transactions that failed due to a page-level locking conflict) in all tests was between 1 and 2%. Failed transactions are not included in the TPS counts below.
Configuration | TPS per client | TPS total |
---|---|---|
3.20.0, journal_mode=persist, rw:1, ro:0 | 6886 | 6886 |
3.20.0, journal_mode=wal, rw:1, ro:0 | 5997 | 5997 |
begin-concurrent, rw:1, ro:0 | 5925 | 5925 |
begin-concurrent, rw:2, ro:0 | 4172 | 8345 |
begin-concurrent, rw:3, ro:0 | 2943 | 8831 |
begin-concurrent, rw:2, ro:1 | 3781 | 7562 (17473) |
server-mode, rw:1, ro:0 | 6333 | 6333 |
server-mode, rw:2, ro:0 | 5693 | 11386 |
server-mode, rw:3, ro:0 | 5132 | 15398 |
server-mode, rw:2, ro:1 | 5303 | 10607 (28300) |