The "server-process-edition" Branch

The "server-process-edition" branch contains two modifications to stock SQLite that work together to provide concurrent read/write transactions using pessimistic page-level-locking. The system runs in two modes:

The system is designed to be most efficient when used with "PRAGMA synchronous=OFF", although it does not require this.

Up to 16 simultaneous read/write transactions controlled by page-level-locking are possible. Additionally, in single-process mode there may be any number of read-only transactions started using the "BEGIN READONLY" command. Read-only transactions do not block read-write transactions, and read-write transactions do not block read-only transactions. Read-only transactions access a consistent snapshot of the database - writes committed by other clients after the transaction has started are never visible to read-only transactions. In multi-process mode, the "BEGIN READONLY" command is equivalent to a stock "BEGIN".

The two features on this branch are:

  1. An alternative layout for the database free-page list. This is intended to reduce contention between writers when allocating new database pages, either from the free-list or by extending the database file.

  2. The "server-mode" extension, which provides read/write page-level-locking concurrency and (in single-process mode) read-only MVCC concurrency mentioned above.

1.0 Alternative Free-List Format

The alternative free-list format is very similar to the current format. It differs in the following respects:

This effectively means that a database has N free-lists instead of just one. To allocate a free page, a writer only needs to lock one such free-list, and so up to N transactions that allocate new pages may proceed concurrently.

Allocating pages from the end of the db file still locks out all other read/write transactions (because it involves writing to page 1, which every transaction needs to read). To minimize the frequency with which this occurs, when a page must be allocated from the end of the database file, the file is extended by 2048 pages. These are distributed equally between 16 free-lists (children of the free-list node page). Additionally, the first trunk page in each free list is never reused. Doing so would require writing to the free-list node page - effectively an exclusive lock on the entire page-allocation system.

The current format used for the free-list can be modified or queried using a new pragma:

  PRAGMA [database.]freelist_format;
  PRAGMA [database.]freelist_format = 1|2;

At present, the free-list format may only be modified when the free-list is completely empty. Which, as the implementation ensures that a free-list that uses the alternative format is never completely emptied, effectively precludes changing the format from 2 (alternative) to 1 (legacy).

For databases that use the "alternative" free-list format, the read and write versions in the database header (byte offsets 18 and 19) are set to 3 for rollback mode or 4 for wal mode (instead of 1 and 2 respectively).

2.0 Page level locking - "Server Mode"

A database client automatically enters "server mode" if there exists a directory named "<database>-journal" in the file system alongside the database file "<database>" There is currently no provision for creating this directory, although it could be safely done for a database in rollback mode using something like:

  PRAGMA journal_mode = off;
  BEGIN EXCLUSIVE;
    <create directory>
  END;

As well as signalling new clients that they should enter server-mode, creating a directory named "<database>-journal" has the helpful side-effect of preventing legacy clients from accessing the database file at all.

If the VFS is one that takes an exclusive lock on the db file (to guarantee that no other process accesses the db file), then the system automatically enters single-process mode. Otherwise, multi-process mode.

In both single and multi-process modes, page-level-locking is managed by allocating a fixed-size array of "locking slots". Each locking slot is 32-bits in size. By default, the array contains 262144 (2^18) slots. Pages are assigned to locking slots using the formula (pgno % 262144) - so pages 1, 262145, 524289 etc. share a single locking slot.

In single-process mode, the array of locking slots is allocated on the process heap and access is protected by a mutex. In multi-process mode, it is created by memory-mapping a file on disk (similar to the *-shm file in SQLite wal mode) and access is performed using atomic CAS primitives exclusively.

Each time a read/write transaction is opened, the client assumes a client id between 0 and 15 for the duration of the transaction. Client ids are unique at any point in time - concurrently executing transactions must use different client ids. So there may exist a maximum of 16 concurrent read/write transactions at any one time.

Read/write transactions in server-mode are similar to regular SQLite transactions in rollback mode. The most significant differences are that:

Each locking slot is 32-bits in size. A locking slot may simultaneously support a single write-lock, up to 16 read-locks from read/write clients, and (in single process mode) up 1024 read-locks from "BEGIN READONLY" clients. Locking slot bits are used as follows:

Currently, if a client requests a lock that cannot be granted due to a conflicting lock, SQLITE_BUSY is returned to the caller and either the entire transaction or statement transaction must be rolled back. See Problems and Issues below for more details.

2.1 Single-Process Mode

Single process mode is simpler than multi-process mode because it does not have to deal with runtime client failure - it is assumed that if one client fails mid-transaction the entire process crashes. As a result the only time hot-journal rollback is required in single-process mode is as part of startup. The first client to connect to a database in single-process mode attempts to open and rollback all 16 potential hot journal files.

But, in order to support non-blocking "BEGIN READONLY" transactions, it is also in some ways more complicated than multi-process mode. "BEGIN READONLY" support works as follows:

2.2 Multi-Process Mode

Multi-process mode differs from single-process mode in two important ways:

Unlike single-process mode clients, which may be assigned a different client-id for each transaction, clients in multi-process mode are assigned a client-id when they connect to the database and do not relinquish it until they disconnect. As such, a database in multi-process server-mode supports at most 16 concurrent client connections.

As well as the array of locking slots, the shared-memory mapping used by clients in multi-process mode contains 16 "client slots". When a client connects, it takes a posix WRITE lock on the client slot that corresponds to its client id. This lock is not released until the client disconnects. Additionally, whenever a client starts a transaction, it sets the value in its client locking slot to 1, and clears it again after the transaction is concluded.

This assists with handling client failure mid-transaction in two ways:

2.3 Required VFS Support

The server-mode extension requires that the VFS support various special file-control commands. Currently support is limited to the "unix" VFS.

SQLITE_FCNTL_SERVER_MODE

This is used by SQLite to query the VFS as to whether the connection should use single-process server-mode, multi-process server-mode, or continue in legacy mode.

SQLite invokes this file-control as part of the procedure for detecting a hot journal (after it has established that there is a file-system entry named <database>-journal and that no other process holds a RESERVED lock). If the <database>-journal directory is present in the file-system and the current VFS takes an exclusive lock on the database file (i.e. is "unix-excl"), then this file-control indicates that the connection should use single-process server-mode. Or, if the directory exists but the VFS does not take an exclusive lock on the database file, that the connection should use multi-proces server-mode. Or, if there is no directory of the required name, that the connection should use legacy mode.

SQLITE_FCNTL_FILEID

Return a 128-bit value that uniquely identifies an open file on disk from the VFS. This is used to ensure that all connections to the same database from within a process use the same shared state, even if they connect to the db using different file-system paths.

SQLITE_FCNTL_SHMOPEN
SQLITE_FCNTL_SHMOPEN2
SQLITE_FCNTL_SHMLOCK
SQLITE_FCNTL_SHMCLOSE

3.0 Problems and Issues

4.0 Performance Test

The test uses a single table with the following schema:

  CREATE TABLE t1(a INTEGER PRIMARY KEY, b BLOB(16), c BLOB(16), d BLOB(400));
  CREATE INDEX i1 ON t1(b);
  CREATE INDEX i2 ON t1(c);

The database initially contains 5,000,000 rows. Values for column "a" are between 1 and 5,000,000, inclusive. Other columns are populated with randomly generated blob values, each 16, 16, and 400 bytes in size, respectively.

Read/write transactions used by the test take the following form. Each such transaction modifies approximately 25 pages (5 in the main table and 10 in each index), not accounting for tree rebalancing operations.

  BEGIN;
    REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
    REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
    REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
    REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
    REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
  COMMIT;

Read-only transactions are as follows:

  BEGIN READONLY;
    SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
    SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
    SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
    SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
    SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
  END;

The performance test features one or more clients executing read/write transactions as fast as possible, and zero or more clients executing read-only transactions, also as fast as possible. All tests use the "unix-excl" VFS and all clients execute in a separate thread within the same process. The database is located on a tmpfs file-system.

In the table below "rw:" refers to the number of read-write clients, and "ro:" the number of read-only clients used by the test. The TPS values in brackets are the number of read-only transactions per second. All other values are read-write transactions per second.

The collision rate (percentage of attempted transactions that failed due to a page-level locking conflict) in all tests was between 1 and 2%. Failed transactions are not included in the TPS counts below.

Configuration TPS per client TPS total
3.20.0, journal_mode=persist, rw:1, ro:0 6886 6886
3.20.0, journal_mode=wal, rw:1, ro:0 5997 5997
begin-concurrent, rw:1, ro:05925 5925
begin-concurrent, rw:2, ro:04172 8345
begin-concurrent, rw:3, ro:02943 8831
begin-concurrent, rw:2, ro:13781 7562 (17473)
server-mode, rw:1, ro:0 6333 6333
server-mode, rw:2, ro:0 5693 11386
server-mode, rw:3, ro:05132 15398
server-mode, rw:2, ro:15303 10607 (28300)