The "server-process-edition" Branch

The "server-process-edition" branch contains two modifications to stock SQLite that work together to provide concurrent read/write transactions using page-level-locking provided that:

Up to 16 simultaneous read/write transactions controlled by page-level-locking are possible. Additionally, there may be any number of read-only transactions started using "BEGIN READONLY" commands. Read-only transactions do not block read-write transactions, and read-write transactions do not block read-only transactions.

The two features on this branch are:

  1. An alternative layout for the database free-page list. This is intended to reduce contention between writers when allocating new database pages, either from the free-list or by extending the database file.

  2. The "server-mode" extension, which provides the read/write page-level-locking and read-only MVCC concurrency mentioned above.

Alternative Free-List Format

The alternative free-list format is very similar to the current format. It differs in the following respects:

This effectively means that a database has N free-lists instead of just one. To allocate a free page, a writer only needs to lock one such free-list, and so up to N transactions that allocate new pages may proceed concurrently.

Allocating pages from the end of the db file still locks out all other read/write transactions (because it involves writing to page 1, which every transaction needs to read). To minimize the frequency with which this occurs, when a page must be allocated from the end of the database file, the file is extended by 2048 pages. These are distributed equally between 16 free-lists (children of the free-list node page). Additionally, the first trunk page in each free list is never reused. Doing so would require writing to the free-list node page - effectively an exclusive lock on the entire page-allocation system.

The current format used for the free-list can be modified or queried using a new pragma:

  PRAGMA [database.]freelist_format;
  PRAGMA [database.]freelist_format = 1|2;

At present, the free-list format may only be modified when the free-list is completely empty. Which, as the implementation ensures that a free-list that uses the alternative format is never completely emptied, effectively precludes changing the format from 2 (alternative) to 1 (legacy).

Page level locking - "Server Mode"

A database client automatically enters "server mode" if (a) it is using a VFS that takes a process-wide exclusive lock on the db file (like "unix-excl" does), and (b) there exists a directory named "<database>-journal" in the file system alongside the database file "<database>" There is currently no provision for creating this directory, although it could be safely done for a database in rollback mode using something like:

  PRAGMA journal_mode = off;
  BEGIN EXCLUSIVE;
    <create directory>
  END;

To check the status of these two conditions, a new file-control is added - SQLITE_FCNTL_SERVER_MODE. SQLite invokes this file-control as part of the procedure for detecting a hot journal (after it has established that there is a file-system entry named <database>-journal and that no other process holds a RESERVED lock). If the VFS does support an exclusive process-wide lock and if the directory is present, the VFS indicates that the client should enter server mode. If the VFS does not indicate this, or if it returns SQLITE_NOTFOUND, then SQLite proceeds with the hot-journal rollback.

There is also a new file-control named SQLITE_FCNTL_FILEID, which requests a 128-bit value that uniquely identifies an open file on disk from the VFS. This is used to ensure that all connections to the same database from within a process use the same shared state, even if they connect to the db using different file-system paths.

The heap-memory data structure shared between all connections to the same database is protected by a mutex. Clients take and release the mutex each time a transaction is opened or closed, and each time a read or write lock is taken on a specific database page. Ordinary read/write transactions lock each page that they access - each page can support any number of concurrent read locks or a single write lock.

Write transactions use a journal file stored in the <database>-journal directory. Journal files are named "<id>-journal", where <id> is an integer value betwen 0 and 15, inclusive. A client may use multiple different journal files throughout its lifetime.

Before database pages are overwritten in server-mode, entries are added to an in-memory hash table containing the old page content. These entries are used by read-only transactions to ensure that they access a consistent snapshot of the database. Hash table entries are automatically removed when they are no longer required.

It is not difficult to extend the kind of page level locking used by read/write transactions to clients in multiple processes. It might be more difficult to extend the read-only MVCC capability though.

Performance Test

The test uses a single table with the following schema:

  CREATE TABLE t1(a INTEGER PRIMARY KEY, b BLOB(16), c BLOB(16), d BLOB(400));
  CREATE INDEX i1 ON t1(b);
  CREATE INDEX i2 ON t1(c);

The database initially contains 5,000,000 rows. Values for column "a" are between 1 and 5,000,000, inclusive. Other columns are populated with randomly generated blob values, each 16, 16, and 400 bytes in size, respectively.

Read/write transactions used by the test take the following form. Each such transaction modifies approximately 25 pages (5 in the main table and 10 in each index), not accounting for tree rebalancing operations.

  BEGIN;
    REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
    REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
    REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
    REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
    REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
  COMMIT;

Read-only transactions are as follows:

  BEGIN READONLY;
    SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
    SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
    SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
    SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
    SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
  END;

The performance test features one or more clients executing read/write transactions as fast as possible, and zero or more clients executing read-only transactions, also as fast as possible. All tests use the "unix-excl" VFS and all clients execute in a separate thread within the same process. The database is located on a tmpfs file-system.

In the table below "rw:" refers to the number of read-write clients, and "ro:" the number of read-only clients used by the test. The TPS values in brackets are the number of read-only transactions per second. All other values are read-write transactions per second.

The collision rate (percentage of attempted transactions that failed due to a page-level locking conflict) in all tests was between 1 and 2%. Failed transactions are not included in the TPS counts below.

Configuration TPS per client TPS total
3.20.0, journal_mode=persist, rw:1, ro:0 6886 6886
3.20.0, journal_mode=wal, rw:1, ro:0 5997 5997
begin-concurrent, rw:1, ro:05925 5925
begin-concurrent, rw:2, ro:04172 8345
begin-concurrent, rw:3, ro:02943 8831
begin-concurrent, rw:2, ro:13781 7562 (17473)
server-mode, rw:1, ro:0 6333 6333
server-mode, rw:2, ro:0 5693 11386
server-mode, rw:3, ro:05132 15398
server-mode, rw:2, ro:15303 10607 (28300)