<html>
<center>
<h1> The "server-process-edition" Branch</h1>
</center>
<p>
The "server-process-edition" branch contains two modifications to stock
SQLite that work together to provide concurrent read/write transactions
using pessimistic page-level-locking. The system runs in two modes:
<ul>
<li><p> Single-process mode - where all clients must be within the
same address space, and
<li><p> Multi-process mode - where clients may be distributed between
multiple OS processes.
</ul>
<p> The system is designed to be most efficient when used with
<a href="https://www.sqlite.org/pragma.html#pragma_synchronous">
"PRAGMA synchronous=OFF"</a>, although it does not require this.
<p>
Up to 16 simultaneous read/write transactions controlled by page-level-locking
are possible. Additionally, in single-process mode there may be any number of
read-only transactions started using the "BEGIN READONLY" command. Read-only
transactions do not block read-write transactions, and read-write transactions
do not block read-only transactions. Read-only transactions access a consistent
snapshot of the database - writes committed by other clients after the
transaction has started are never visible to read-only transactions. In
multi-process mode, the "BEGIN READONLY" command is equivalent to a stock
"BEGIN".
<p>
The two features on this branch are:
<ol>
<li><p> An
<a href=#freelist>alternative layout for the database free-page list</a>.
This is intended to reduce contention between writers when allocating
new database pages, either from the free-list or by extending the
database file.
<li><p> The <a href=#servermode>"server-mode" extension</a>, which
     provides read/write page-level-locking concurrency and (in
     single-process mode) the read-only MVCC concurrency mentioned above.
</ol>
<h2 id=freelist> 1.0 Alternative Free-List Format </h2>
<p>
The alternative free-list format is very similar to the current format. It
differs in the following respects:
<ul>
<li><p>The "total number of free pages" field in the db header is not
maintained. It is always set to zero.
<li><p> Instead of pointing to the first free-list trunk page, the free-list
pointer in the db header points to a page known as the "free-list node".
<li><p> The free-list node page contains N pointers to free-lists stored in the
legacy format (i.e. a linked list of trunk pages each containing
pointers to many free leaf pages).
</ul>
<p>
This effectively means that a database has N free-lists instead of just one. To
allocate a free page, a writer only needs to lock one such free-list, and so up
to N transactions that allocate new pages may proceed concurrently.
<p>
Allocating pages from the end of the db file still locks out all other
read/write transactions (because it involves writing to page 1, which every
transaction needs to read). To minimize the frequency with which this occurs,
when a page must be allocated from the end of the database file, the file is
extended by 2048 pages. These are distributed equally between the 16 free-lists
(children of the free-list node page), 128 pages each. Additionally, the first
trunk page in each free-list is never reused, as doing so would require writing
to the free-list node page - effectively taking an exclusive lock on the entire
page-allocation system.
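<p> The allocation strategy above can be sketched as a toy, single-threaded
model. All names below ("db_alloc_page" and so on) are invented for
illustration, and per-list locking is reduced to a flag: a writer takes any
unlocked, non-empty free-list, and only when every list is empty does the
globally serialized 2048-page file extension run, handing 128 pages to each of
the 16 lists.

```c
/* Toy model of the N-free-list scheme: 16 independent free-lists, each
** guarded by its own lock, plus file extension in 2048-page chunks that
** are spread evenly (128 pages each) across the lists. */
#include <stdint.h>

#define NFREELIST    16
#define EXTEND_CHUNK 2048

typedef struct FreeList {
  int locked;              /* 1 while a writer owns this list */
  int nFree;               /* number of free pages in this list */
} FreeList;

typedef struct Db {
  FreeList aList[NFREELIST];
  uint32_t nPage;          /* current size of db file in pages */
} Db;

/* Extend the file by 2048 pages and share them between the 16 lists. */
static void db_extend(Db *p){
  p->nPage += EXTEND_CHUNK;
  for(int i=0; i<NFREELIST; i++) p->aList[i].nFree += EXTEND_CHUNK/NFREELIST;
}

/* Allocate a page for a writer: lock onto any unlocked, non-empty
** free-list. Only if all unlocked lists are empty does the (globally
** serialized) file extension run. Returns the index of the list used,
** or -1 if every list is currently locked by another writer. */
static int db_alloc_page(Db *p){
  int iEmpty = -1;
  for(int i=0; i<NFREELIST; i++){
    if( p->aList[i].locked ) continue;
    if( p->aList[i].nFree>0 ){
      p->aList[i].nFree--;
      return i;
    }
    iEmpty = i;
  }
  if( iEmpty>=0 ){
    db_extend(p);          /* would lock page 1 in the real system */
    p->aList[iEmpty].nFree--;
    return iEmpty;
  }
  return -1;
}
```

<p> The key property the real system exploits is visible even in this model:
two writers that land on different lists never contend, so only the extension
path serializes.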
<p>
The current format used for the free-list can be modified or queried using a
new pragma:
<pre>
PRAGMA [database.]freelist_format;
PRAGMA [database.]freelist_format = 1|2;
</pre>
<p>
At present, the free-list format may only be modified when the free-list is
completely empty. Because the implementation ensures that a free-list using the
alternative format is never completely emptied, this effectively precludes
changing the format from 2 (alternative) back to 1 (legacy).
<p>
For databases that use the "alternative" free-list format, the read and write
versions in the database header (byte offsets 18 and 19) are set to 3 for
rollback mode or 4 for wal mode (instead of 1 and 2 respectively).
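<p> For illustration, a reader of raw database files might classify the
free-list format from these header bytes as follows. This is a sketch; the
function name is invented, and only the version-byte mapping stated above is
assumed.

```c
/* Classify the free-list format of a database from its header,
** using the read-version byte at offset 18 (offset 19 holds the
** matching write-version). */
#include <stdint.h>

/* Returns 1 for the legacy free-list format, 2 for the alternative
** format, or 0 if the version byte is unrecognized. */
static int freelist_format(const uint8_t *aHdr){
  switch( aHdr[18] ){
    case 1: case 2: return 1;   /* 1=rollback, 2=wal: legacy free-list */
    case 3: case 4: return 2;   /* 3=rollback, 4=wal: alternative free-list */
    default: return 0;
  }
}
```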
<h2 id=servermode> 2.0 Page level locking - "Server Mode" </h2>
<p>
A database client automatically enters "server mode" if there exists a
<i>directory</i> named "&lt;database&gt;-journal" in the file system alongside
the database file "&lt;database&gt;". There is currently no provision for
creating this directory, although it could be safely done for a database in
rollback mode using something like:
<pre>
PRAGMA journal_mode = off;
BEGIN EXCLUSIVE;
&lt;create directory&gt;
END;
</pre>
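<p> The detection rule can be sketched in C as follows - a hypothetical helper
(not part of the branch) that reports whether a database should be opened in
server-mode by checking whether the journal name exists and is a directory:

```c
/* Return 1 if the database file zDb should be opened in server-mode,
** i.e. a *directory* (not a file) named "zDb-journal" exists alongside
** it. Returns 0 otherwise. */
#include <sys/stat.h>
#include <stdio.h>

static int is_server_mode_db(const char *zDb){
  char zPath[512];
  struct stat st;
  snprintf(zPath, sizeof(zPath), "%s-journal", zDb);
  return stat(zPath, &st)==0 && S_ISDIR(st.st_mode);
}
```

<p> Note that an ordinary journal <i>file</i> of the same name does not
trigger server-mode - the S_ISDIR() test fails - which is what allows the
directory to double as a barrier against legacy clients.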
<p> As well as signalling new clients that they should enter server-mode,
creating a directory named "&lt;database&gt;-journal" has the helpful
side-effect of preventing legacy clients from accessing the database file at
all.
<p> If the VFS is one that takes an exclusive lock on the db file (to
guarantee that no other process accesses the db file), then the system
automatically enters single-process mode. Otherwise, it enters
multi-process mode.
<p> In both single and multi-process modes, page-level-locking is managed
by allocating a fixed-size array of "locking slots". Each locking slot is
32-bits in size. By default, the array contains 262144 (2^18) slots. Pages are
assigned to locking slots using the formula (pgno % 262144) - so pages 1,
262145, 524289 etc. share a single locking slot.
<p> In single-process mode, the array of locking slots is allocated on
the process heap and access is protected by a mutex. In multi-process mode, it
is created by memory-mapping a file on disk (similar to the *-shm file in
SQLite wal mode) and access is performed using
<a href="https://en.wikipedia.org/wiki/Compare-and-swap">atomic CAS
primitives</a> exclusively.
<p> Each time a read/write transaction is opened, the client assumes a client
id between 0 and 15 for the duration of the transaction. Client ids are unique
at any point in time - concurrently executing transactions must use different
client ids - so there may be at most 16 concurrent read/write transactions at
any one time.
<p> Read/write transactions in server-mode are similar to regular SQLite
transactions in rollback mode. The most significant differences are that:
<ul>
<li> <p>Instead of using journal file &lt;database&gt;-journal, server-mode
     clients use &lt;database&gt;-journal/&lt;client-id&gt;-journal. If
there are multiple concurrent transactions, each uses a separate
journal file.
<li> <p>No database-wide lock is taken. Instead, individual read and write
locks are taken on the pages accessed by the transaction.
</ul>
<p> Each locking slot is 32 bits in size. A locking slot may simultaneously
support a single write-lock, up to 16 read-locks from read/write clients, and
(in single-process mode) up to 1023 read-locks from "BEGIN READONLY" clients
(the count is stored in a 10-bit field). Locking slot bits are used as follows:
<ul>
<li> <p> The least-significant 16 bits are used for read-locks taken by
     read/write clients. To take a read-lock, bit &lt;client-id&gt; of the
     locking slot is set.
<li> <p> The next 5 bits are used for the write-lock. If no write-lock
     is held on the slot, then this 5-bit integer is set to 0. Otherwise,
     it is set to (<i>C</i> + 1), where <i>C</i> is the &lt;client-id&gt; of
     the client holding the write-lock.
<li> <p> The next 10 bits contain the total number of read-locks held by
"BEGIN READONLY" clients on the locking slot. See the section below
for a description of how these are used.
</ul>
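<p> The bit layout above can be modelled with C11 atomics. The sketch below is
illustrative only - the function and macro names are invented, and the
conflict rules are simplified from the description above (in particular,
writers' handling of "BEGIN READONLY" counts at commit time is omitted). It
shows how page numbers map to slots, and how a read- or write-lock might be
taken with a compare-and-swap loop:

```c
/* Model of a 32-bit locking slot: bits 0..15 are per-client read-locks,
** bits 16..20 encode the write-locking client as (client-id + 1), and
** bits 21..30 count BEGIN READONLY read-locks. */
#include <stdatomic.h>
#include <stdint.h>

#define SLOT_ARRAY_SIZE 262144            /* 2^18 slots */
#define READ_MASK   0x0000FFFFu           /* bits 0..15 */
#define WRITE_SHIFT 16
#define WRITE_MASK  (0x1Fu << WRITE_SHIFT)
#define RO_SHIFT    21                    /* bits 21..30: READONLY count */
#define RO_MASK     (0x3FFu << RO_SHIFT)

/* Map a page number to its locking slot - pages 1, 262145, 524289 etc.
** all share slot 1. */
static uint32_t slot_for_page(uint32_t pgno){
  return pgno % SLOT_ARRAY_SIZE;
}

/* Try to take a read-lock for client iClient (0..15). Fails only if
** another client holds the write-lock. */
static int slot_try_read_lock(_Atomic uint32_t *pSlot, int iClient){
  uint32_t old = atomic_load(pSlot);
  do{
    uint32_t writer = (old & WRITE_MASK) >> WRITE_SHIFT;
    if( writer!=0 && writer!=(uint32_t)(iClient+1) ) return 0;   /* busy */
  }while( !atomic_compare_exchange_weak(pSlot, &old, old | (1u<<iClient)) );
  return 1;
}

/* Try to take the write-lock for client iClient. Fails if any write-lock
** is already held, or if another client holds a read-lock. */
static int slot_try_write_lock(_Atomic uint32_t *pSlot, int iClient){
  uint32_t old = atomic_load(pSlot);
  do{
    if( old & WRITE_MASK ) return 0;
    if( old & READ_MASK & ~(1u<<iClient) ) return 0;   /* other readers */
  }while( !atomic_compare_exchange_weak(pSlot, &old,
              old | ((uint32_t)(iClient+1) << WRITE_SHIFT)) );
  return 1;
}
```

<p> On failure the CAS reloads the slot value and the conditions are
re-checked, so a lock is only ever granted against a consistent view of the
slot - this is what lets multi-process mode dispense with mutexes entirely.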
<p> Currently, if a client requests a lock that cannot be granted due to
a conflicting lock, SQLITE_BUSY is returned to the caller and either the
entire transaction or statement transaction must be rolled back. See
<a href=#problems>Problems and Issues</a> below for more details.
<h3> 2.1 Single-Process Mode </h3>
<p> Single process mode is simpler than multi-process mode because it does
not have to deal with runtime client failure - it is assumed that if one
client fails mid-transaction the entire process crashes. As a result the
only time hot-journal rollback is required in single-process mode is as
part of startup. The first client to connect to a database in single-process
mode attempts to open and rollback all 16 potential hot journal files.
<p> But, in order to support non-blocking "BEGIN READONLY" transactions, it is
also in some ways more complicated than multi-process mode. "BEGIN READONLY"
support works as follows:
<ul>
<li> <p>In single-process mode, writers never spill the cache mid-transaction.
Data is only written to the database as part of committing a transaction.
<li> <p>As well as writing the contents of overwritten pages out to the journal
file, a writer in single-process mode also accumulates a list of buffers
containing the original data for each page overwritten by the current
transaction in main-memory.
<li> <p>When a transaction is ready to be committed, a writer obtains a
transaction-id. Transaction-ids are assigned to writers using a
monotonically increasing function. The writer then adds all of its "old
data" buffers to a hash table accessible to all database clients.
Associated with each hash table entry is the newly assigned transaction-id.
It then waits (spin-locks) for all "BEGIN READONLY" read-locks to clear on
all pages that will be written out by the transaction. Following this, it
commits the transaction as normal (writes out the dirty pages and zeroes
the journal file header).
<li> <p>Clients executing "BEGIN READONLY" transactions are not assigned
     a &lt;client-id&gt;. Instead, they are assigned a transaction-id that is
     either (a) the oldest transaction-id belonging to a writer that has
     not yet finished committing, or (b) if no writers are currently
     committing, the value that will be assigned to the next committer.
<li> <p>When a "BEGIN READONLY" transaction reads a page, it first checks
the aforementioned hash table for a suitable entry. A suitable entry
is one with the right page-number and a transaction-id greater than or
equal to that of the "BEGIN READONLY" transaction (i.e. one that had not
finished committing when the BEGIN READONLY transaction started). If such
an entry can be found, the client uses the associated data instead of
reading from the db file. Or, if no such entry is found, the client:
<ol>
<li> Increments the number of BEGIN READONLY read-locks on the page.
<li> Reads the contents of the page from the database file.
<li> Decrements the number of BEGIN READONLY read-locks on the page.
</ol>
<p> The mutex used to protect access to the array of locking slots and
the shared hash table is relinquished for step 2 above.
<li> <p>After each transaction is committed in single-process mode, the
client searches the hash table for entries that can be discarded. An
entry can be discarded if it has a transaction-id older than any still
in use (either by BEGIN READONLY transactions or committers).
</ul>
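<p> The page-visibility rule in the steps above reduces to a small predicate.
The structure below is illustrative, not the branch's actual hash-table entry:
a saved "old data" buffer satisfies a "BEGIN READONLY" reader only if it
carries the right page number and a transaction-id no older than the reader's
(i.e. the writer had not finished committing when the reader started).

```c
/* Model of a single hash-bucket chain of old-page buffers saved by
** committing writers, and the lookup a BEGIN READONLY reader performs. */
#include <stdint.h>
#include <stddef.h>

typedef struct OldPage {
  uint32_t pgno;           /* page number this buffer belongs to */
  uint64_t iTid;           /* transaction-id of the committing writer */
  const void *pData;       /* original (pre-commit) page image */
  struct OldPage *pNext;   /* next entry in this hash bucket */
} OldPage;

/* Search a bucket chain for page pgno on behalf of a reader holding
** transaction-id iReaderTid. Returns the saved page image if a suitable
** entry exists, or NULL, meaning the reader must take a READONLY
** read-lock and read the page from the database file instead. */
static const void *snapshot_find(
  OldPage *pBucket, uint32_t pgno, uint64_t iReaderTid
){
  for(OldPage *p=pBucket; p; p=p->pNext){
    if( p->pgno==pgno && p->iTid>=iReaderTid ) return p->pData;
  }
  return NULL;
}
```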
<h3> 2.2 Multi-Process Mode </h3>
<p> Multi-process mode differs from single-process mode in two important ways:
<ul>
<li> <p>Individual clients may fail mid-transaction and the system must recover
from this.
<li> <p>Partly as a consequence of the above, there are no convenient
primitives like mutexes or malloc() with which to build complicated data
structures like the hash-table used in single-process mode. As a result,
there is no support for "BEGIN READONLY" transactions in multi-process
mode.
</ul>
<p> Unlike single-process mode clients, which may be assigned a different
client-id for each transaction, clients in multi-process mode are assigned a
client-id when they connect to the database and do not relinquish it until
they disconnect. As such, a database in multi-process server-mode supports
at most 16 concurrent client connections.
<p> As well as the array of locking slots, the shared-memory mapping used
by clients in multi-process mode contains 16 "client slots". When a client
connects, it takes a posix WRITE lock on the client slot that corresponds
to its client id. This lock is not released until the client disconnects.
Additionally, whenever a client starts a transaction, it sets the value
in its client locking slot to 1, and clears it again after the transaction
is concluded.
<p> This assists with handling client failure mid-transaction in two ways:
<ul>
<li><p> If client A cannot obtain a lock due to a conflicting lock held by
client B, it can check whether or not client B has failed by attempting a
WRITE lock on its client locking slot. If successful, then client B must
have failed and client A may:
<ul>
<li> Roll back client B's journal, and
<li> By iterating through the entire locking slot array, release all
locks held by client B when it failed.
</ul>
<li><p> When a client first connects and locks its client locking slot, it
can check whether or not the previous user of the client locking slot failed
mid-transaction (since if it did, the locking slot value will still be
non-zero). If it did, the new owner of the client locking slot can release
any locks and roll back any hot-journal before proceeding.
</ul>
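<p> The failure-detection logic above can be condensed into a small decision
function. This is a model only - the inputs stand in for the outcome of a
non-blocking posix WRITE lock attempt on the other client's slot and for the
value found in that slot:

```c
/* Model of the multi-process failure check. If client A cannot obtain a
** page lock held by client B, it probes B's client slot with a
** non-blocking WRITE lock. Obtaining that lock means B is gone; a
** non-zero slot value then means B died mid-transaction. */
typedef enum {
  CLIENT_ALIVE,        /* probe failed: B still holds its client slot */
  CLIENT_GONE_CLEAN,   /* B disconnected with no transaction open */
  CLIENT_GONE_DIRTY    /* B failed mid-transaction: roll back its journal
                       ** and release all its page locks */
} ClientState;

/* bLockObtained: non-zero if the non-blocking WRITE lock on B's client
** slot succeeded. iSlotValue: the value B stored in its client locking
** slot (1 while a transaction is open, 0 otherwise). */
static ClientState probe_client(int bLockObtained, int iSlotValue){
  if( !bLockObtained ) return CLIENT_ALIVE;
  return iSlotValue ? CLIENT_GONE_DIRTY : CLIENT_GONE_CLEAN;
}
```

<p> The CLIENT_GONE_DIRTY path is the one that triggers the two recovery
steps listed above: rolling back the failed client's journal, then sweeping
the locking-slot array to release its locks.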
<h3> 2.3 Required VFS Support </h3>
<p> The server-mode extension requires that the VFS support various special
file-control commands. Currently support is limited to the "unix" VFS.
<dl>
<dt> SQLITE_FCNTL_SERVER_MODE
<dd><p> This is used by SQLite to query the VFS as to whether the
connection should use single-process server-mode, multi-process server-mode,
or continue in legacy mode.
<p>SQLite invokes this file-control as part of the procedure for detecting a
hot journal (after it has established that there is a file-system entry named
&lt;database&gt;-journal and that no other process holds a RESERVED lock).
If the &lt;database&gt;-journal directory is present in the file-system and
the current VFS takes an exclusive lock on the database file (i.e. is
"unix-excl"), then this file-control indicates that the connection should use
single-process server-mode. Or, if the directory exists but the VFS does not
take an exclusive lock on the database file, that the connection should use
multi-process server-mode. Or, if there is no directory of the required name,
that the connection should use legacy mode.
<dt> SQLITE_FCNTL_FILEID
<dd><p> Return a 128-bit value that uniquely identifies an open file on disk
from the VFS. This is used to ensure that all connections to the same
database from within a process use the same shared state, even if they
connect to the db using different file-system paths.
<dt> SQLITE_FCNTL_SHMOPEN
<dd>
<dt> SQLITE_FCNTL_SHMOPEN2
<dd>
<dt> SQLITE_FCNTL_SHMLOCK
<dd>
<dt> SQLITE_FCNTL_SHMCLOSE
<dd>
</dl>
<h2 id=problems> 3.0 Problems and Issues </h2>
<ul>
<li> <p>Writer starvation might be the biggest issue. How can it be
prevented?
<li> <p>Blocking locks of some sort would likely improve things. The issue
here is deadlock detection.
<li> <p>The limit of 16 concurrent clients in multi-process mode could be
raised to 27 (since the locking-slot bits used for BEGIN READONLY
locks in single-process mode can be reassigned to support more
read/write client read-locks).
</ul>
<h2> 4.0 Performance Test </h2>
<p>
The test uses a single table with the following schema:
<pre>
CREATE TABLE t1(a INTEGER PRIMARY KEY, b BLOB(16), c BLOB(16), d BLOB(400));
CREATE INDEX i1 ON t1(b);
CREATE INDEX i2 ON t1(c);
</pre>
<p>
The database initially contains 5,000,000 rows. Values for column "a" are
between 1 and 5,000,000, inclusive. Columns "b", "c", and "d" are populated
with randomly generated blobs of 16, 16, and 400 bytes, respectively.
<p>
Read/write transactions used by the test take the following form. Each such
transaction modifies approximately 25 pages (5 in the main table and 10 in each
index), not accounting for tree rebalancing operations.
<pre>
BEGIN;
REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
REPLACE INTO t1 VALUES(abs(random() % 5000000), randomblob(16), randomblob(16), randomblob(400));
COMMIT;
</pre>
<p>
Read-only transactions are as follows:
<pre>
BEGIN READONLY;
SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
SELECT * FROM t1 WHERE a>abs((random()%5000000)) LIMIT 10;
END;
</pre>
<p>
The performance test features one or more clients executing read/write
transactions as fast as possible, and zero or more clients executing read-only
transactions, also as fast as possible. All tests use the "unix-excl" VFS and
all clients execute in a separate thread within the same process. The database
is located on a tmpfs file-system.
<p>
In the table below "rw:" refers to the number of read-write clients, and "ro:"
the number of read-only clients used by the test. The TPS values in brackets
are the number of read-only transactions per second. All other values are
read-write transactions per second.
<p>
The collision rate (percentage of attempted transactions that failed due to a
page-level locking conflict) in all tests was between 1 and 2%. Failed
transactions are not included in the TPS counts below.
<p>
<table border=1 width=90% align=center>
<!--
320 jm=persist, rw:1 139675, 135797
320 jm=wal, rw:1 120995, 118889
begin-concurrent, rw:1 119438, 117580
begin-concurrent, rw:2 166923, 166904
begin-concurrent, rw:3 180432, 172825
begin-concurrent, rw:2,ro:1 150427(347319),152087(351601)
server-mode, rw:1 126592, 126742
server-mode, rw:2 228317, 227155
server-mode, rw:3 309712, 306218
server-mode, rw:2,ro:1 213303(576032),210994(556005)
-->
<tr><th> Configuration <th>TPS per client <th> TPS total
<tr><td> 3.20.0, journal_mode=persist, rw:1, ro:0 <td>6886 <td> 6886
<tr><td> 3.20.0, journal_mode=wal, rw:1, ro:0 <td>5997 <td> 5997
<tr><td> begin-concurrent, rw:1, ro:0<td>5925<td> 5925
<tr><td> begin-concurrent, rw:2, ro:0<td>4172 <td> 8345
<tr><td> begin-concurrent, rw:3, ro:0<td>2943 <td> 8831
<tr><td> begin-concurrent, rw:2, ro:1<td>3781 <td> 7562 (17473)
<tr><td> server-mode, rw:1, ro:0 <td>6333 <td> 6333
<tr><td> server-mode, rw:2, ro:0<td> 5693 <td> 11386
<tr><td> server-mode, rw:3, ro:0<td>5132 <td> 15398
<tr><td> server-mode, rw:2, ro:1<td>5303 <td> 10607 (28300)
</table>