hctree

Replication APIs
Login

Replication APIs

Leader/Follower Replication Details

This page builds on the speculations here.

  1. Replicated Database Schema
  2. Initializing a Replicated Database
  3. Selecting Leader/Follower Mode
  4. Automatic Journalling For Leaders
  5. Applying Changes To Follower Databases
  6. Validation Callback For Leader Nodes
  7. Truncating the Journal Table
  8. Hash and Synchronization Related Functions
  9. Application Programming Notes
  10. Table Details
  11. Data Format

1. Replicated Database Schema

A replicated database, one that supports leader/follower replication, contains the following two system tables:

    -- The "journal" table.
    CREATE TABLE sqlite_hct_journal(
      cid INTEGER PRIMARY KEY,
      schema TEXT,
      data BLOB,
      schemacid INTEGER,
      hash BLOB,
      tid INTEGER,
      validcid INTEGER
    );

    -- The "baseline" table.
    CREATE TABLE sqlite_hct_baseline(
      cid INTEGER,
      schemacid INTEGER,
      hash BLOB
    );

By default, the sqlite_hct_journal table - hereafter the "journal table" - contains one entry for each transaction that has been written to the database since the beginning of time. The four most interesting columns of the journal table entry are:

The current state of the database may be reconstructed by starting with an empty database, and then for each journal table entry, in order of ascending "cid":

  1. Executing the "schema" SQL script, and then
  2. Updating the database tables according to the contents of the "data" field.

The journal and baseline tables may be read like any other database table using SQL SELECT statements. However they behave differently with respect to transaction isolation. Specifically, all data committed to the journal or baseline table becomes instantly visible to all SQL clients, even if the write transaction is not included in the client's snapshot.

It is an error (SQLITE_READONLY) to attempt to write to the journal or baseline table directly. They may only be modified indirectly, using the interfaces described in the following sections.

1.1. Truncated Journals

In practice, storing data for every transaction executed during the lifetime of a database would soon become cumbersome. The API allows the journal to be truncated. Truncating the journal atomically:

The baseline table always contains a single row, summarizing all transactions that have been removed from the journal table in the lifetime of the database. Initally, the baseline table is populated as if:

    INSERT INTO sqlite_hct_baseline(0, 0, zeroblob(16));
Generally, the columns of the single row in the baseline table are populated as follows:

1.2. Holes in the Journal

Because transactions may be executed concurrently on both leader and follower nodes, a transaction with CID value C may be added to the journal table before transaction C-1. This means that in general, a journal may contain something other than a contiguous set of CID values starting from (baseline.cid+1). In this case, the missing entries are refered to as "holes in the journal".

2. Initializing a Replicated Database

To participate in leader/follower replication, a database must be configured by calling the following function:

    /*
    ** Initialize the main database for replication. Return SQLITE_OK if
    ** successful. Otherwise, return an SQLite error code and leave an
    ** English language error message in the db handle for sqlite3_errmsg().
    */
    int sqlite3_hct_journal_init(sqlite3 *db);

This API call creates the journal and baseline tables in the database, and adds the initial row to the baseline table.

When this API is called, the following must be true:

Some attempt is made to verify the three conditions above, but race conditions are possible. For example, if another thread opens a connection to the database while sqlite3_hct_journal_init() is being called, database corruption may follow.

3. Selecting Leader/Follower Mode

Each replication-enabled database opened by a process is at all times in either LEADER or FOLLOWER mode. In FOLLOWER mode, it is an error to attempt to write to the database using the SQL interface. In LEADER mode, it is an error to call the sqlite3_hct_journal_write() or sqlite3_hct_journal_snapshot() interfaces.

The LEADER/FOLLOWER setting is per-database, not per-database handle. If a process has multiple handles open on the same replication enabled database, then all handles share a single LEADER/FOLLOWER setting. Modifying the setting via one handle modifies it for them all.

    #define SQLITE_HCT_JOURNAL_MODE_FOLLOWER 0
    #define SQLITE_HCT_JOURNAL_MODE_LEADER   1

    /* 
    ** Query the LEADER/FOLLOWER setting of the db passed as the 
    ** first argument. Return either an SQLITE_HCT_JOURNAL_MODE_XXX constant,
    ** or else a -1 to indicate that the main database of handle db is not a
    ** replication enabled hct database.
    */
    int sqlite3_hct_journal_mode(sqlite3 *db);

    /*
    ** Set the LEADER/FOLLOWER mode of the db passed as the first argument.
    ** Return SQLITE_OK if the db mode is successfully changed, or if
    ** it does not need to be changed because the requested mode is the
    ** same as the current mode. Otherwise, return an SQLite error code
    ** and leave an English language error message in the database handle
    ** for sqlite3_errmsg().
    **
    ** It is safe to call this function while there are ongoing read
    ** transactions. However, this function may not be called concurrently
    ** with any write transaction or sqlite3_hct_journal_write() call on
    ** the same database (database, not just database handle). Doing so
    ** may cause database corruption.
    */
    int sqlite3_hct_journal_setmode(sqlite3 *db, int eMode);

When a replication database is first opened, or when it is first marked as a replication database by a call to sqlite3_hct_journal_init(), it is in FOLLOWER mode.

It may then be changed to LEADER mode using the API. Except, the mode may only be changed to LEADER if there are no holes in the journal. A database in leader mode may be changed to FOLLOWER mode at any time using the above API.

4. Automatic Journalling For Leader Nodes

By default, whenever a transaction is committed to a database configured for replication in LEADER mode, an entry is automatically added to the journal table.

In some cases, when a commit fails during transaction validation (after a CID has been allocated), an entry is written to the journal table even though the transaction has been rolled back. In this case both the "schema" and "data" fields are 0 bytes in size.

5. Applying Changes To Follower Databases

To copy a journal entry from one database to another, the first 4 fields of the journal entry must be passed to the following API:

    /*
    ** Write a transaction into the database.
    */
    int sqlite3_hct_journal_write(
        sqlite3 *db,                   /* Write to "main" db of this handle */
        i64 iCid,
        const char *zSchema,
        const void *pData, int nData,
        i64 iSchemaCid
    );
The database must be in FOLLOWER mode when this function is called. If successful, a single call to sqlite3_hct_journal_write() atomically updates both the journal table and, according to the zSql and pData/nData arguments, the database itself.

If the journal table already contains an entry with "cid" value iCid, then the call fails with an SQLITE_CONSTRAINT error. Or, if the iSchemaCid value is not compatible with the current contents of the journal, this call also fails with SQLITE_CONSTRAINT.

Concurrency

Each entry of the journal table represents a transaction. The contents of a journal table, sorted in CID order, may be looked at as a series of groups of transactions, where all members of a group have the same value for the schemacid column. A group consists of:

A schema transaction may not be applied concurrently with any other entry. Before the schema transaction with cid=C can be applied, all transactions with cid values less than C must have been completed. And schema transaction C must have completed before any transaction with a cid value greater than C can be applied. Any call to sqlite3_hct_journal_write() that violates these ordering rules fails with an SQLITE_SCHEMA error. The application should interpret this as "try that one again later".

However, within a group, non-schema transactions may be applied in any order, using any number of threads (each with its own db handle).

Snapshot Availability

In a distributed system, it may be desired not to run a query on a follower node until a certain transaction from the leader node has been propagated and made available on the follower. For example, if a single client performs a database write followed by a database read, then the results of the write should be visible to the read, even if the read is performed on a follower node.

The following API may be used to query for the current snapshot available to readers on a follower node. If it is not new enough, the caller must wait until further transactions have been applied to the db via sqlite3_hct_journal_write(), then retry this API call.

    /*
    ** Set output variable (*piCid) to the CID of the newest available 
    ** database snapshot. Return SQLITE_OK if successful, or an SQLite
    ** error code if something goes wrong.
    */
    int sqlite3_hct_journal_snapshot(sqlite3 *db, i64 *piCid);
In practice, the available snapshot S is the largest value for which all journal entries with CID values S or smaller are already present in the journal table.

6. Validation Callback for Leader Nodes

The sqlite3_hct_journal_validation_hook() API is used to register a custom validation callback with a database handle. If one is registered, the custom validation callback is invoked each time a transaction is commited to an hctree database in LEADER mode.

More specifically, the validation hook is invoked after all new keys have already been written into the database, and after internal transaction validation has succeeded. All that remains to commit the transaction is:

If the validation hook returns 0, then the commit proceeds as normal. Or, if it returns non-zero, then the transaction is rolled back and SQLITE_BUSY_SNAPSHOT returned to the user, just as if internal transaction validation had failed.

    /*
    ** Register a custom validation callback with the database handle.
    */
    int sqlite3_hct_journal_validation_hook(
        sqlite3 *db,
        void *pArg,
        int(*xValidate)(
            void *pCopyOfArg,
            i64 iCid
            const char *zSchema,
            const void *pData, int nData,
            i64 iSchemaCid
        )
    );
The first argument passed to the validation callback is a copy of the context pointer supplied by the application as the second argument to sqlite3_hct_journal_validation_hook(). The values passed to the following arguments are identical to the leftmost 4 fields of the entry that will be inserted into the journal table for the transaction, assuming it is committed.

The custom validation hook may be used for two purposes:

  1. As an efficient way to obtain the 4 values that must be propagated to follower nodes for each transaction, and

  2. To perform custom validation. An example of custom validation that might be required in a leader-follower system is that some transactions may require that they be propagated to follower nodes before they can be committed.

For the purposes of snapshot availability, while the validation hook is running for transaction T:

In cases where normal, local, validation of a transaction fails after a CID value has been allocated, an entry with zero length values for the schema and data columns is inserted into the journal table. In this case the validation hook is invoked with the corresponding zero length parameters, so that the empty transaction can be propagated to follower nodes.

7. Truncating the Journal Table

In order to avoid the journal table from growing indefinitely, old entries may be deleted from it - on a FIFO basis only - once it is possible that they will no longer be required for synchronization. The contents of the baseline table, always a single row, summarizes those journal entries already deleted from the journal table.

There is a C API to atomically remove rows from the journal table and update the baseline table:

    /*
    ** Truncate the journal table of database zDb (e.g. "main") so that the
    ** smallest CID value it contains is iMinCid.
    */
    int sqlite3_hct_journal_truncate(sqlite3 *db, i64 iMinCid);

8. Hash and Synchronization Related Functions

Synchronization between nodes is largely left to the application. The contents of a journal table may be read using ordinary SQL queries, and missing transactions applied to databases using the usual sqlite3_hct_journal_write() interface described above. This section describes the provided APIs for:

Further nodes on how these functions might be used to implement node synchronization may be found here.

Calculating Hashes

The hash value stored in the baseline table is SQLITE_HCT_JOURNAL_HASHSIZE (16) bytes in size. It may be calculated as follows:

The following function may be used to calculate a hash for a single journal table entry, based on the values of the "cid", "schema", "data" and "schemacid" columns:

    /*
    ** It is assumed that buffer pHash points to a buffer
    ** SQLITE_HCT_JOURNAL_HASHSIZE bytes in size. This function populates this
    ** buffer with a hash based on the remaining arguments.
    */
    void sqlite3_hct_journal_hashentry(
        void *pHash,              /* OUT: Hash of other arguments */
        i64 iCid,
        const char *zSchema,
        const void *pData, int nData,
        i64 iSchemaCid
    );

The following function may be used to calculate hash values compatible with those stored by the system in the "hash" column of the sqlite_hct_baseline table.

    /*
    ** Both arguments are assumed to point to SQLITE_HCT_JOURNAL_HASHSIZE
    ** byte buffers. This function calculates the XOR of the two buffers
    ** and overwrites the contents of buffer pHash with it.
    */
    void sqlite3_hct_journal_xor(void *pHash, const void *pData);

Dealing With Holes in the Journal

If a journal contains holes, the following are true:

The best way to deal with holes in the journal is to fill them in by calling sqlite3_hct_journal_write() with values corresponding to the missing transactions obtained from elsewhere in the system. That way no data is lost. This API provides an alternative for when the missing transactions cannot be found anywhere in the system.

    /*
    ** Rollback transactions added using sqlite3_hct_journal_write().
    */
    int sqlite3_hct_journal_rollback(sqlite3 *db, i64 iCid);

    /* 
    ** Special values that may be passed as second argument to
    ** sqlite3_hct_journal_rollback().
    */
    #define SQLITE_HCT_ROLLBACK_MAXIMUM   0
    #define SQLITE_HCT_ROLLBACK_PRESERVE -1

The second argument passed to sqlite3_hct_journal_rollback() must be either:

It is not possible to use this API to rollback further than the first hole in the journal. This is because hctree does not guarantee that the information required to do such a rollback is still present in the database file.

Example of why this is Necessary

This might seem dramatic - it involves discarding transactions after all - but is necessary under some circumstances. Suppose a failure in the system leaves a follower node with transactions 1, 2, 3 and 5, but not 4. In this state transaction 5 must be discarded, as it may depend on transaction 4 - without transaction 4, transactions 1, 2, 3 and 5 may not constitute a valid database state. This is true even if the application logic does not appear to demand rigorous consistency. If, for example:

    -- Initial database state
    CREATE TABLE t1(a INTEGER PRIMARY KEY, b TEXT UNIQUE);
    INSERT INTO t1 VALUES(101, 'abc');

    DELETE FROM t1 WHERE a=101;         -- transaction 4
    INSERT INTO t1 VALUES(102, 'abc');  -- transaction 5

In follower mode, sqlite3_hct_journal_write() may be used to apply transaction 5 to the db before transaction 4. If, following a failure in the system, transaction 4 were lost while transaction 5 were not, the database UNIQUE constraint would be violated.

9. Application Programming Notes

This section contains a description of a simple replicated database system that could be constructed using the APIs above, along with observations made while designing and testing the same. This is not (unfortunately) a comprehensive set of instructions for building a high-concurrency replicated database. It is intended only to illustrate the roles that the APIs described above are expected to play in such a system.

Normal Operation

Once the system is up and running, with one node elected as leader (and the db in LEADER mode) and all others operating as followers (with the db in FOLLOWER mode):

Leader Node Follower Node Leader DB Follower DB Pool of db connections for writing on leader, each with its own socket connection to the follower node, and its own follower connection replicating its htransactions
linewid=0.1

LEADER:   box height 1.5 width 1.5 radius 0.1
move
FOLLOWER: box same

text at LEADER.n + (0.0,0.1) "Leader Node"
text at FOLLOWER.n + (0.0,0.1) "Follower Node"

cylinder "Leader" "DB" height 0.7 width 0.5 at LEADER.c - (0.2,0)
cylinder "Follower" "DB" same at FOLLOWER.c + (0.2,0)

down
L1: box with ne at LEADER.ne - (0.1,0.2) height 0.2 width 0.2
move 0.1 ; L2: box same
move 0.1 ; L3: box same
move 0.1 ; L4: box same

down
F1: box with nw at FOLLOWER.nw - (-0.1,0.2) height 0.2 width 0.2
move 0.1 ; F2: box same
move 0.1 ; F3: box same
move 0.1 ; F4: box same

arrow <-> from L1.e to F1.w
arrow <-> from L2.e to F2.w
arrow <-> from L3.e to F3.w
SOCKET: arrow <-> from L4.e to F4.w

right
T1: text at LEADER.s - (0.2,0.3) "Pool of db" "connections for" "writing on leader,"
T2: text "each with its own" "socket connection" "to the follower node,"
T3: text "and its own" "follower connection" "replicating its" "htransactions"

BTOP1: line from L1.nw-(0.1,0.0) then go 0.1 sw
BBOT1: line from L4.sw-(0.1,0.0) then go 0.1 nw
BMID1: line from BTOP1.sw to BBOT1.nw
line from BMID1 to T1.n

BTOP2: line from F1.ne+(0.1,0.0) then go 0.1 se
BBOT2: line from F4.se+(0.1,0.0) then go 0.1 ne
BMID2: line from BTOP2.se to BBOT2.ne
line from BMID2 to T3.n

line from T2.n to SOCKET.c



Adding a Node

A new follower node may be added to the system at any point. Before it starts the new follower node must have some local database - an old version of the logical db.

Failure of Leader Node

When the current leader node fails or is shut down, the system must (somehow) elect a new leader node from the remaining followers. At this point all nodes in the system (including the new leader) are in FOLLOWER mode.

System Startup

System startup or restart of a system consisting of multiple nodes, all with potentially complementary or conflicting journals, may be a complicated thing. However, a simple approach could be to:

Of course, this simple approach could lead to connection errors if any follower nodes have transactions in their journals that the initial leader does not have. Various approaches could be developed to account for this.

10. Table Details

The journal table (sqlite_hct_journal):

Column Contents
cid The integer CID (Commit ID) value assigned to the transaction. CID values are assigned in contiguous incrementing order.
schema If the transaction made any modifications to the database schema, then this field contains an SQL script to recreate them. If the transaction did not modify the database schema at all, then this field is set to a string zero characters in length.
data This field contains a blob that encodes the changes made to table contents by the transaction in a key-value format - where keys are the PRIMARY KEY field values of affected tables, and the values are either the new row data for the row, or a tombstone marker signifying a delete. The exact format is described here.
schemacid This field contains the "cid" value (an integer) corresponding to the transaction that created the version of the schema that this transaction was executed. In other words, the cid of the transaction that most recently modified the schema.
hash Hash of fields "cid", "schema", "data" and "schemacid".
tid The integer TID (Transaction ID) value used internally for the transaction by the local hctree node. This is used internally and it not duplicated between leader and follower database nodes.
validcid This is used internally and is not copied between leader and follower database nodes.

The baseline table (sqlite_hct_baseline):

Column Contents
cid The CID of the last journal entry deleted. All journal entries with CID values between 1 and this value, contiguous, have been deleted from the journal table.
hash A hash of the "hash" fields for all deleted journal entries.
schema_version The schemacid value of the last journal entry deleted.

11. Data Format

This section describes the format used by the blobs in the "data" column of sqlite_hct_journal.

If the entry does not modify any table rows, then the data blob is zero bytes in size. Otherwise, it consists of an 8-byte big-endian CID value identifying the database snapshot against which the transaction was originally run, followed by a series of entries. Each entry begins with either 'T', or else an upper or lower-case 'I' or 'D'. The format of the rest of the entry depends on its type. As follows:

Character Type Format
'T' New table. Nul-terminated UTF-8 table name.
'i' Insert on table with IPK or no PK. A varint containing the rowid value. Followed by an SQLite format record containing the other record fields.
'I' Insert on table with explicit non-INTEGER PK. An SQLite format record.
'd' Delete by rowid. A varint containing the rowid to delete
'D' Delete by explicit non-INTEGER PK. An SQLite record containing the PK of the row to delete.