/ Check-in [6433d366]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Bring this branch into closer alignment with begin-concurrent.
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | begin-concurrent-pnu
Files: files | file ages | folders
SHA3-256:6433d366175c5be586a0e11b540e76a2515466b358332c4356388f4c449ec0d7
User & Date: drh 2019-03-26 13:28:15
Context
2019-04-15
15:07
Merge the latest enhancements from trunk. check-in: 6caaaf21 user: drh tags: begin-concurrent-pnu
2019-03-26
13:28
Bring this branch into closer alignment with begin-concurrent. check-in: 6433d366 user: drh tags: begin-concurrent-pnu
12:16
Merge the latest trunk changes. check-in: 51e3e835 user: drh tags: begin-concurrent-pnu
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Added doc/begin_concurrent.md.























































































































































































































>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107

Begin Concurrent
================

## Overview

Usually, SQLite allows at most one writer to proceed concurrently. The
BEGIN CONCURRENT enhancement allows multiple writers to process write
transactions simultanously if the database is in "wal" or "wal2" mode,
although the system still serializes COMMIT commands.

When a write-transaction is opened with "BEGIN CONCURRENT", actually 
locking the database is deferred until a COMMIT is executed. This means
that any number of transactions started with BEGIN CONCURRENT may proceed
concurrently. The system uses optimistic page-level-locking to prevent
conflicting concurrent transactions from being committed.

When a BEGIN CONCURRENT transaction is committed, the system checks whether 
or not any of the database pages that the transaction has read have been
modified since the BEGIN CONCURRENT was opened. In other words - it asks 
if the transaction being committed operates on a different set of data than
all other concurrently executing transactions. If the answer is "yes, this
transaction did not read or modify any data modified by any concurrent
transaction", then the transaction is committed as normal. Otherwise, if the
transaction does conflict, it cannot be committed and an SQLITE_BUSY_SNAPSHOT
error is returned. At this point, all the client can do is ROLLBACK the
transaction.

If SQLITE_BUSY_SNAPSHOT is returned, messages are output via the sqlite3_log
mechanism indicating the page and table or index on which the conflict
occurred. This can be useful when optimizing concurrency.

## Application Programming Notes

In order to serialize COMMIT processing, SQLite takes a lock on the database
as part of each COMMIT command and releases it before returning. At most one
writer may hold this lock at any one time. If a writer cannot obtain the lock,
it uses SQLite's busy-handler to pause and retry for a while:

  <a href=https://www.sqlite.org/c3ref/busy_handler.html>
      https://www.sqlite.org/c3ref/busy_handler.html
  </a>

If there is significant contention for the writer lock, this mechanism can be
inefficient. In this case it is better for the application to use a mutex or
some other mechanism that supports blocking to ensure that at most one writer
is attempting to COMMIT a BEGIN CONCURRENT transaction at a time. This is
usually easier if all writers are part of the same operating system process.

If all database clients (readers and writers) are located in the same OS
process, and if that OS is a Unix variant, then it can be more efficient to
the built-in VFS "unix-excl" instead of the default "unix". This is because it
uses more efficient locking primitives.

The key to maximizing concurrency using BEGIN CONCURRENT is to ensure that
there are a large number of non-conflicting transactions. In SQLite, each
table and each index is stored as a separate b-tree, each of which is
distributed over a discrete set of database pages. This means that:

  * Two transactions that write to different sets of tables never 
    conflict, and that

  * Two transactions that write to the same tables or indexes only 
    conflict if the values of the keys (either primary keys or indexed 
    rows) are fairly close together. For example, given a large 
    table with the schema:

      <pre>     CREATE TABLE t1(a INTEGER PRIMARY KEY, b BLOB);</pre>

    writing two rows with adjacent values for "a" probably will cause a
    conflict (as the two keys are stored on the same page), but writing two
    rows with vastly different values for "a" will not (as the keys will likly
    be stored on different pages).

Note that, in SQLite, if values are not explicitly supplied for an INTEGER
PRIMARY KEY, as for example in:

>
     INSERT INTO t1(b) VALUES(&lt;blob-value>);

then monotonically increasing values are assigned automatically. This is
terrible for concurrency, as it all but ensures that all new rows are 
added to the same database page. In such situations, it is better to
explicitly assign random values to INTEGER PRIMARY KEY fields.

This problem also comes up for non-WITHOUT ROWID tables that do not have an
explicit INTEGER PRIMARY KEY column. In these cases each table has an implicit
INTEGER PRIMARY KEY column that is assigned increasing values, leading to the
same problem as omitting to assign a value to an explicit INTEGER PRIMARY KEY
column.

For both explicit and implicit INTEGER PRIMARY KEYs, it is possible to have
SQLite assign values at random (instead of the monotonically increasing
values) by writing a row with a rowid equal to the largest possible signed
64-bit integer to the table. For example:

     INSERT INTO t1(a) VALUES(9223372036854775807);

Applications should take care not to malfunction due to the presence of such
rows.

The nature of some types of indexes, for example indexes on timestamp fields,
can also cause problems (as concurrent transactions may assign similar
timestamps that will be stored on the same db page to new records). In these
cases the database schema may need to be rethought to increase the concurrency
provided by page-level-locking.

Changes to src/btree.c.

8020
8021
8022
8023
8024
8025
8026

8027
8028
8029
8030
8031
8032
8033
8034
8035
8036
8037
8038
8039
8040
8041
  pgno = get4byte(pRight);
  while( 1 ){
    rc = getAndInitPage(pBt, pgno, &apOld[i], 0, 0);
    if( rc ){
      memset(apOld, 0, (i+1)*sizeof(MemPage*));
      goto balance_cleanup;
    }

    if( apOld[i]->nFree<0 ){
      rc = btreeComputeFreeSpace(apOld[i]);
      if( rc ){
        memset(apOld, 0, (i)*sizeof(MemPage*));
        goto balance_cleanup;
      }
    }
    setMempageRoot(apOld[i], pgnoRoot);
    if( (i--)==0 ) break;

    if( pParent->nOverflow && i+nxDiv==pParent->aiOvfl[0] ){
      apDiv[i] = pParent->apOvfl[0];
      pgno = get4byte(apDiv[i]);
      szNew[i] = pParent->xCellSize(pParent, apDiv[i]);
      pParent->nOverflow = 0;







>







<







8020
8021
8022
8023
8024
8025
8026
8027
8028
8029
8030
8031
8032
8033
8034

8035
8036
8037
8038
8039
8040
8041
  pgno = get4byte(pRight);
  while( 1 ){
    rc = getAndInitPage(pBt, pgno, &apOld[i], 0, 0);
    if( rc ){
      memset(apOld, 0, (i+1)*sizeof(MemPage*));
      goto balance_cleanup;
    }
    setMempageRoot(apOld[i], pgnoRoot);
    if( apOld[i]->nFree<0 ){
      rc = btreeComputeFreeSpace(apOld[i]);
      if( rc ){
        memset(apOld, 0, (i)*sizeof(MemPage*));
        goto balance_cleanup;
      }
    }

    if( (i--)==0 ) break;

    if( pParent->nOverflow && i+nxDiv==pParent->aiOvfl[0] ){
      apDiv[i] = pParent->apOvfl[0];
      pgno = get4byte(apDiv[i]);
      szNew[i] = pParent->xCellSize(pParent, apDiv[i]);
      pParent->nOverflow = 0;

Changes to src/pager.c.

3212
3213
3214
3215
3216
3217
3218
3219






3220
3221
3222
3223
3224
3225
3226
  ** been written (but not committed) to the log file, do one of the 
  ** following:
  **
  **   + Discard the cached page (if refcount==0), or
  **   + Reload page content from the database (if refcount>0).
  */
  pPager->dbSize = pPager->dbOrigSize;
  rc = sqlite3WalUndo(pPager->pWal, pagerUndoCallback, (void *)pPager);






  pList = sqlite3PcacheDirtyList(pPager->pPCache);

#ifndef SQLITE_OMIT_CONCURRENT
  /* If this is an CONCURRENT transaction, then page 1 must be reread from 
  ** the db file, even if it is not dirty. This is because the b-tree layer 
  ** may have already zeroed the nFree and iTrunk header fields.  */
  if( rc==SQLITE_OK && (pList==0 || pList->pgno!=1) && pPager->pAllRead ){







|
>
>
>
>
>
>







3212
3213
3214
3215
3216
3217
3218
3219
3220
3221
3222
3223
3224
3225
3226
3227
3228
3229
3230
3231
3232
  ** been written (but not committed) to the log file, do one of the 
  ** following:
  **
  **   + Discard the cached page (if refcount==0), or
  **   + Reload page content from the database (if refcount>0).
  */
  pPager->dbSize = pPager->dbOrigSize;
  rc = sqlite3WalUndo(pPager->pWal, pagerUndoCallback, (void *)pPager,
#ifdef SQLITE_OMIT_CONCURRENT
      0
#else
      pPager->pAllRead!=0
#endif
  );
  pList = sqlite3PcacheDirtyList(pPager->pPCache);

#ifndef SQLITE_OMIT_CONCURRENT
  /* If this is an CONCURRENT transaction, then page 1 must be reread from 
  ** the db file, even if it is not dirty. This is because the b-tree layer 
  ** may have already zeroed the nFree and iTrunk header fields.  */
  if( rc==SQLITE_OK && (pList==0 || pList->pgno!=1) && pPager->pAllRead ){

Changes to src/sqliteInt.h.

1554
1555
1556
1557
1558
1559
1560

1561
1562
1563
1564
1565
1566
1567
1568
1569
#define SQLITE_Fts3Tokenizer  0x00400000  /* Enable fts3_tokenizer(2) */
#define SQLITE_EnableQPSG     0x00800000  /* Query Planner Stability Guarantee*/
#define SQLITE_TriggerEQP     0x01000000  /* Show trigger EXPLAIN QUERY PLAN */
#define SQLITE_ResetDatabase  0x02000000  /* Reset the database */
#define SQLITE_LegacyAlter    0x04000000  /* Legacy ALTER TABLE behaviour */
#define SQLITE_NoSchemaError  0x08000000  /* Do not report schema parse errors*/
#define SQLITE_Defensive      0x10000000  /* Input SQL is likely hostile */


#define SQLITE_NoopUpdate     0x01000000  /* UPDATE operations are no-ops */
/* Flags used only if debugging */
#define HI(X)  ((u64)(X)<<32)
#ifdef SQLITE_DEBUG
#define SQLITE_SqlTrace       HI(0x0001)  /* Debug print SQL as it executes */
#define SQLITE_VdbeListing    HI(0x0002)  /* Debug listings of VDBE progs */
#define SQLITE_VdbeTrace      HI(0x0004)  /* True to trace VDBE execution */
#define SQLITE_VdbeAddopTrace HI(0x0008)  /* Trace sqlite3VdbeAddOp() calls */







>

<







1554
1555
1556
1557
1558
1559
1560
1561
1562

1563
1564
1565
1566
1567
1568
1569
#define SQLITE_Fts3Tokenizer  0x00400000  /* Enable fts3_tokenizer(2) */
#define SQLITE_EnableQPSG     0x00800000  /* Query Planner Stability Guarantee*/
#define SQLITE_TriggerEQP     0x01000000  /* Show trigger EXPLAIN QUERY PLAN */
#define SQLITE_ResetDatabase  0x02000000  /* Reset the database */
#define SQLITE_LegacyAlter    0x04000000  /* Legacy ALTER TABLE behaviour */
#define SQLITE_NoSchemaError  0x08000000  /* Do not report schema parse errors*/
#define SQLITE_Defensive      0x10000000  /* Input SQL is likely hostile */
#define SQLITE_NoopUpdate     0x20000000  /* UPDATE operations are no-ops */


/* Flags used only if debugging */
#define HI(X)  ((u64)(X)<<32)
#ifdef SQLITE_DEBUG
#define SQLITE_SqlTrace       HI(0x0001)  /* Debug print SQL as it executes */
#define SQLITE_VdbeListing    HI(0x0002)  /* Debug listings of VDBE progs */
#define SQLITE_VdbeTrace      HI(0x0004)  /* True to trace VDBE execution */
#define SQLITE_VdbeAddopTrace HI(0x0008)  /* Trace sqlite3VdbeAddOp() calls */

Changes to src/wal.c.

3254
3255
3256
3257
3258
3259
3260


3261



3262
3263
3264
3265
3266
3267
3268
3269
3270







3271
3272
3273
3274
3275
3276
3277
** to the WAL since the start of the transaction. If the callback returns
** other than SQLITE_OK, it is not invoked again and the error code is
** returned to the caller.
**
** Otherwise, if the callback function does not return an error, this
** function returns SQLITE_OK.
*/


int sqlite3WalUndo(Wal *pWal, int (*xUndo)(void *, Pgno), void *pUndoCtx){



  int rc = SQLITE_OK;
  if( pWal->writeLock ){
    Pgno iMax = pWal->hdr.mxFrame;
    Pgno iFrame;
  
    /* Restore the clients cache of the wal-index header to the state it
    ** was in before the client began writing to the database. 
    */
    memcpy(&pWal->hdr, (void *)walIndexHdr(pWal), sizeof(WalIndexHdr));








    for(iFrame=pWal->hdr.mxFrame+1; 
        ALWAYS(rc==SQLITE_OK) && iFrame<=iMax; 
        iFrame++
    ){
      /* This call cannot fail. Unless the page for which the page number
      ** is passed as the second argument is (a) in the cache and 







>
>
|
>
>
>









>
>
>
>
>
>
>







3254
3255
3256
3257
3258
3259
3260
3261
3262
3263
3264
3265
3266
3267
3268
3269
3270
3271
3272
3273
3274
3275
3276
3277
3278
3279
3280
3281
3282
3283
3284
3285
3286
3287
3288
3289
** to the WAL since the start of the transaction. If the callback returns
** other than SQLITE_OK, it is not invoked again and the error code is
** returned to the caller.
**
** Otherwise, if the callback function does not return an error, this
** function returns SQLITE_OK.
*/
int sqlite3WalUndo(
  Wal *pWal, 
  int (*xUndo)(void *, Pgno), 
  void *pUndoCtx,
  int bConcurrent                 /* True if this is a CONCURRENT transaction */
){
  int rc = SQLITE_OK;
  if( pWal->writeLock ){
    Pgno iMax = pWal->hdr.mxFrame;
    Pgno iFrame;
  
    /* Restore the clients cache of the wal-index header to the state it
    ** was in before the client began writing to the database. 
    */
    memcpy(&pWal->hdr, (void *)walIndexHdr(pWal), sizeof(WalIndexHdr));
#ifndef SQLITE_OMIT_CONCURRENT
    if( bConcurrent ){
      pWal->hdr.aCksum[0]++;
    }
#else
    UNUSED_PARAMETER(bConcurrent);
#endif

    for(iFrame=pWal->hdr.mxFrame+1; 
        ALWAYS(rc==SQLITE_OK) && iFrame<=iMax; 
        iFrame++
    ){
      /* This call cannot fail. Unless the page for which the page number
      ** is passed as the second argument is (a) in the cache and 

Changes to src/wal.h.

30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
..
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
# define sqlite3WalLimit(x,y)
# define sqlite3WalClose(v,w,x,y,z)              0
# define sqlite3WalBeginReadTransaction(y,z)     0
# define sqlite3WalEndReadTransaction(z)
# define sqlite3WalDbsize(y)                     0
# define sqlite3WalBeginWriteTransaction(y)      0
# define sqlite3WalEndWriteTransaction(x)        0
# define sqlite3WalUndo(x,y,z)                   0
# define sqlite3WalSavepoint(y,z)
# define sqlite3WalSavepointUndo(y,z)            0
# define sqlite3WalFrames(u,v,w,x,y,z)           0
# define sqlite3WalCheckpoint(q,r,s,t,u,v,w,x,y,z) 0
# define sqlite3WalCallback(z)                   0
# define sqlite3WalExclusiveMode(y,z)            0
# define sqlite3WalHeapMemory(z)                 0
................................................................................
Pgno sqlite3WalDbsize(Wal *pWal);

/* Obtain or release the WRITER lock. */
int sqlite3WalBeginWriteTransaction(Wal *pWal);
int sqlite3WalEndWriteTransaction(Wal *pWal);

/* Undo any frames written (but not committed) to the log */
int sqlite3WalUndo(Wal *pWal, int (*xUndo)(void *, Pgno), void *pUndoCtx);

/* Return an integer that records the current (uncommitted) write
** position in the WAL */
void sqlite3WalSavepoint(Wal *pWal, u32 *aWalData);

/* Move the write position of the WAL back to iFrame.  Called in
** response to a ROLLBACK TO command. */







|







 







|







30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
..
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
# define sqlite3WalLimit(x,y)
# define sqlite3WalClose(v,w,x,y,z)              0
# define sqlite3WalBeginReadTransaction(y,z)     0
# define sqlite3WalEndReadTransaction(z)
# define sqlite3WalDbsize(y)                     0
# define sqlite3WalBeginWriteTransaction(y)      0
# define sqlite3WalEndWriteTransaction(x)        0
# define sqlite3WalUndo(w,x,y,z)                 0
# define sqlite3WalSavepoint(y,z)
# define sqlite3WalSavepointUndo(y,z)            0
# define sqlite3WalFrames(u,v,w,x,y,z)           0
# define sqlite3WalCheckpoint(q,r,s,t,u,v,w,x,y,z) 0
# define sqlite3WalCallback(z)                   0
# define sqlite3WalExclusiveMode(y,z)            0
# define sqlite3WalHeapMemory(z)                 0
................................................................................
Pgno sqlite3WalDbsize(Wal *pWal);

/* Obtain or release the WRITER lock. */
int sqlite3WalBeginWriteTransaction(Wal *pWal);
int sqlite3WalEndWriteTransaction(Wal *pWal);

/* Undo any frames written (but not committed) to the log */
int sqlite3WalUndo(Wal *pWal, int (*xUndo)(void *, Pgno), void *pUndoCtx, int);

/* Return an integer that records the current (uncommitted) write
** position in the WAL */
void sqlite3WalSavepoint(Wal *pWal, u32 *aWalData);

/* Move the write position of the WAL back to iFrame.  Called in
** response to a ROLLBACK TO command. */

Changes to test/concfault.test.

78
79
80
81
82
83
84
































85





86


  }
} -test {
  faultsim_test_result {0 {}} 
  catchsql { ROLLBACK }
  faultsim_integrity_check
}

































finish_test















>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
|
>
>
>
>
>
|
>
>
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
  }
} -test {
  faultsim_test_result {0 {}} 
  catchsql { ROLLBACK }
  faultsim_integrity_check
}

#-------------------------------------------------------------------------
reset_db

do_execsql_test 2.0 {
  PRAGMA auto_vacuum = 0;
  PRAGMA journal_mode = wal;
  CREATE TABLE t1(a PRIMARY KEY, b);
  CREATE TABLE t2(a PRIMARY KEY, b);
  INSERT INTO t1 VALUES(randomblob(1000), randomblob(100));
  INSERT INTO t1 SELECT randomblob(1000), randomblob(1000) FROM t1;
  INSERT INTO t1 SELECT randomblob(1000), randomblob(1000) FROM t1;
  INSERT INTO t1 SELECT randomblob(1000), randomblob(1000) FROM t1;
  INSERT INTO t1 SELECT randomblob(1000), randomblob(1000) FROM t1;
  DELETE FROM t1 WHERE rowid%2;
} {wal}

faultsim_save_and_close
do_faultsim_test 1 -prep {
  faultsim_restore_and_reopen
  execsql {
    SELECT * FROM t1;
    BEGIN CONCURRENT;
      INSERT INTO t2 VALUES(1, 2);
  }
  sqlite3 db2 test.db
  execsql {
    PRAGMA journal_size_limit = 10000;
    INSERT INTO t1 VALUES(randomblob(1000), randomblob(1000));
  } db2
  db2 close
} -body {
  execsql { COMMIT }
} -test {
  faultsim_test_result {0 {}} 
  catchsql { ROLLBACK }
  set res [catchsql { SELECT count(*) FROM t1 }]
  if {$res!="0 9"} { error "expected {0 9} got {$res}" }
  faultsim_integrity_check
}

finish_test

Changes to test/permutations.test.

85
86
87
88
89
90
91

92
93
94
95
96
97
98
#   $allquicktests
#
set alltests [list]
foreach f [glob $testdir/*.test] { lappend alltests [file tail $f] }
foreach f [glob -nocomplain            \
    $testdir/../ext/rtree/*.test       \
    $testdir/../ext/fts5/test/*.test   \

    $testdir/../ext/lsm1/test/*.test   \
] {
  lappend alltests $f 
}
foreach f [glob -nocomplain $testdir/../ext/session/*.test] { 
  lappend alltests $f 
}







>







85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
#   $allquicktests
#
set alltests [list]
foreach f [glob $testdir/*.test] { lappend alltests [file tail $f] }
foreach f [glob -nocomplain            \
    $testdir/../ext/rtree/*.test       \
    $testdir/../ext/fts5/test/*.test   \
    $testdir/../ext/expert/*.test      \
    $testdir/../ext/lsm1/test/*.test   \
] {
  lappend alltests $f 
}
foreach f [glob -nocomplain $testdir/../ext/session/*.test] { 
  lappend alltests $f 
}