SQLite

Check-in [ed817fc893]
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add two text files containing pager design notes to the doc/ subfolder.
Downloads: Tarball | ZIP archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: ed817fc893e7162ae0ff4022591f7e9e3b81d622
User & Date: drh 2010-05-06 11:55:57.000
Context
2010-05-06
11:56
Remove the noop-mutex implementations of mutex_held() and mutex_notheld() since they are both unreachable. (check-in: 6767b62a9a user: drh tags: trunk)
11:55
Add two text files containing pager design notes to the doc/ subfolder. (check-in: ed817fc893 user: drh tags: trunk)
11:32
Add test cases to test the libraries handling of corrupt wal-index headers. (check-in: 9465b267d4 user: dan tags: trunk)
Changes
Unified Diff Ignore Whitespace Patch
Added doc/pager-invariants.txt.
























































































































































>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
 *** Throughout this document, a page is deemed to have been synced
     automatically as soon as it is written when PRAGMA synchronous=OFF.
     Otherwise, the page is not synced until the xSync method of the VFS
     is called successfully on the file containing the page.

 *** Definition:  A page of the database file is said to be "overwriteable" if
     one or more of the following are true about the page:
 
     (a)  The original content of the page as it was at the beginning of
          the transaction has been written into the rollback journal and
          synced.
 
     (b)  The page was a freelist leaf page at the start of the transaction.
 
     (c)  The page number is greater than the largest page that existed in
          the database file at the start of the transaction.
 
 (1) A page of the database file is never overwritten unless one of the
     following are true:
 
     (a) The page and all other pages on the same sector are overwriteable.
 
     (b) The atomic page write optimization is enabled, and the entire
         transaction other than the update of the transaction sequence
         number consists of a single page change.
 
 (2) The content of a page written into the rollback journal exactly matches
     both the content in the database when the rollback journal was written
     and the content in the database at the beginning of the current
     transaction.
 
 (3) Writes to the database file are an integer multiple of the page size
     in length and are aligned to a page boundary.
 
 (4) Reads from the database file are either aligned on a page boundary and
     an integer multiple of the page size in length or are taken from the
     first 100 bytes of the database file.
 
 (5) All writes to the database file are synced prior to the rollback journal
     being deleted, truncated, or zeroed.
 
 (6) If a master journal file is used, then all writes to the database file
     are synced prior to the master journal being deleted.
 
 *** Definition: Two databases (or the same database at two points it time)
     are said to be "logically equivalent" if they give the same answer to
     all queries.  Note in particular the the content of freelist leaf
     pages can be changed arbitarily without effecting the logical equivalence
     of the database.
 
 (7) At any time, if any subset, including the empty set and the total set,
     of the unsynced changes to a rollback journal are removed and the 
     journal is rolled back, the resulting database file will be logical
     equivalent to the database file at the beginning of the transaction.
 
 (8) When a transaction is rolled back, the xTruncate method of the VFS
     is called to restore the database file to the same size it was at
     the beginning of the transaction.  (In some VFSes, the xTruncate
     method is a no-op, but that does not change the fact the SQLite will
     invoke it.)
 
 (9) Whenever the database file is modified, at least one bit in the range
     of bytes from 24 through 39 inclusive will be changed prior to releasing
     the EXCLUSIVE lock.

(10) The pattern of bits in bytes 24 through 39 shall not repeat in less
     than one billion transactions.

(11) A database file is well-formed at the beginning and at the conclusion
     of every transaction.

(12) An EXCLUSIVE lock must be held on the database file before making
     any changes to the database file.

(13) A SHARED lock must be held on the database file before reading any
     content out of the database file.
Added doc/vfs-shm.txt.


























































































































































































































































>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
The 5 states of an historical rollback lock as implemented by the
xLock, xUnlock, and xCheckReservedLock methods of the sqlite3_io_methods
objec are:

   UNLOCKED
   SHARED
   RESERVED
   PENDING
   EXCLUSIVE

The wal-index file has a similar locking hierarchy implemented using
the xShmLock method of the sqlite3_vfs object, but with 7
states.  Each connection to a wal-index file must be in one of
the following 7 states:

   UNLOCKED
   READ
   READ_FULL
   WRITE
   PENDING
   CHECKPOINT
   RECOVER

These roughly correspond to the 5 states of a rollback lock except
that SHARED is split out into 2 states: READ and READ_FULL and
there is an extra RECOVER state used for wal-index reconstruction.

The meanings of the various wal-index locking states is as follows:

   UNLOCKED    - The wal-index is not in use.

   READ        - Some prefix of the wal-index is being read. Additional
                 wal-index information can be appended at any time.  The
                 newly appended content will be ignored by the holder of
                 the READ lock.

   READ_FULL   - The entire wal-index is being read.  No new information
                 can be added to the wal-index.  The holder of a READ_FULL
                 lock promises never to read pages from the database file
                 that are available anywhere in the wal-index.

   WRITE       - It is OK to append to the wal-index file and to adjust
                 the header to indicate the new "last valid frame".

   PENDING     - Waiting on all READ locks to clear so that a
                 CHECKPOINT lock can be acquired.

   CHECKPOINT  - It is OK to write any WAL data into the database file
                 and zero the last valid frame field of the wal-index
                 header.  The wal-index file itself may not be changed
                 other than to zero the last valid frame field in the
                 header.

   RECOVER     - Held during wal-index recovery.  Used to prevent a
                 race if multiple clients try to recover a wal-index at
                 the same time.
   

A particular lock manager implementation may coalesce one or more of 
the wal-index locking states, though with a reduction in concurrency.
For example, an implemention might implement only exclusive locking,
in which case all states would be equivalent to CHECKPOINT, meaning that 
only one reader or one writer or one checkpointer could be active at a 
time.  Or, an implementation might combine READ and READ_FULL into 
a single state equivalent to READ, meaning that a writer could
coexist with a reader, but no reader or writers could coexist with a
checkpointer.

The lock manager must obey the following rules:

(1)  A READ cannot coexist with CHECKPOINT.
(2)  A READ_FULL cannot coexist with WRITE.
(3)  None of WRITE, PENDING, CHECKPOINT, or RECOVER can coexist.

The SQLite core will obey the next set of rules.  These rules are
assertions on the behavior of the SQLite core which might be verified
during testing using an instrumented lock manager.

(5)  No part of the wal-index will be read without holding either some
     kind of SHM lock or an EXCLUSIVE lock on the original database.
(6)  A holder of a READ_FULL will never read any page of the database
     file that is contained anywhere in the wal-index.
(7)  No part of the wal-index other than the header will be written nor
     will the size of the wal-index grow without holding a WRITE.
(8)  The wal-index header will not be written without holding one of
     WRITE, CHECKPOINT, or RECOVER.
(9)  A CHECKPOINT or RECOVER must be held in order to reset the last valid
     frame counter in the header of the wal-index back to zero.
(10) A WRITE can only increase the last valid frame pointer in the header.

The SQLite core will only ever send requests for UNLOCK, READ, WRITE,
CHECKPOINT, or RECOVER to the lock manager.   The SQLite core will never
request a READ_FULL or PENDING lock though the lock manager may deliver
those locking states in response to READ and CHECKPOINT requests,
respectively, if and only if the requested READ or CHECKPOINT cannot
be delivered.

The following are the allowed lock transitions:

       Original-State     Request         New-State
       --------------     ----------      ----------
(11a)  UNLOCK             READ            READ
(11b)  UNLOCK             READ            READ_FULL
(11c)  UNLOCK             CHECKPOINT      PENDING
(11d)  UNLOCK             CHECKPOINT      CHECKPOINT
(11e)  READ               UNLOCK          UNLOCK
(11f)  READ               WRITE           WRITE
(11g)  READ               RECOVER         RECOVER
(11h)  READ_FULL          UNLOCK          UNLOCK
(11i)  READ_FULL          WRITE           WRITE
(11j)  READ_FULL          RECOVER         RECOVER
(11k)  WRITE              READ            READ
(11l)  PENDING            UNLOCK          UNLOCK
(11m)  PENDING            CHECKPOINT      CHECKPOINT
(11n)  CHECKPOINT         UNLOCK          UNLOCK
(11o)  CHECKPOINT         RECOVER         RECOVER
(11p)  RECOVER            READ            READ
(11q)  RECOVER            CHECKPOINT      CHECKPOINT

These 17 transitions are all that needs to be supported.  The lock
manager implementation can assert that fact.  The other 25 possible
transitions among the 7 locking states will never occur.

The rules above are sufficient for correctness.  For maximum concurrency,
the following additional considerations apply: