SQLite4
Check-in [ec9a805164]
Not logged in

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add other notes to lsm.wiki.
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: ec9a80516479bb7c2ef1c58b313b59645f1d5dee
User & Date: dan 2013-02-21 19:56:37
Context
2013-02-22
20:23
Fix over-length source code lines in sqliteInt.h. check-in: 035df9b1de user: drh tags: trunk
2013-02-21
19:56
Add other notes to lsm.wiki. check-in: ec9a805164 user: dan tags: trunk
17:53
Add a summary of locks to lsm.wiki. check-in: 3987cb831e user: dan tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to www/lsm.wiki.

1
2
3
4


5
6
7
8


9

10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
..
71
72
73
74
75
76
77
78
79
80

81
82


83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
...
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158

159

160
161

162
163













164
165
166





167
168
169
170












171
172

173
174
175
176
177
178






179



















180
181
182
183
184




185
186
187
188
189
190



191
192
193
194










195
196
197




















198





199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
...
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
...
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
...
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
...
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
...
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
...
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
...
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
...
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
...
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
...
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
...
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
...
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
...
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
...
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878

<title>LSM Design Overview</title>
<nowiki>




<div id=start_of_toc></div>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#summary style=text-decoration:none>1. Summary </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#data_structures style=text-decoration:none>2. Data Structures </a><br>


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#locks style=text-decoration:none>2.1. Locks</a><br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#database_file style=text-decoration:none>2.2. Database file</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#sorted_runs style=text-decoration:none>2.2.1. Sorted Runs</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#levels style=text-decoration:none>2.2.2. Levels</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#snapshots style=text-decoration:none>2.2.3. Snapshots</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#in-memory_tree style=text-decoration:none>2.3. In-Memory Tree</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#memory_allocation style=text-decoration:none>2.3.1. Memory Allocation</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#header_fields style=text-decoration:none>2.3.2. Header Fields</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#other_shared-memory_fields style=text-decoration:none>2.4. Other Shared-Memory Fields</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#log_file style=text-decoration:none>2.5. Log file</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#database_recovery_and_shutdown style=text-decoration:none>3. Database Recovery and Shutdown</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#read-write_clients style=text-decoration:none>3.1. Read-write clients</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#database_operations style=text-decoration:none>4. Database Operations </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#reading style=text-decoration:none>4.1. Reading</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#writing style=text-decoration:none>4.2. Writing</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#flushing_the_in-memory_tree_to_disk style=text-decoration:none>4.2.1. Flushing the in-memory tree to disk</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#shared-memory_management style=text-decoration:none>4.2.2. Shared-memory management</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#log_file_management style=text-decoration:none>4.2.3. Log file management</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#working style=text-decoration:none>4.3. Working</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#free-block_list_management style=text-decoration:none>4.3.1. Free-block list management</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#checkpoint_operations style=text-decoration:none>4.4. Checkpoint Operations</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#scheduling_policies style=text-decoration:none>5. Scheduling Policies</a><br>

<div id=end_of_toc></div>

<h1 id=summary>1. Summary </h1>

The LSM embedded database software stores data in three distinct data
structures:
................................................................................
<p>
When an application writes to the database, the new data is written to the
in-memory tree. Once the in-memory tree has grown large enough, its contents
are written into the database file as a new sorted run. To reduce the number
of sorted runs in the database file, chronologically adjacent sorted runs 
may be merged together into a single run, either automatically or on demand.

<h1 id=locking>2. Locks and Locking Protocols</h1> <p>
Read/write (shared/exclusive) file locks are used to control concurrent 
access. LSM uses the following file-locks:


<table>


  <tr><td valign=top>DMS1
      <td><p style=margin-top:0>
          This locking byte is used to serialize all connection and
          disconnection operations performed by read-write database
          connections. An EXCLUSIVE lock is taken for the duration of all such
          operations.

          <p>Additionally, read-only connections take a SHARED lock on this
          locking byte while attempting to connect to a database. This ensures
          that a read-only connection does not attempt to connect to the
          database while a read-write clients connection or disconnection
          operation is ongoing.

  <tr><td valign=top>DMS2
      <td><p style=margin-top:0>
          Read-write connections hold a SHARED lock on this locking byte for
          as long as they are connected to the database.

  <tr><td valign=top>DMS3
      <td><p style=margin-top:0>
          Read-only connections hold a SHARED lock on this locking byte for
          as long as they are connected to the database.

  <tr><td valign=top>RWCLIENT(n)
      <td><p style=margin-top:0>
          There are a total of 16 RWCLIENT locking bytes. After a read-write
          client connects to the database it attempts to find a free RWCLIENT
          locking slot to take an EXCLUSIVE lock on. If it cannot find one,
          this is not an error. If it can, then the lock is held for as long 
          as the read-write client is connected to the database for.

          <p>The sole purpose of these locks is that they allow a read-only
          client to detect whether or not there exists at least one read-write
................................................................................
          one or more connected read-write clients but none of them hold a
          RWCLIENT lock. This is not important - if a read-only client fails
          to detect that the system has read-write clients it may be less
          efficient, but will not malfunction.

  <tr><td valign=top>WRITER
      <td><p style=margin-top:0>
          A database client holds an EXCLUSIVE lock on this locking byte
          while writing data to the database. Outside of recovery, only clients
          holding this lock may modify the contents of the in-memory b-tree.
          Holding this lock is synonymous with having an open write transaction
          on the database.

  <tr><td valign=top>WORKER
      <td><p style=margin-top:0>
          A database client holds an EXCLUSIVE lock on this locking byte
          while performing database work (writing data into the body of the
          database file).

  <tr><td valign=top>CHECKPOINTER&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
      <td><p style=margin-top:0>
          A database client holds an EXCLUSIVE lock on this locking byte
          while performing a checkpoint (syncing the database file and 
          writing to the database header).

  <tr><td valign=top>ROTRANS
      <td><p style=margin-top:0>
          A read-only database client holds a SHARED lock on this locking byte
          while reading from a non-live database system.

  <tr><td valign=top>READER(n)
      <td><p style=margin-top:0>
          There are a total of 6 READER locking bytes. Unless it is a read-only
          client reading from a non-live database, a client holds a SHARED lock
          on one of these while it has an open read transaction.  Each READER
          lock is associated with a pair of id values identifying the regions
          of the in-memory tree and database file that may be read by clients
          holding such SHARED locks.
</table>





<h1 id=data_structures>2. Data Structures </h1>


<h2 id=locks>2.1. Locks</h2>













<p>
Read/write (shared/exclusive) file locks are used to control concurrent 
access. LSM uses the following file-locks:






<ul>
  <li> <p>The <b>DMS1</b>, <b>DMS2</b> locking regions. These are used to
       implement the "dead-man-switch" mechanism copied from SQLite's WAL












       mode for safely connecting to and disconnecting from a database. 
       See "Database Recovery and Shutdown" below.


  <li> <p>Several (say 3) <b>READER</b> locking regions. Database clients 
       hold a SHARED lock one of the READER locking regions while reading the
       database. As in SQLite WAL mode, each reader lock is paired with a 
       value indicating the version of the database that the client process 
       is using. The value consists of two fields:


























  <ul>
    <li> A 64-bit snapshot id. This identifies the version of the database file
         that the reader is accessing.
    <li> A 32-bit shared-memory chunk id. This identifies the version of the
         in-memory tree the reader is reading.




  </ul>

  <li> <p>The <b>WRITER</b> locking region. A database client holds an 
       EXCLUSIVE lock on this locking region while writing data to the 
       database. Outside of recovery, only clients holding this lock may
       modify the contents of the in-memory b-tree.




  <li> <p>The <b>WORKER</b> lock. A database client holds an EXCLUSIVE lock
       on this locking region while writing data into the body of the 
       database file.











  <li> <p>The <b>CHECKPOINTER</b> lock. A database client holds an EXCLUSIVE
       lock on this locking region while writing to the database file header.




















</ul>






<p>
In the following sections, "the WRITER lock", refers to an exclusive lock
on the WRITER locking region. For example "holding the WRITER lock" is
equivalent to "holding an exclusive lock on the WRITER locking region".
Similar interpretations apply to "the WORKER lock" and "the CHECKPOINTER 
lock".

<h2 id=database_file>2.2. Database file</h2>

<p>
This section summarizes the contents of the database file informally. A 
detailed description is found in the header comments for source code files 
<a href="../src/lsm_file.c">lsm_file.c</a> (blocks, pages etc.), 
<a href="../src/lsm_sorted.c">lsm_sorted.c</a> (sorted run format) and 
<a href="../src/lsm_ckpt.c">lsm_ckpt.c</a> (database snapshot format). 
................................................................................

<p>
As with an SQLite database file, each page in the database may be addressed 
by its 32-bit page number. This means the maximum database size is roughly
(pgsz * 2^32) bytes. The first and last pages in each block are 4 bytes
smaller than the others. This is to make room for a single page-number.

<h3 id=sorted_runs>2.2.1. Sorted Runs</h3>

<p>
A single sorted run is spread across one or more database pages (each page
is a part of at most one sorted run). Given the page number of a page in a
sorted run the following statements are true:

<ul>
................................................................................
In other words, given the page numbers of the first and last pages of a 
sorted run and the page number of the root page for the embedded b-tree,
it is possible to traverse the entire run in either direction or query for
arbitrary values.

<p><span style="color:red"> TODO: Embedded pointers.  </span>

<h3 id=levels>2.2.2. Levels</h3>

<p>
Each sorted run is assigned to a "level". Normally, a level consists of a
single sorted run. However, a level may also consist of a set of sorted runs
being incrementally merged into a single run.

<p>
................................................................................
time for all entries.






<h3 id=snapshots>2.2.3. Snapshots</h3>

<p>
Each meta page may contain a database <b>snapshot</b>. A snapshot contains all 
the information required to interpret the remainder of the database file (the
sorted runs and free space). Specifically, it contains:

<ul>
................................................................................
       Recovery and Shutdown" below).
</ul>

<p>
A more detailed description is available in the header comments in
source code file <a href="../src/lsm_ckpt.c">lsm_ckpt.c</a>

<h2 id=in-memory_tree>2.3. In-Memory Tree</h2>

<p>
The in-memory tree is an append-only b-tree of order 4 (each node may have 
up to 4 children), which is more or less equivalent to a red-black tree. 
An append-only tree is convenient, as it naturally supports the 
single-writer/many-readers MVCC concurrency model. 

<p>
The implementation includes some optimizations to reduce the number of 
interior nodes that are updated when a leaf node is written that are not 
described here. See header comments in source code file 
<a href=../src/lsm_tree.c>lsm_tree.c</a> for details.

<h3 id=memory_allocation>2.3.1. Memory Allocation</h3>

<p>
More than one in-memory 
tree may exist in shared-memory at any time. For example in the following 
scenario:

<ol>
................................................................................
but the values that connect the linked list together are not. The writer 
that detects the failure must scan the entire shared-memory region to 
reconstruct the linked list. Any sequence ids assigned by the failed 
writer are reverted (perhaps not to their original values, but to values
that put them at the start of the linked list - before those chunks that
may still be in use by existing readers).

<h3 id=header_fields>2.3.2. Header Fields</h3>
<p>
As well as the in-memory tree data, the following fixed-size fields 
stored in well-known locations in shared-memory are part of the in-memory
tree. Like the in-memory tree data, outside of recovery these fields are only
ever written to by clients holding the WRITER lock.

<ul>
................................................................................
       transaction and cleared after that transaction is successfully 
       concluded - the "writer flag". This is used to detect failures that
       occur mid-transaction. It is only ever read (or written) by clients
       that hold the WRITER lock.
</ul>


<h2 id=other_shared-memory_fields>2.4. Other Shared-Memory Fields</h2>

<ul>
  <li> Snapshot 1.
  <li> Snapshot 2.
  <li> The meta-page pointer. This value is either 1 or 2. It indicates which
       of the two meta-pages contains the most recent database snapshot.
  <li> READER lock values.
</ul>

<h2 id=log_file>2.5. Log file</h2>

<a href=../src/lsm_log.c>lsm_log.c</a>.

<h1 id=database_recovery_and_shutdown>3. Database Recovery and Shutdown</h1>

<h2 id=read-write_clients>3.1. Read-write clients</h2>

<p>
Exclusive locks on locking region DMS1 are used to serialize all connect and
disconnect operations performed by read-write clients. 

<p>When an LSM database connection is opened (i.e. lsm_open() is called):

<pre>
  lock(DMS1, EXCLUSIVE)           # Block until successful
    lock(DMS2, EXCLUSIVE)         # Abandon if not immediately successful
    if( DMS2 successfully locked ){
      zero *-shm file (or equivalent)
      run recovery
    }
    if( error during recovery ){
      unlock(DMS2)
    }else{
      lock(DMS2, SHARED)          # "cannot" fail
    }
  unlock(DMS1)
</pre>

<p>
Running recovery involves reading the database file header and log file
to initialize the contents of shared-memory. Recovery is only run when the
first connection connects to the database file. There are no circumstances
(including the unexpected failure of a writer process) that may cause 
recovery to take place after the first client has successfully connected.

<p>
Assuming recovery is successful (or not required), the database client is
left holding a SHARED lock on DMS2. This lock is held for as long as the
database connection remains open.

<p>
When disconnecting from a database (i.e. an lsm_close() call following a
successful lsm_open()):

<pre>
  lock(DMS1, EXCLUSIVE)           # Block until successful
    lock(DMS2, EXCLUSIVE)         # Abandon if not immediately successful
    if( DMS2 successfully locked ){
      ...TODO...
      delete *-shm file (or equivalent)
    }
    unlock(DMS2)
  unlock(DMS1)
</pre>

<h2 id=read-only_clients>3.1. Read-only clients</h2>

<p>It is assumed that read-only clients may take SHARED locks only. And
that a read-only client may not run database recovery when a db is opened 
in multi-process mode.

<p>

<h1 id=database_operations>4. Database Operations </h1>

<h2 id=reading>4.1. Reading</h2>

<p>
Opening a read transaction:

<ol>
  <li> <p>Load the current tree-header from shared-memory.

................................................................................
Once a read transaction is opened, the reader may continue to read the 
versions of the in-memory tree and database file for as long as the READER
lock is held.
<p>
To close a read transaction all that is required is to drop the SHARED lock
held on the READER slot. 

<h2 id=writing>4.2. Writing</h2>

<p>
To open a write transaction:

<ol>
  <li> <p>Open a read transaction, if one is not already open.
  <li> <p>Obtain the WRITER lock.
................................................................................

  <li> Sweep the shared-memory area to rebuild the linked list of chunks so
       that it is consistent with the current tree-header.

  <li> Clear the writer flag.
</ol>

<h3 id=flushing_the_in-memory_tree_to_disk>4.2.1. Flushing the in-memory tree to disk</h3>

<p>
For the purposes of writing, the database file and the in-memory tree are
largely independent. Processes holding the WRITER lock write to the in-memory
tree, and processes holding the WORKER lock write to the database file.

<ol>
................................................................................

  <li> Update the private copy of the tree-header to reflect a new, empty tree.

  <li> Commit the write transaction, writing the new, empty tree to
       shared-memory.
</ol>

<h3 id=shared-memory_management>4.2.2. Shared-memory management</h3>

<p>
A writer client may have to allocate new shared-memory chunks. This can be
done either by extending the shared-memory region or by recycling the first
chunk in the linked-list. To check if the first chunk in the linked-list may
be reused, the writer must check that:

................................................................................

  <li> The chunk is not part of an in-memory tree being used by an existing
       reader. A writer checks this by scanning (and possibly updating) the
       values associated with the READER locks - similar to the way SQLite 
       does in WAL mode.
</ul>

<h3 id=log_file_management>4.2.3. Log file management</h3>

<p>
A writer client also writes to the log file. All information required to write
to the log file  (the offset to write to and the initial checksum values) is
embedded in the tree-header. Except, in order to reuse log file space (wrap
around to the start of the log file), a writer needs to know that the space
being recycled will not be required by any recovery process in the future.
................................................................................
<p>
To determine whether or not the log file can be wrapped, the writer requires
access to information stored in the newest snapshot written into the database
header. Their exists a shared-memory variable indicating which of the two
meta-pages contain this snapshot, but the writer process still has to read the
snapshot data and verify its checksum from disk.

<h2 id=working>4.3. Working</h2>

<p>
Working is similar to writing. The difference is that a "writer" modifies
the in-memory tree. A "worker" modifies the contents of the database file.

<ol>
  <li> <p>Take the WORKER lock.
................................................................................
  <li> <p>Invoke xShmBarrier().

  <li> <p>Update snapshot-1 in shared-memory.

  <li> <p>Release the WORKER lock.
</ol>

<h3 id=free-block_list_management>4.3.1. Free-block list management</h3>

<p>
Worker clients occasionally need to allocate new database blocks or move
existing blocks to the free-block list. Along with the block number of each
free block, the free-block list contains the snapshot-id of the first 
snapshot created after the block was moved to the free list. The free-block
list is always stored in order of snapshot-id, so that the first block in
................................................................................
       header. This is done by reading (and verifying the checksum) of the
       snapshot currently stored in the database meta-page indicated by the
       shared-memory variable.
</ul>



<h2 id=checkpoint_operations>4.4. Checkpoint Operations</h2>

<ol>
  <li> Take CHECKPOINTER lock.

  <li> Load snapshot-1 from shared-memory. If the checksum does not match
       the content here, release the CHECKPOINTER lock and abandon the 
       attempt to checkpoint the database.
................................................................................
  <li> Sync the database file again.

  <li> Update the shared-memory variable to indicate the meta-page written in
       step 5.

  <li> Drop the CHECKPOINTER lock.
</ol>

<h1 id=scheduling_policies>5. Scheduling Policies</h1>

<p>
When a client writes to a database, the in-memory tree and log file are
updated by the client itself before the lsm_write() call returns. Eventually, 
once sufficient writes have accumulated in memory, the client marks the 
current tree as "old", and subsequent writes are accumulated in a new tree.

<p>
In order to prevent the in-memory tree and log file from growing indefinitely,
at some point in the future the following must occur:

<ul>
  <li>The contents of the old tree must be written into the database file
      (a WORKER lock operation). Once this is done the memory used to store the
      old tree is available for reuse.

  <li>A checkpoint operation must take place to sync the data into the 
      database file and update the database header (a CHECKPOINT lock 
      operation). Once this has been done the log file space that was used 
      to store the data may be reclaimed.
</ul>

<p>
In addition to the above, it is necessary to perform a certain amount of 
work on the database to merge existing levels together. This is not just
to speed up queries - there is a hard limit of roughly 40 levels to stop
database snapshots from growing overly large.

<p><b> Explicit Calls to lsm_work() and lsm_checkpoint() </b>

<p><b> Compulsory work </b>

<ul>
  <li><p> If a writer tries to mark a tree as "old", but there is already an
       old tree in-memory, the writer attempts to grab the WORKER lock and
       write both the old and new tree to a new database level.

      <p> If the WORKER lock cannot be obtained immediately, block until it
       can be
</ul>

<p><b> Auto work </b>










>
>



|
>
>
|
>
|
|
|
|
|
|
|
|
|
<
<
|
|
|
|
|
|
|
|
|
<







 







|

|
>

<
>
>
|

|





|






|




|




|







 







|







|



|

|





|




|
|
|
|
|
|


>

>

<
>

<
>
>
>
>
>
>
>
>
>
>
>
>
>

<
<
>
>
>
>
>

<
<
<
>
>
>
>
>
>
>
>
>
>
>
>
|
<
>

<
<
<
<
<
>
>
>
>
>
>

>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
|
<
<
<
<
>
>
>
>
|

<
<
<
<
>
>
>

<
<
<
>
>
>
>
>
>
>
>
>
>
|
<
<
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>

>
>
>
>
>








|







 







|







 







|







 







|







 







|













|







 







|







 







|









|



<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
|

|







 







|







 







|







 







|







 







|







 







|







 







|







 







|







 








<

<
<
<
<
<

<
<
<

<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23


24
25
26
27
28
29
30
31
32

33
34
35
36
37
38
39
..
73
74
75
76
77
78
79
80
81
82
83
84

85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
...
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166

167
168

169
170
171
172
173
174
175
176
177
178
179
180
181
182


183
184
185
186
187
188



189
190
191
192
193
194
195
196
197
198
199
200
201

202
203





204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230




231
232
233
234
235
236




237
238
239
240



241
242
243
244
245
246
247
248
249
250
251


252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
...
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
...
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
...
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
...
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
...
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
...
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562




























































563
564
565
566
567
568
569
570
571
572
...
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
...
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
...
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
...
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
...
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
...
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
...
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
...
841
842
843
844
845
846
847
848

849





850



851






































<title>LSM Design Overview</title>
<nowiki>




<div id=start_of_toc></div>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#summary style=text-decoration:none>1. Summary </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#locks style=text-decoration:none>2. Locks </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#database_connect_and_disconnect_operations style=text-decoration:none>3. Database Connect and Disconnect Operations</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#read-write_clients style=text-decoration:none>3.1. Read-write clients</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#read-only_clients style=text-decoration:none>3.2. Read-only clients</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#data_structures style=text-decoration:none>4. Data Structures </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#database_file style=text-decoration:none>4.1. Database file</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#sorted_runs style=text-decoration:none>4.1.1. Sorted Runs</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#levels style=text-decoration:none>4.1.2. Levels</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#snapshots style=text-decoration:none>4.1.3. Snapshots</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#in-memory_tree style=text-decoration:none>4.2. In-Memory Tree</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#memory_allocation style=text-decoration:none>4.2.1. Memory Allocation</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#header_fields style=text-decoration:none>4.2.2. Header Fields</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#other_shared-memory_fields style=text-decoration:none>4.3. Other Shared-Memory Fields</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#log_file style=text-decoration:none>4.4. Log file</a><br>


&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#database_operations style=text-decoration:none>5. Database Operations </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#reading style=text-decoration:none>5.1. Reading</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#writing style=text-decoration:none>5.2. Writing</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#flushing_the_in-memory_tree_to_disk style=text-decoration:none>5.2.1. Flushing the in-memory tree to disk</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#shared-memory_management style=text-decoration:none>5.2.2. Shared-memory management</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#log_file_management style=text-decoration:none>5.2.3. Log file management</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#working style=text-decoration:none>5.3. Working</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#free-block_list_management style=text-decoration:none>5.3.1. Free-block list management</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#checkpoint_operations style=text-decoration:none>5.4. Checkpoint Operations</a><br>


<div id=end_of_toc></div>

<h1 id=summary>1. Summary </h1>

The LSM embedded database software stores data in three distinct data
structures:
................................................................................
<p>
When an application writes to the database, the new data is written to the
in-memory tree. Once the in-memory tree has grown large enough, its contents
are written into the database file as a new sorted run. To reduce the number
of sorted runs in the database file, chronologically adjacent sorted runs 
may be merged together into a single run, either automatically or on demand.

<h1 id=locks>2. Locks</h1> 
Read/write (shared/exclusive) file locks are used to control concurrent 
access. LSM uses the following "locking regions". Each locking region may
be locked and unlocked separately.


<p>
<table style="margin:0 4ex">
  <tr><td valign=top style="width:16ex">DMS1
      <td><p style=margin-top:0>
          This locking region is used to serialize all connection and
          disconnection operations performed by read-write database
          connections. An EXCLUSIVE lock is taken for the duration of all such
          operations.

          <p>Additionally, read-only connections take a SHARED lock on this
          locking region while attempting to connect to a database. This ensures
          that a read-only connection does not attempt to connect to the
          database while a read-write clients connection or disconnection
          operation is ongoing.

  <tr><td valign=top>DMS2
      <td><p style=margin-top:0>
          Read-write connections hold a SHARED lock on this locking region for
          as long as they are connected to the database.

  <tr><td valign=top>DMS3
      <td><p style=margin-top:0>
          Read-only connections hold a SHARED lock on this locking region for
          as long as they are connected to the database.

  <tr><td valign=top>RWCLIENT(n)
      <td><p style=margin-top:0>
          There are a total of 16 RWCLIENT locking regions. After a read-write
          client connects to the database it attempts to find a free RWCLIENT
          locking slot to take an EXCLUSIVE lock on. If it cannot find one,
          this is not an error. If it can, then the lock is held for as long 
          as the read-write client is connected to the database for.

          <p>The sole purpose of these locks is that they allow a read-only
          client to detect whether or not there exists at least one read-write
................................................................................
          one or more connected read-write clients but none of them hold a
          RWCLIENT lock. This is not important - if a read-only client fails
          to detect that the system has read-write clients it may be less
          efficient, but will not malfunction.

  <tr><td valign=top>WRITER
      <td><p style=margin-top:0>
          A database client holds an EXCLUSIVE lock on this locking region
          while writing data to the database. Outside of recovery, only clients
          holding this lock may modify the contents of the in-memory b-tree.
          Holding this lock is synonymous with having an open write transaction
          on the database.

  <tr><td valign=top>WORKER
      <td><p style=margin-top:0>
          A database client holds an EXCLUSIVE lock on this locking region
          while performing database work (writing data into the body of the
          database file).

  <tr><td valign=top>CHECKPOINTER
      <td><p style=margin-top:0>
          A database client holds an EXCLUSIVE lock on this locking region
          while performing a checkpoint (syncing the database file and 
          writing to the database header).

  <tr><td valign=top>ROTRANS
      <td><p style=margin-top:0>
          A read-only database client holds a SHARED lock on this locking region
          while reading from a non-live database system.

  <tr><td valign=top>READER(n)
      <td><p style=margin-top:0>
          There are a total of 6 READER locking regions. Unless it is a
          read-only client reading from a non-live database, a client holds a
          SHARED lock on one of these while it has an open read transaction.
          Each READER lock is associated with a pair of id values identifying
          the regions of the in-memory tree and database file that may be read
          by clients holding such SHARED locks.
</table>

<h1 id=database_connect_and_disconnect_operations>3. Database Connect and Disconnect Operations</h1>

<h2 id=read-write_clients>3.1. Read-write clients</h2>


<p>When an LSM database connection is opened (i.e. lsm_open() is called):


<pre>
  lock(DMS1, EXCLUSIVE)                # Block until successful
    trylock(DMS2+DMS3, EXCLUSIVE)
    if( trylock() successful ){
      zero shared memory and run recovery
    }
    if( no error during recovery ){
      lock(DMS2, SHARED)               # "cannot" fail
      lock(RWCLIENT(x), EXCLUSIVE)     # see comment below
    }
  unlock(DMS1)
</pre>

<p>


Running recovery involves reading the database file header and log file
to initialize the contents of shared-memory. Recovery is only run when the
first connection connects to the database file. There are no circumstances
(including the unexpected failure of a writer process) that may cause 
recovery to take place after the first client has successfully connected.




<p>
After the SHARED lock on DMS2 is taken, an effort is made to find a free
RWCLIENT locking region and take an EXCLUSIVE lock on it. If no such region
can be found, this step is omitted. It is not an error if this happens.

<p>
Assuming recovery is successful (or not required), the database client is
left holding a SHARED lock on DMS2 and, possibly, an EXCLUSIVE lock on one
of the RWCLIENT locks. These locks are held for as long as the database
connection remains open.

<p>
When disconnecting from a database (i.e. an lsm_close() call following a

successful lsm_open()):






<pre>
  lock(DMS1, EXCLUSIVE)           # Block until successful
    trylock(DMS2, EXCLUSIVE)
    if( trylock() successful ){
      flush in-memory tree
      checkpoint database

      trylock(DMS3, EXCLUSIVE)
      if( trylock() successful ){
        delete shared memory
      }

      trylock(ROTRANS, EXCLUSIVE)
      if( trylock() successful ){
        unlink log file
      }
    }
    unlock(RWCLIENT(x))           # If holding RWCLIENT lock
    unlock(DMS2)
  unlock(DMS1)
</pre>

<h2 id=read-only_clients>3.2. Read-only clients</h2>

<p>It is assumed that read-only clients:

<ul> 




  <li>may take SHARED locks only,
  <li>may not write to shared-memory, the database file or the log file, and
  <li>may not use the trylock(x, EXCLUSIVE) primitive to detect SHARED locks
      held by other clients. 
</ul>





<p>A read-only client does not attempt to connect to the database from within
the lsm_open() call. Instead, this is defered until the first time the client
attempts to read from the database.




<pre>
  lock(DMS1, SHARED)              # Block until successful
    trylock(RWCLIENT(all), SHARED)
    if( trylock() successful ){
      # System is not live. The database must be read directly from disk.
      lock(ROTRANS, SHARED)
    }else{
      # Client is now connected. The read transaction may now be opened
      # in the same way as it would be for a read-write client.
      lock(DMS3, SHARED)
    }


  unlock(DMS1)
</pre>

<p>
Assuming no error occurs, the procedure above leads to one of two possible
outcomes:

<ul> 
  <li> <p>DMS3 is locked and the read-only client is now connected to the 
       database. From this point on, the read-only client uses the same
       procedure to open a read transaction as any other client. The lock on 
       DMS3 is held until the client disconnects from the database. 

  <li> <p>ROTRANS is locked and the read-only client is still disconnected.
       Holding the lock on ROTRANS guarantees that no read-write client
       will overwrite any existing data in the database file or log file.
       This allows the read-only client to run a disconnected read 
       transaction - it recovers any data in the log file into private memory,
       then reads data as required from the database file for the duration
       of the users read transaction.
</ul>

<p>A disconnected read transaction is closed by dropping the ROTRANS lock.


<h1 id=data_structures>4. Data Structures </h1>

<p>
In the following sections, "the WRITER lock", refers to an exclusive lock
on the WRITER locking region. For example "holding the WRITER lock" is
equivalent to "holding an exclusive lock on the WRITER locking region".
Similar interpretations apply to "the WORKER lock" and "the CHECKPOINTER 
lock".

<h2 id=database_file>4.1. Database file</h2>

<p>
This section summarizes the contents of the database file informally. A 
detailed description is found in the header comments for source code files 
<a href="../src/lsm_file.c">lsm_file.c</a> (blocks, pages etc.), 
<a href="../src/lsm_sorted.c">lsm_sorted.c</a> (sorted run format) and 
<a href="../src/lsm_ckpt.c">lsm_ckpt.c</a> (database snapshot format). 
................................................................................

<p>
As with an SQLite database file, each page in the database may be addressed 
by its 32-bit page number. This means the maximum database size is roughly
(pgsz * 2^32) bytes. The first and last pages in each block are 4 bytes
smaller than the others. This is to make room for a single page-number.

<h3 id=sorted_runs>4.1.1. Sorted Runs</h3>

<p>
A single sorted run is spread across one or more database pages (each page
is a part of at most one sorted run). Given the page number of a page in a
sorted run the following statements are true:

<ul>
................................................................................
In other words, given the page numbers of the first and last pages of a 
sorted run and the page number of the root page for the embedded b-tree,
it is possible to traverse the entire run in either direction or query for
arbitrary values.

<p><span style="color:red"> TODO: Embedded pointers.  </span>

<h3 id=levels>4.1.2. Levels</h3>

<p>
Each sorted run is assigned to a "level". Normally, a level consists of a
single sorted run. However, a level may also consist of a set of sorted runs
being incrementally merged into a single run.

<p>
................................................................................
time for all entries.






<h3 id=snapshots>4.1.3. Snapshots</h3>

<p>
Each meta page may contain a database <b>snapshot</b>. A snapshot contains all 
the information required to interpret the remainder of the database file (the
sorted runs and free space). Specifically, it contains:

<ul>
................................................................................
       Recovery and Shutdown" below).
</ul>

<p>
A more detailed description is available in the header comments in
source code file <a href="../src/lsm_ckpt.c">lsm_ckpt.c</a>

<h2 id=in-memory_tree>4.2. In-Memory Tree</h2>

<p>
The in-memory tree is an append-only b-tree of order 4 (each node may have 
up to 4 children), which is more or less equivalent to a red-black tree. 
An append-only tree is convenient, as it naturally supports the 
single-writer/many-readers MVCC concurrency model. 

<p>
The implementation includes some optimizations to reduce the number of 
interior nodes that are updated when a leaf node is written that are not 
described here. See header comments in source code file 
<a href=../src/lsm_tree.c>lsm_tree.c</a> for details.

<h3 id=memory_allocation>4.2.1. Memory Allocation</h3>

<p>
More than one in-memory 
tree may exist in shared-memory at any time. For example in the following 
scenario:

<ol>
................................................................................
but the values that connect the linked list together are not. The writer 
that detects the failure must scan the entire shared-memory region to 
reconstruct the linked list. Any sequence ids assigned by the failed 
writer are reverted (perhaps not to their original values, but to values
that put them at the start of the linked list - before those chunks that
may still be in use by existing readers).

<h3 id=header_fields>4.2.2. Header Fields</h3>
<p>
As well as the in-memory tree data, the following fixed-size fields 
stored in well-known locations in shared-memory are part of the in-memory
tree. Like the in-memory tree data, outside of recovery these fields are only
ever written to by clients holding the WRITER lock.

<ul>
................................................................................
       transaction and cleared after that transaction is successfully 
       concluded - the "writer flag". This is used to detect failures that
       occur mid-transaction. It is only ever read (or written) by clients
       that hold the WRITER lock.
</ul>


<h2 id=other_shared-memory_fields>4.3. Other Shared-Memory Fields</h2>

<ul>
  <li> Snapshot 1.
  <li> Snapshot 2.
  <li> The meta-page pointer. This value is either 1 or 2. It indicates which
       of the two meta-pages contains the most recent database snapshot.
  <li> READER lock values.
</ul>

<h2 id=log_file>4.4. Log file</h2>

<a href=../src/lsm_log.c>lsm_log.c</a>.





























































<h1 id=database_operations>5. Database Operations </h1>

<h2 id=reading>5.1. Reading</h2>

<p>
Opening a read transaction:

<ol>
  <li> <p>Load the current tree-header from shared-memory.

................................................................................
Once a read transaction is opened, the reader may continue to read the 
versions of the in-memory tree and database file for as long as the READER
lock is held.
<p>
To close a read transaction all that is required is to drop the SHARED lock
held on the READER slot. 

<h2 id=writing>5.2. Writing</h2>

<p>
To open a write transaction:

<ol>
  <li> <p>Open a read transaction, if one is not already open.
  <li> <p>Obtain the WRITER lock.
................................................................................

  <li> Sweep the shared-memory area to rebuild the linked list of chunks so
       that it is consistent with the current tree-header.

  <li> Clear the writer flag.
</ol>

<h3 id=flushing_the_in-memory_tree_to_disk>5.2.1. Flushing the in-memory tree to disk</h3>

<p>
For the purposes of writing, the database file and the in-memory tree are
largely independent. Processes holding the WRITER lock write to the in-memory
tree, and processes holding the WORKER lock write to the database file.

<ol>
................................................................................

  <li> Update the private copy of the tree-header to reflect a new, empty tree.

  <li> Commit the write transaction, writing the new, empty tree to
       shared-memory.
</ol>

<h3 id=shared-memory_management>5.2.2. Shared-memory management</h3>

<p>
A writer client may have to allocate new shared-memory chunks. This can be
done either by extending the shared-memory region or by recycling the first
chunk in the linked-list. To check if the first chunk in the linked-list may
be reused, the writer must check that:

................................................................................

  <li> The chunk is not part of an in-memory tree being used by an existing
       reader. A writer checks this by scanning (and possibly updating) the
       values associated with the READER locks - similar to the way SQLite 
       does in WAL mode.
</ul>

<h3 id=log_file_management>5.2.3. Log file management</h3>

<p>
A writer client also writes to the log file. All information required to write
to the log file  (the offset to write to and the initial checksum values) is
embedded in the tree-header. Except, in order to reuse log file space (wrap
around to the start of the log file), a writer needs to know that the space
being recycled will not be required by any recovery process in the future.
................................................................................
<p>
To determine whether or not the log file can be wrapped, the writer requires
access to information stored in the newest snapshot written into the database
header. Their exists a shared-memory variable indicating which of the two
meta-pages contain this snapshot, but the writer process still has to read the
snapshot data and verify its checksum from disk.

<h2 id=working>5.3. Working</h2>

<p>
Working is similar to writing. The difference is that a "writer" modifies
the in-memory tree. A "worker" modifies the contents of the database file.

<ol>
  <li> <p>Take the WORKER lock.
................................................................................
  <li> <p>Invoke xShmBarrier().

  <li> <p>Update snapshot-1 in shared-memory.

  <li> <p>Release the WORKER lock.
</ol>

<h3 id=free-block_list_management>5.3.1. Free-block list management</h3>

<p>
Worker clients occasionally need to allocate new database blocks or move
existing blocks to the free-block list. Along with the block number of each
free block, the free-block list contains the snapshot-id of the first 
snapshot created after the block was moved to the free list. The free-block
list is always stored in order of snapshot-id, so that the first block in
................................................................................
       header. This is done by reading (and verifying the checksum) of the
       snapshot currently stored in the database meta-page indicated by the
       shared-memory variable.
</ul>



<h2 id=checkpoint_operations>5.4. Checkpoint Operations</h2>

<ol>
  <li> Take CHECKPOINTER lock.

  <li> Load snapshot-1 from shared-memory. If the checksum does not match
       the content here, release the CHECKPOINTER lock and abandon the 
       attempt to checkpoint the database.
................................................................................
  <li> Sync the database file again.

  <li> Update the shared-memory variable to indicate the meta-page written in
       step 5.

  <li> Drop the CHECKPOINTER lock.
</ol>