
Overview
Comment: Retire shm.wiki. Its contents are now in lsm.wiki.
SHA1: 011519e33f21b2d76f6a93cf15ca45d58763a5e8
User & Date: dan 2012-09-10 15:08:48.628
Context
2012-09-10
18:39
Change the way version races between the in-memory tree and lsm are handled in lsm.wiki. check-in: e12aa10a7c user: dan tags: trunk
15:08
Retire shm.wiki. Its contents are now in lsm.wiki. check-in: 011519e33f user: dan tags: trunk
14:45
Updates to lsm.wiki. check-in: cde7d1dcb0 user: dan tags: trunk
Changes
Changes to www/lsm.wiki.
@@ -356,13 +356,14 @@
 <h2>Other Shared-Memory Fields</h2>
 
 <ul>
   <li> Snapshot 1.
   <li> Snapshot 2.
-  <li> The meta-page pointer.
+  <li> The meta-page pointer. This value is either 1 or 2. It indicates which
+       of the two meta-pages contains the most recent database snapshot.
 
   <li> READER lock values.
 </ul>
 
 <h2>Log file</h2>
 
 <h1>3. Database Recovery and Shutdown</h1>

Deleted www/shm.wiki.

<title>Multi-process LSM Notes</title>
<nowiki>

<p>
Notes on the changes required for LSM to allow connections from 
multiple processes. In other words, notes to do with the contents
of the *-shm file and the way they are accessed and manipulated.


<h2>Contents of shared memory</h2>

<p>
Like SQLite 3 WAL mode, LSM uses a *-shm file. It uses the same
"dead man switch" mechanism to ensure it is always initialized to 
zero when the first client connects.

<p>
The *-shm file contains:

<ol>
  <li> A flag indicating whether or not the *-shm has been initialized
       (log file recovered into in-memory tree, header fields loaded etc.)
  <li> The meta-page number to which a checkpoint was last successfully
       written.
  <li> The client snapshot.
  <li> The worker snapshot.
  <li> The in-memory tree. This takes up most of the space in the file.
</ol>

<p>
The client and worker snapshots are in the same format as those stored
in the header of the database file itself.

<p>
Sometimes data from the meta-page identified by the header field is
required. For example it is necessary to know the id of the last
checkpointed snapshot in order to determine which free blocks are safe
to reuse. The associated log file offset is also required to determine
when the log file may be wrapped. These quantities are read directly
from the meta-page in the database itself as required.

<h2>File locks</h2>

<p>
Lsm uses the same ideas as SQLite in WAL mode. Both SHARED and EXCLUSIVE 
locks are required. There are three exclusive locks:

<ul>
  <li> WRITER: Required to write to in-memory tree and its log file.
  <li> WORKER: Required to write to body of database file.
  <li> CHECKPOINTER: Required to write to database file header.
</ul>

<p>
Only one client may hold each of these locks at a time. Each of the
above is implemented as an exclusive lock on a dedicated range of bytes
in the *-shm file.

<p>
There are also N separate locks held by readers. These locks also 
work like WAL locks in that they are a combination of a lock and a
value. In WAL mode the value is a 32-bit integer. For LSM, it will
be two 64-bit integers - an in-memory tree id and a snapshot id.

<h2>Memory allocation</h2>

<p>
Within the *-shm file, memory is allocated in 32KB chunks.

<p>
The first chunk of the file is the header chunk. It contains:

<ol>
  <li> The client snapshot (4KB)
  <li> The worker snapshot (4KB)
  <li> The "initialized" flag (4 bytes)
  <li> The meta-page number containing the last checkpoint written (4
       bytes)
  <li> The in-memory tree headers (see below).
</ol>

<p>
The second and subsequent chunks are used to store the in-memory tree
data.

<p>
The in-memory tree structure is essentially an append-only rb-tree
with some modifications to reduce the amount of data written.
Multiple trees will sometimes be present in the file, to cope with
circumstances like the following:

<ul>
  <li> Writer builds tree A.
  <li> Reader takes a read lock on tree A.
  <li> Tree A is flushed to the db.
  <li> Writer begins building tree B.
  <li> Reader continues reading from tree A.
</ul>

<p>
In this case, the chunks used by tree A may not be reused until after
the active read transaction has concluded.

<p>
Each chunk begins with three 32-bit integer fields:
<ul>
  <li> Id of first tree for which data is stored on the chunk,
  <li> Id of last tree for which data is stored on the chunk,
  <li> Chunk number of chunk written after this one (or zero, if this
       is the most recently written chunk).
</ul>

<p>
The third field described above links all tree chunks in the file,
in-use or otherwise, into a single list. To allocate a new chunk,
a writer first checks if the chunk at the head of the list can be
recycled. If so, it moves it to the end of the list and begins
writing to it. Otherwise, it allocates a new chunk at the end of
the file, appends that to the list and continues writing.

<p><b>Crash recovery: But, what happens if a writer crashes while
writing a transaction to the database?</b>

<p>If a writer crashes during a write transaction, readers can 
often continue as normal, but the next writer must roll back any 
changes made to the db before it can commence a new transaction. 
If the writer failed while updating the in-memory tree header, it 
may not be possible for readers to continue either. This is resolved 
by having one reader become a writer, restore the db, then "commit" 
an empty transaction.

<p>
The pattern used by a writer is:
<ol>
  <li> Obtain WRITER lock. This is a barrier operation (on Linux, an
  fcntl(F_SETLK)).  
  <li> Update shared memory region.
  <li> Release WRITER lock. Another barrier (on Linux, another F_SETLK).
</ol>

<p> If a failure occurs during step 2, the unlock operation is performed
automatically by the OS. Either way, assume that the unlock is also a
barrier (see Documentation/memory-barriers.txt in the kernel source
tree). It can therefore be assumed that, from the point of view of the
subsequent writer, all writes to the shared memory region completed by
the failed writer appear to have been performed in order - there is no
need to worry that the hardware has reordered the writes made by the
failed writer. The compiler may reorder them, of course, but this should
be easy enough to avoid.

<p>
Also assumed is that 32-bit writes are atomic, in the sense that it
is not possible for a failure in a writer process to result in some
bits of a 32-bit word being updated and some remaining in their 
original state.

<p>
Crashes are then managed by the following:

<ul>
  <li>When a write transaction is opened, a flag is set in the in-memory
  tree header. This indicates that a transaction is underway. The same
  flag is cleared right before the WRITER lock is released to commit or
  roll back the transaction. 

  <li>When a recyclable chunk is moved from the start of the linked list
  to the end, the first thing done is that the "first tree" field is
  updated. Then the "last tree". Then the header pointer is set to point
  to the next element in the list.

  <li>If the header flag is already set when the writer grabs the WRITER
  lock, then a crash must have occurred. In this case the free-list must
  be recovered.

  <li>Recovering the free list involves two steps: First a linear scan
  of the current tree to identify those chunks in use (and also for
  another reason, see below). Second, a scan of the remainder of the
  file checking the "first tree" field of all chunks that either belong
  to an earlier tree or appear to belong to the current tree but are not
  linked in anywhere. Based on this, the new writer can rebuild the
  free-list.

</ul>


<h2>In-memory tree format</h2>

<p>
Header fields:

<ul>
  <li> 32-bits: Tree id (incremented for each new tree).
  <li> 32-bits: Transaction id (incremented for each new transaction).
  <li> 32-bits: Pointer to head of tree (an offset within the *-shm
       file).
  <li> 32-bits: Height of tree.
  <li> 64-bits: Last checkpoint id for which log file space has already
                been reclaimed.
  <li> DbLog structure (see lsmInt.h).
  <li> 32-bits: Header checksum 1.
  <li> 32-bits: Header checksum 2.
</ul>

<p>
There are two copies of the in-memory tree header, copy 1 and copy 2,
both stored in the *-shm header chunk.

<p>
To commit a transaction, a writer does the following:

<ol>
  <li> Updates copy 2 of the header,
  <li> Invokes a memory barrier,
  <li> Updates copy 1 of the header,
  <li> Clears the "transaction in progress flag",
  <li> Drops the WRITER lock.
</ol>

<p>
To open a read transaction, the reader:

<ol>
  <li> Reads copy 1 of the header.

  <li> If the checksum fails, attempt to obtain the WRITER lock. If
       successful, do the equivalent of opening and committing an
       empty transaction (see below). Either way, return to 1 and
       attempt to reread the in-memory tree header. If copy 1 cannot be
       read within some reasonable amount of time...?

  <li> Read the client snapshot from shared memory. If the checksum
       fails, attempt to obtain the WORKER lock. If successful, copy
       the worker snapshot over the client snapshot and drop the WORKER
       lock. Successful or otherwise, attempt to reread the snapshot.
       If this cannot be completed within some reasonable amount of
       time...?

  <li> Grab a read-lock corresponding to the tree id and snapshot ids
       just read (note: assume that this is a memory barrier).

  <li> Check that the shared memory tree header and client snapshot
       still contain the ids for which the lock was obtained. If not, 
       drop the lock and go back to step 1.
</ol>

<p>To open a write transaction, the writer:

<ol>
  <li> Opens a read transaction, if one is not already open.

  <li> Obtain the WRITER lock.

  <li> Check the "transaction in progress" flag. If it is set,
       perform the emergency rollback and freelist recovery, then
       clear the flag.

  <li> Check that copy 1 of the header still matches the copy read
       when the read transaction was opened. If not, drop the lock
       and return LSM_BUSY.

  <li> Set the "transaction in progress" flag.
</ol>

<p>
Emergency rollback and recovery:
<ol>
  <li> If the checksum of copy 1 of the header fails, replace it with
       the contents of copy 2.

  <li> Iterate through the entire tree, rolling back any nodes with
       transaction ids that indicate they require it. Record the blocks
       occupied by the current tree.

  <li> Scan through the entire *-shm memory file, inspecting the "first
       tree" fields of each chunk.
</ol>

<p>
Large keys or values may overflow chunks.

<h2>Client and worker snapshots</h2>

<p>
The client and worker snapshots stored in the *-shm file use the
same format as the checkpoint written to the database file, except
that they are always in native byte order. Each is stored in a dedicated
4KB slot, as in the database file. A client must hold the WORKER
lock to modify either of the two snapshots.

<p>
To work on the database file, a worker performs the following:
<ol>
  <li> Obtain the WORKER lock.

  <li> Copies the worker snapshot from the shared-memory region into
       heap memory and verifies that the checksum computes.

  <li> If the checksum of the worker snapshot does not compute, copy
       the client snapshot over the top of the worker and reload it.
       If the checksum still does not compute, return LSM_CORRUPT.

  <li> Perform some merging work on the database. Generate a new
       worker snapshot. Write it over the top of the old.

  <li> Optionally, copy the new worker snapshot over the top of the
       client snapshot. TODO: Copying the worker snapshot into the
       client slot makes the worker read-only. Currently, LSM
       distinguishes between read-only and read-write worker snapshots,
       but that would mean an extra flag in shared memory. Perhaps it is
       better to consider all worker snapshots to be read-only, or to
       change the format slightly to include a "read-write" flag that
       can be set for those snapshots not copied into the client slot. 
       UPDATE: Current code already treats all worker snapshots as read-only.

  <li> Release the WORKER lock.
</ol>

<p>
To checkpoint a snapshot:
<ol>
    <li> Obtain the CHECKPOINTER lock.
    <li> Read the client snapshot.
    <li> Sync the database file.
    <li> Write the client snapshot into the appropriate meta-page (based
         on the "last checkpoint slot" field in the *-shm header).
    <li> Sync the database file.
    <li> Update the "last checkpoint slot" field.
    <li> Drop the CHECKPOINTER lock.
</ol>