SQLite4
Check-in [e12aa10a7c]

Overview
Comment: Change the way version races between the in-memory tree and lsm are handled in lsm.wiki.
SHA1: e12aa10a7c6c1b80d14f6237021eb4e459128eb3
User & Date: dan 2012-09-10 18:39:53
Context
2012-09-10
20:07 Change locks to use 32-bit shm-sequence-ids instead of 64-bit tree-ids. check-in: acf38e32c8 user: dan tags: trunk
18:39 Change the way version races between the in-memory tree and lsm are handled in lsm.wiki. check-in: e12aa10a7c user: dan tags: trunk
15:08 Retire shm.wiki. Its contents are now in lsm.wiki. check-in: 011519e33f user: dan tags: trunk
Changes

Changes to www/lsm.wiki.

Old version (lines 469-583):
would have to take the WRITER or WORKER lock as appropriate and "repair" the
data structures so that both copies were consistent. See below under "Writing"
and "Working" for details. This would mean we were only depending on the 64-bit
checksums for recovery following an unluckily timed application failure, not
for reader/writer collision detection.



  <li> <p>Check that the tree-header and snapshot are compatible.

       <p>Say a reader loads the tree-header for in-memory tree A, 
          version 49, in step 1. Then a writer writes a transaction to
          in-memory tree A (creating version 50) and flushes the whole tree
          to disk, updating the current snapshot in shared-memory, all before
          the reader gets to step 2 above and loads the current snapshot.
          If allowed to proceed, the reader may return bad data or otherwise
          malfunction.

       <p>To prevent this, when a new in-memory tree is created, the current
          snapshot-id (the one that contains all data from the previous
          in-memory tree) is stored as part of the tree-header. The new
          snapshot-id is stored twice inside the snapshot as well - the second
          time as the "data-id" field. Snapshots created by flushing an 
          in-memory tree to disk have the data-id field set to a copy of 
          their snapshot-id. Other snapshots (those created by workers merging
          two or more existing sorted runs together) have the data-id field
          set to a copy of the data-id field of the previous snapshot.
          A snapshot and tree-header are only compatible if the data-id field
          of the snapshot matches the snapshot-id stored in the tree-header.


  <li> <p>Take a READER lock. The snapshot-id field of the reader-lock field
          must be less than or equal to the snapshot-id of the snapshot read in
          step 2. The shm-sequence-id of the reader lock must be less than or
          equal to the sequence-id of the first shared-memory chunk used by the
          in-memory tree according to the tree-header read in step 1.

  <li> <p>Check that the tree-header still contains the same
          first-shm-sequence-id field as that read in step 1. 

  <li> <p>Check that the snapshot read in step 2 is still current.

       <p>Steps 5 and 6 are also similar. If the check in step 5 fails, there
          is a chance that the shared-memory chunks used to store the in-memory
          tree indicated by the tree-header read in step 1 had already been
          reused by a writer client before the lock in step 4 was obtained. And
          similarly if the check in step 6 fails, the database file blocks
          occupied by the sorted runs in the snapshot read in step 2 may have
          already been reused by new content. In either case, if the check
          fails, return to step 1 and start over.












</ol>

<p>
Once a read transaction is opened, the reader may continue to read the 
versions of the in-memory tree and database file for as long as the READER
lock is held.
<p>
................................................................................

<p>
If the writer flag is set after the WRITER lock is obtained, the new 
writer assumes that the previous writer must have failed mid-transaction.
In this case it performs the following:

<ol>

  <li> If the two tree-headers are not identical, copy one over the other.
       Prefer the data from a tree-header for which the checksum computes.
       Or, if they both compute, prefer tree-header-1.

  <li> Sweep the shared-memory area to rebuild the linked list of chunks so
       that it is consistent with the current tree-header.

  <li> Clear the writer flag.
</ol>























<h3>Shared-memory management</h3>

<p>
A writer client may have to allocate new shared-memory chunks. This can be
done either by extending the shared-memory region or by recycling the first
chunk in the linked-list. To check if the first chunk in the linked-list may







New version (lines 469-592):
would have to take the WRITER or WORKER lock as appropriate and "repair" the
data structures so that both copies were consistent. See below under "Writing"
and "Working" for details. This would mean we were only depending on the 64-bit
checksums for recovery following an unluckily timed application failure, not
for reader/writer collision detection.


























  <li> <p>Take a READER lock. The snapshot-id field of the reader-lock field
          must be less than or equal to the snapshot-id of the snapshot read in
          step 2. The shm-sequence-id of the reader lock must be less than or
          equal to the sequence-id of the first shared-memory chunk used by the
          in-memory tree according to the tree-header read in step 1.

  <li> <p>Check that the tree-header read in step 1 is still current.


  <li> <p>Check that the snapshot read in step 2 is still current.

       <p>Steps 5 and 6 are also similar. If the check in step 5 fails, there
          is a chance that the shared-memory chunks used to store the in-memory
          tree indicated by the tree-header read in step 1 had already been
          reused by a writer client before the lock in step 4 was obtained. And
          similarly if the check in step 6 fails, the database file blocks
          occupied by the sorted runs in the snapshot read in step 2 may have
          already been reused by new content. In either case, if the check
          fails, return to step 1 and start over.

       <p>Say a reader loads the tree-header for in-memory tree A, 
          version 49, in step 1. Then a writer writes a transaction to
          in-memory tree A (creating version 50) and flushes the whole tree
          to disk, updating the current snapshot in shared-memory, all before
          the reader gets to step 2 above and loads the current snapshot.
          If allowed to proceed, the reader may return bad data or otherwise
          malfunction.

       <p>Step 4 should prevent this. If the above occurs, step 4 should 
          detect that the tree-header loaded in step 1 is no longer current.

</ol>

<p>
Once a read transaction is opened, the reader may continue to read the 
versions of the in-memory tree and database file for as long as the READER
lock is held.
<p>
................................................................................

<p>
If the writer flag is set after the WRITER lock is obtained, the new 
writer assumes that the previous writer must have failed mid-transaction.
In this case it performs the following:

<ol>

  <li> If the two tree-headers are not identical, copy one over the other.
       Prefer the data from a tree-header for which the checksum computes.
       Or, if they both compute, prefer tree-header-1.

  <li> Sweep the shared-memory area to rebuild the linked list of chunks so
       that it is consistent with the current tree-header.

  <li> Clear the writer flag.
</ol>

<h3>Flushing the in-memory tree to disk</h3>

<p>
For the purposes of writing, the database file and the in-memory tree are
largely independent. Processes holding the WRITER lock write to the in-memory
tree, and processes holding the WORKER lock write to the database file.

<ol>
  <li> If a write transaction is open, write the new tree-header to
       shared-memory (as described above). Otherwise, if no write-transaction
       is open, open one - repairing the tree headers if required.

  <li> Execute a WORKER transaction (see "Working" below) to add a new level
       to the LSM. The new level contains the current contents of the 
       in-memory tree.

  <li> Update the private copy of the tree-header to reflect a new, empty tree.

  <li> Commit the write transaction, writing the new, empty tree to
       shared-memory.
</ol>

<h3>Shared-memory management</h3>

<p>
A writer client may have to allocate new shared-memory chunks. This can be
done either by extending the shared-memory region or by recycling the first
chunk in the linked-list. To check if the first chunk in the linked-list may