SQLite4
Check-in [33eca2e1f4]
Not logged in

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Simplifications and clarifications to lsmusr.wiki.
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 33eca2e1f4379d3030c278f53e6cc67c3d39f7c0
User & Date: dan 2013-02-01 19:49:42
Context
2013-02-02
16:45
Fix LSM single-process mode so that it holds an exclusive lock on the database file - preventing connections from within external processes. check-in: d6bd08ca0e user: dan tags: trunk
2013-02-01
19:49
Simplifications and clarifications to lsmusr.wiki. check-in: 33eca2e1f4 user: dan tags: trunk
2013-01-31
05:58
Add the definition of sqlite4_stricmp() to sqlite.h.in. Avoid multiple declarations the u8 and similar typedefs in the amalgmation. check-in: d966049dd6 user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to www/lsmusr.wiki.

3
4
5
6
7
8
9





10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
..
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
..
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
...
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
...
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
...
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782



783
784
785
















































































































786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
...
804
805
806
807
808
809
810













































811
812
813
814
815
816
817
...
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837























838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934

935
936



937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104

1105
1106
1107
1108
1109
1110
1111
1112





1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157


1158
1159
1160

1161
1162
1163
1164



1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175


1176
1177
1178
1179
1180
1181
1182
1183
1184
....
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
....
1263
1264
1265
1266
1267
1268
1269
1270
1271






1272










1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283

1284
1285
1286
1287
1288
1289
1290
1291
1292





1293
1294
1295
1296
1297
1298
1299
1300
1301
1302

1303
1304
1305
1306


1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
....
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370

1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388


<nowiki>

<h2>Table of Contents</h2>










<div id=start_of_toc></div>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#introduction_to_lsm style=text-decoration:none>1. Introduction to LSM</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#using_lsm_in_applications style=text-decoration:none>2. Using LSM in Applications </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#basic_usage style=text-decoration:none>3. Basic Usage</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#opening_and_closing_database_connections style=text-decoration:none>3.1. Opening and Closing Database Connections </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#writing_to_a_database style=text-decoration:none>3.2. Writing to a Database </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#reading_from_a_database style=text-decoration:none>3.3. Reading from a Database </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#database_transactions_and_mvcc style=text-decoration:none>3.4. Database Transactions and MVCC </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#data_durability style=text-decoration:none>4. Data Durability </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#compressed_and_encrypted_databases style=text-decoration:none>5. Compressed and Encrypted Databases </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#performance_tuning style=text-decoration:none>6. Performance Tuning</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#performance_related_configuration_options style=text-decoration:none>6.1. Performance Related Configuration Options </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#using_worker_threads_or_processes style=text-decoration:none>6.2. Using Worker Threads or Processes </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#architectural_overview style=text-decoration:none>6.2.1. Architectural Overview </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#automatic_work_and_checkpoint_scheduling style=text-decoration:none>6.2.2. Automatic Work and Checkpoint Scheduling</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#explicit_work_and_checkpoint_scheduling style=text-decoration:none>6.2.3. Explicit Work and Checkpoint Scheduling</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#compulsary_work_and_checkpoint_scheduling style=text-decoration:none>6.2.4. Compulsary Work and Checkpoint Scheduling</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#database_file_optimization style=text-decoration:none>6.3. Database File Optimization</a><br>

<div id=end_of_toc></div>

<h2>Overview</h2>

<p>This document describes the LSM embedded database library and use thereof. 
It is intended to be part user-manual and part tutorial. It is intended to
complement the <a href=lsmapi.wiki>LSM API reference manual</a>.

<p>The <a href=#introduction_to_lsm>first section</a> of this document contains
a description of the LSM library and its features. 
<a href=#using_lsm_in_applications>Section 2</a> describes how to use LSM from
within a C or C++ application (how to compile and link LSM, what to #include
etc.). The <a href=#basic_usage>third section</a> describes the essential APIs
that applications use to open and close database connections, and to read from
................................................................................

<p>The three sections described above contain all the information required to
create applications that use LSM. The remaining sections discuss more
specialized topics. <a href=#data_durability>Section 4</a> discusses the
configuration parameter that influences transaction durability (the guarantees
offered with respect to recently committed transactions if a power failure 
occurs). <a href=#compressed_and_encrypted_databases>Section 5</a> explains
the interface provided by LSM that allow external data compression and/or
encryption functions to be used to create compressed and/or encrypted
databases. <a href=#performance_tuning>Section 6</a> deals with the topic
of performance tuning.

<h1 id=introduction_to_lsm>1. Introduction to LSM</h1>

<p>LSM is an embedded database library for key-value data, roughly similar
in scope to
<a href="http://www.oracle.com/technetwork/products/berkeleydb/overview/index.html">Berkeley DB</a>, 
<a href="http://code.google.com/p/leveldb/">LevelDB</a> or
................................................................................
  <li> Iterating through a range of database keys (either forwards or
       backwards).
</ul>

<p>Other salient features are:

<ul>
  <li><p>LSM supports a <b>single-writer/multiple-reader MVCC</b> based
      transactional concurrency model. SQL style nested sub-transactions are
      supported. Clients may concurrently access a single LSM database from
      within a single or multiple application processes. 

  <li><p>An entire LSM database is stored in a <b>single file on disk</b>. 

  <li><p>Data <b>durability in the face of application or power failure</b>.
      LSM may optionally use a write-ahead log file when writing to the
      database to ensure committed transactions are not lost if an application
      or power failure occurs.

  <li>LSM <b>may be configured to use external data compression and/or
      encryption routines</b> to create and access compressed and/or encrypted
      databases.
</ul>
    

<p>Many database systems that support range queries, including <a
href=http://www.sqlite.org>SQLite 3</a>, Berkeley DB and Kyoto Cabinet, are
based on one of many variants of the 
<a href="http://en.wikipedia.org/wiki/B-tree">b-tree data structure</a>.
B-trees are attractive because a b-tree structure minimizes the number of disk
sectors that must be read from disk when searching the database for a specific
key. However, b-tree implementations usually suffer from poor write
localization - updating the contents of a b-tree often involves modifying the
contents of nodes scattered throughout the database file. If the database is
stored on a spinning disk (HDD), then the disk heads must be moved before
writing non-contiguous sector, which is extremely slow. If the database is
stored on solid state storage (SDD) a similar phenomena is encountered due to
the large erase-block sizes. In general, writing to a series of contiguous disk
sectors is orders of magnitude faster than updating to the same number of disk
sectors scattered randomly throughout a large file. Additionally, b-tree
structures are prone to fragmentation, reducing the speed of range queries.

<p><i>Todo: Should have references for the claims above.</i>

<p><i>Also, fix the link in the next paragraph to point to a description
of the log-structured-merge tree within lsm.wiki (or its successor).</i>

<p>LSM uses a <a href=lsm.wiki>different data structure</a> that makes the
following performance tradeoffs relative to a b-tree:

<ul>
  <li> A very large percentage of the disk sectors modified when writing to
       the database are contiguous.
       Additionally, in many cases the total number of sectors written
       to disk is reduced. This makes writing to an LSM database much
       faster than the equivalent b-tree.

  <li> LSM databases do not suffer from fragmentation to the same degree
       as b-trees. This means that the performance of large range queries 
       does not degrade as the database is updated as it may with a b-tree.

  <li> It is accepted that under some circumstances searching an LSM 
       database for a given key will involve examining more disk sectors
       than it would with a b-tree. In terms of disk sectors accessed when
       searching a database of size N, both b-trees and LSM provide O(log(N))
       efficiency, but the base of the logarithm is generally larger for a
       b-tree than for LSM.
</ul>

<p>In other words, writing to an LSM database should be very fast and scanning
through large ranges of keys should also perform well, but searching the
database for specific keys may be slightly slower than when using a b-tree
based system. Additionally, avoiding random writes in favour of largely
contiguous updates (as LSM does) can significantly reduce the wear on SSD or
................................................................................
a pointer to a <a href=lsmapi.wiki#lsm_env>database environment object</a>
or NULL. Almost all applications should pass NULL. A database environment
object allows the application to supply custom implementations of the various
operating system calls that LSM uses to read and write files, allocate heap 
memory, and coordinate between multiple application threads and processes.
This is normally only required if LSM is being used on a platform that is not
supported by default. Passing NULL instructs the library to use the default
implementations of all these things, which is usually the right thing to do.
The second argument to lsm_new() is an output variable. Assuming the call
is successful, *pDb is set to point to the new database handle before 
returning.

<p>The first argument passed to lsm_open() must be an existing database 
handle. The second is the name of the database file to connect to. Once
lsm_open() has been successfully called on a database handle, it can not be
called again on the same handle. Attempting to do so is an LSM_MISUSE error.

<p>For example, to create a new handle and connect it to database "test.db"
................................................................................

<verbatim>
  rc = lsm_close(db);
</verbatim>

<p>It is important that lsm_close() is called to close all database handles
created with lsm_new(), particularly if the connection has written to the
database. If an application that has written to a database exits without
closing its database connection, then subsequent clients may have to run
"database recovery" when they open the database, making the lsm_open() call
less responsive. Additionally, not matching each successful lsm_new() call 
with a call to lsm_close() is a resource leak.

<p>Counter-intuitively, an lsm_close() call may fail. In this case the database
handle is not closed, so if the application exits it invites the "database
recovery" performance problem mentioned above. The usual reason for an
lsm_close() call failing is that the database handle has been used to create
<a href=lsmapi.wiki#lsm_csr_open>database cursors</a> that have not been 
closed. Unless all database cursors are closed before lsm_close() is called,
................................................................................
</i>


<h1 id=performance_tuning>6. Performance Tuning</h1>

<p> This section describes the various measures that can be taken in order to
fine-tune LSM in order to improve performance in specific circumstances.
Sub-section 6.1 identifies the 
<a href=#performance_related_configuration_options> configuration
parameters</a> that can be used to influence database performance. 
Sub-section 6.2 discusses methods for shifting the time-consuming processes of
actually writing and syncing the database file to 
<a href=#using_worker_threads_or_processes>background threads or processes</a> 
in order to make writing to the database more responsive. Finally, 6.
3 introduces "<a href=#database_file_optimization>database optimization</a>"



- the process of reorganizing a database file internally so that it is as small
as possible and optimized for search queries.

















































































































<h2 id=performance_related_configuration_options>6.1. Performance Related Configuration Options </h2>

<p>The options in this section all take integer values. They may be both
set and queried using the <a href=lsmapi.wiki#lsm_config>lsm_config()</a>
function. To set an option to a value, lsm_config() is used as follows:

<verbatim>
  /* Set the LSM_CONFIG_AUTOFLUSH option to 1MB */
  int iVal = 1 * 1024 * 1024;
  rc = lsm_config(db, LSM_CONFIG_AUTOFLUSH, &iVal);
</verbatim>

<p>In order to query the current value of an option, the initial value of
the parameter (iVal in the example code above) should be set to a negative
value. Or any other value that happens to be out of range for the parameter -
negative values just happen to be out of range for all integer lsm_config()
................................................................................
<verbatim>
  /* Set iVal to the current value of LSM_CONFIG_AUTOFLUSH */
  int iVal = -1;
  rc = lsm_config(db, LSM_CONFIG_AUTOFLUSH, &iVal);
</verbatim>

<dl>













































  <dt> <a href=lsmapi.wiki#LSM_CONFIG_MMAP>LSM_CONFIG_MMAP</a>
  <dd> <p style=margin-top:0>
    If LSM is running on a system with a 64-bit address space, this option
    may be set to either 1 (true) or 0 (false). On a 32-bit platform, it is
    always set to 0.
    <p> If it is set to true, the entire database file is memory mapped. Or, if
    it is false, data is accessed using ordinary OS file read and write
................................................................................
    examined. 
    <p>This option can only be set before lsm_open() is called on the database
    connection.
    <p>The default value is 1 (true) on a 64-bit platform, and 0 otherwise.

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_MULTIPLE_PROCESSES>LSM_CONFIG_MULTIPLE_PROCESSES</a>
  <dd> <p style=margin-top:0>
    This option may also be set to either 1 (true) or 0 (false). If it is
    set to 0, then the library assumes that all database clients are located 
    within the same process (have access to the same memory space). Assuming
    this means the library can avoid using OS file locking primitives to lock
    the database file, which speeds up opening and closing read and write
    transactions. 

    <p>This option can only be set before lsm_open() is called on the database
    connection.
























    <p>The default value is 1 (true).

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_USE_LOG>LSM_CONFIG_USE_LOG</a>
  <dd> <p style=margin-top:0>
    This is another option may also be set to either 1 (true) or 0 (false). 
    If it is set to false, then the library does not write data into the
    database log file. This makes writing faster, but also means that if
    an application crash or power failure occurs, it is very likely that
    any recently committed transactions will be lost.

    <p>If this option is set to true, then an application crash cannot cause
    data loss. Whether or not data loss may occur in the event of a power
    failure depends on the value of the <a href=#data_durability>
    LSM_CONFIG_SAFETY</a> parameter.

    <p>This option can only be set if the connection does not currently have
    an open write transaction.

    <p>The default value is 1 (true).

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_AUTOFLUSH>LSM_CONFIG_AUTOFLUSH</a>
  <dd> <p style=margin-top:0>
    When a client writes to an LSM database, changes are buffered in memory 
    before being written out to the database file in a batch. This option
    is set to the size of the buffer in bytes. The default value is 1048576.
    Increasing this value may improve overall write throughput. Decreasing
    it reduces memory usage.
    

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_AUTOCHECKPOINT>LSM_CONFIG_AUTOCHECKPOINT</a>
  <dd> <p style=margin-top:0>
    This option determines how often the database is checkpointed (synced to
    disk). A checkpoint is performed after each N bytes (approximately) are
    written to the database file, where N is the value of this option. The
    default value is 2097152 (2MB).

    <p>
    Increasing this value (say to 4MB or even 8MB) may improve overall write
    throughput. However, it is important not to checkpoint too infrequently, 
    as:
    <ul>
      <li> <p>
           Space in the log file may only be reused following a checkpoint.
           Once a checkpoint has been completed, all application data written
           before the checkpoint is safely stored in the database file. This
           means the contents of the log file are no longer required, and so
           the next database writer may start writing a new log into the
           start of the existing file (overwriting the old data). 
           As well as consuming disk space, large log files are undesirable 
           as often the entire log file must be read during database recovery
           following an application or system failure.
           
      <li> <p>
           Similarly, space in the database file freed by the continuous
           incremental reorganization of the database file that LSM performs
           cannot be reused until after a checkpoint has been performed. 
           So increasing the value of this parameter causes the space used
           by the pre-reorganization versions of data structures within the
           file to be recycled more slowly. Increasing the overall size of
           database files under some circumstances.
    </ul>
</dl>

<p>The roles of the LSM_CONFIG_AUTOFLUSH and LSM_CONFIG_AUTOCHECKPOINT options 
are discussed in more detail in the following section.

<h2 id=using_worker_threads_or_processes>6.2. Using Worker Threads or Processes </h2>

<p>Usually, the database file is updated on disk from within calls to
write functions - lsm_insert(), lsm_delete(), lsm_delete_range() and
lsm_commit(). This means that occasionally, a call to one of these four 
functions may take signicantly longer than usual as it pauses to write or
sync the database file. <a href=#automatic_work_and_checkpoint_scheduling>See
below</a> for details.

<p>The alternative to updating the database file from within calls to write
functions is to use one or more background threads or processes to perform the
actual work of writing to the database file. 

<ul>
  <li> Have all client connections set the LSM_CONFIG_AUTOWORK option to 0.
       This stops them from writing to the database file.

  <li> Arrange for a background thread or process to connect to the database
       and call the <a href=lsmapi.wiki#lsm_work>lsm_work()</a> function 
       periodically.
       below</a>).
</ul>

<p>Further explanation of, and example code for, the above is
<a href=#explicit_work_and_checkpoint_scheduling>available below</a>.

<p>The following sub-sections provide a high-level description of the
LSM database architecture and descriptions of how various parameters 
affect the way clients write to the database file. While this may seem
in some ways an inconvenient amount of detail, it may also be helpful
when diagnosing performance problems or analyzing benchmark results.


<h3 id=architectural_overview>6.2.1. Architectural Overview </h3>




<p> The LSM library implements two separate data structures that are used 
together to store user data. When the database is queried, the library 
actually runs parallel queries on both of these data stores and merges the
results together to return to the user. The data structures are:

<ul>
  <li> The <b>in-memory tree</b>. The in-memory tree is an append-only b-tree
       variant designed to be stored entirely in main-memory (i.e. not 
       written out to disk). The library strives to limit the total size of 
       the in-memory tree (by default to 1MB in total).

    <p>At any one time, there may actually be two in-memory tree structures
       present in memory. One immutable tree marked as "old" waiting to be
       written into the database file (see below) and one "live" tree to 
       which new data may be appended.

  <li> <p>The <b>log-structured-merge tree</b> structure for which LSM is 
       named. This data structure is used to store data within the database 
       file on disk.  

       <p>The log-structured-merge tree is made up of a series of "segments".
       Each segment is an immutable tree structure stored (more or less)
       contiguously within the database file. When the database is queried, the
       library runs parallel queries on each of the segments in the database
       and merges the results to return to the user.

       <p>The only way to insert new data to the database is to add a new 
       segment. In order to prevent the number of segments from growing too
       large, two or more existing segments may be merged together into a
       single larger segment at any point. Deleting existing key-value pairs
       is accomplished by inserting "delete-keys" into the new segment.

       <p>The log-structured-merge tree structure is described in more detail 
       <i>link to lsm.wiki section here.</i>
</ul>

<p> When a database client writes a transaction to the database, the new
data is inserted into the "live" in-memory tree structure. At the same time, 
the new data is also appended to the log file on disk. The purpose of the log
file is to provide a persistent backup of any data that the system does not
consider to have been safely stored (see below) in the database file. If a
crash or power failure occurs, this data is recovered by reading the log 
file.
 
<p> Once sufficient data has accumulated within the "live" in-memory tree,
it is marked as "old" and a new live tree created. At any point thereafter,
the contents of the old in-memory tree may be used to populate a new segment
within the database file and then discarded. When this happens, the old
in-memory tree is said to have been "flushed to disk". If there is already an
old in-memory tree when the live tree is deemed to have accumulated enough data
to itself become an old tree, the existing old tree must first be flushed to
disk.

<p> The set of segments that make up the log-structured-merge tree at any time
and the order in which they should be queried is termed a "snapshot".  The
header of the database file contains a snapshot. A snapshot is also stored in
main memory. When set of segments that make up the log-structured-merge tree 
is modified, either by flushing an old in-memory tree to disk or by merging 
two or more existing segments, the in-memory snapshot is updated immediately. 
This is the snapshot that database clients use when querying or otherwise
operating on the database.

<p> At any point after the in-memory snapshot has been updated, the in-memory
snapshot may be written into the database file header. This is known as
"checkpointing" the database. Depending on the value of the 
<a href=lsmapi.wiki#LSM_CONFIG_SAFETY>LSM_CONFIG_SAFETY</a> parameter, it may
be necessary to ensure that all segments referenced by the snapshot have been
synced to disk (safely stored on the persistent media such that they will not
be lost if a power failure occurs) before doing so. It is not necessary for
every version of the in-memory snapshot to be checkpointed. The in-memory
snapshot may be modified multiple times between checkpoints.

<p>
Because a checkpointer process is often required to sync the database file
before updating the database header, "checkpointing" often appears to be the
costliest part of transfering data to the database file, at least in terms of
wall-clock time.

<p> Regular database checkpoints are required to ensure that unused space
within the log file and database file can be reused in a timely fashion.
Specifically:

<ul>
  <li> <p>Space within the log file cannot be recycled until the corresponding
       data has been written into a database segment and a checkpoint 
       performed.

  <li> <p>When two or more existing segments are merged within the database
       file, database clients may start using the new, larger, segment
       immediately.  However the space occupied by the original segments may
       not be reused until after a snapshot that refers to the new segment, and
       not the old ones, has been checkpointed.
</ul>

<p>In other words, without checkpoints the system will function, but both the
log and database files will grow indefinitely as the database is modified 
(even if the size of the dataset remains constant). Additionally, if a crash
or power failure occurs, the next client to open the database file has to
process all data written to the log file since the most recent checkpoint. If
checkpoints are performed infrequently, this can be a time consuming exercise.

<p>In order to safely write data into the in-memory tree (by calling 
lsm_insert, lsm_delete or lsm_delete_range), the database client must hold
the database WRITER lock. At most one client may concurrently hold the WRITER 
lock. Holding the WRITER lock is synonymous with having an open write
transaction - the client obtains the WRITER lock when the transaction is 
opened and relinquishes it when the transaction is closed.

<p>As well as the WRITER lock, there are two other locks that may be held by
at most one client at any time - the WORKER and CHECKPOINTER locks. The roles
of the three locks are roughly as follows:

<table valign=top>
<tr><td valign=top>WRITER<td style="width:3ex"><td>
The WRITER lock is required to modify the contents of the in-memory tree.
Including marking an in-memory tree as "old" and starting a new live tree.
It is also required to write to the log file.

<tr><td valign=top>WORKER<td><td>
The WORKER lock is required to write to segments within the database file.
Either when merging two or more existing segments within the database, or
when flushing an in-memory tree to disk to create a new segment.
The WORKER lock is also required to update the database snapshot stored in
main memory (updated so that new clients will use the new segments the worker
creates).

<tr><td valign=top>CHECKPOINTER<td><td>
The CHECKPOINTER lock is required to update the snapshot stored in the 
database file header (to checkpoint the database). 
</table>

<p>The tasks associated with each of the locks above may be performed
concurrently by multiple database connections, located either in the same
application process or different processes.

<h3 id=automatic_work_and_checkpoint_scheduling>6.2.2. Automatic Work and Checkpoint Scheduling</h3>

<p>By default, database "work" (the flushing and merging of segments, performed
by clients holding the WORKER lock) and checkpointing are scheduled and
performed automatically from within calls to "write" API functions. The 
"write" functions are:

<ul>
  <li>lsm_insert()
  <li>lsm_delete()
  <li>lsm_delete_range()
  <li>lsm_commit()
</ul>

<p>It is expected that automatic work and checkpoint scheduling will be
suitable for most applications. The advantage of this model is that it is
simple to use. However, any call to one of the functions listed above may be
co-opted by the system to perform database work or a checkpoint, causing it to
return more slowly than it otherwise would. In some situations, for example
when a writer thread is also responsible for handling user-interface events,
this may be undesirable.

<p>Automatic work and checkpoint scheduling is controlled by four integer
parameters set or queried using the lsm_config() interface. The parameters
are:

<dl>
  <dt> LSM_CONFIG_AUTOWORK
  <dd> <p style=margin-top:0>
       This is a boolean parameter (default 1). If set, auto-work mode is
       enabled for the database connection. Otherwise, it is not.


  <dt> LSM_CONFIG_AUTOFLUSH
  <dd> <p style=margin-top:0>
       An integer parameter (default 1048576). If this parameter is set
       to a non-zero value, then after each transaction is committed, the
       library checks the total size of the live in-memory tree. If it is
       equal to or greater than the configured value of this parameter in
       bytes and there is no existing old in-memory tree, then the current 
       live tree is marked as old.






       <p>Additionally, if the LSM_CONFIG_AUTOWORK parameter is set, the 
       contents of the in-memory tree are immediately flushed to disk.

  <dt> LSM_CONFIG_AUTOMERGE
  <dd> <p style=margin-top:0>
       This parameter must be set to an integer value between 2 and 8,
       inclusive. It controls the number of existing segments that 
       auto-work attempts to merge together at a time. The default value is 4.

  <dt> LSM_CONFIG_AUTOCHECKPOINT
  <dd> <p style=margin-top:0>
       If this parameter is set to an integer value greater than 0, then
       a checkpoint is automatically attempted after this many bytes are
       written into the database file. The default value is 2097152.
</dl>

<p>Each segment in the database file is assigned an "age" - an integer zero
or greater indicating how many times the data in the segment has been merged.
A segment created by flushing the in-memory tree to disk is assigned an age
of 1. When two or more segments with age=1 are merged together to create a
larger segment, it is assigned an age of 2. And so on.

<p>If auto-work is enabled, the library periodically checks the state of the
database file to see if there exist N or more segments with the same age
value A, where N is the value assigned to the LSM_CONFIG_AUTOMERGE parameter.
If so, work is done to merge all such segments with age=A into a new, larger
segment assigned age=A+1. At present, "periodically" as used above means 
roughly once for every 32KB of data (including overhead) written to the
in-memory tree. The merge operation is not necessarily completed within 
a single call to a write API (this would result in blocking the writer thread
for too long in many cases - in large databases segments may grow to be many GB
in size). Currently, the amount of data written by a single auto-work
operation is roughly 32KB multiplied by the number of segments in the database
file. This formula may change - the point is that the library attempts to limit
the amount of data written in order to avoid blocking the writer thread for too
long within a single API call.

<p>Each time a transaction is committed in auto-work mode, the library checks
to see if there exists an "old" in-memory tree (see the LSM_CONFIG_AUTOFLUSH
option above). If so, it attempts to flush it to disk immediately. Unlike
merges of existing segments, the entire in-memory tree must be flushed to
disk before control is returned to the user. It is not possible to
incrementally flush an in-memory tree in the same ways as it is possible to
incrementally merge existing database segments together.



<p>In order to perform auto-work on the database file, either to flush the
contents of an old in-memory tree to disk or to merge existing segments within

the database file together, the client must obtain the WORKER lock. If some
other client is already holding the WORKER lock, this will not be possible.
This is not an error. If this occurs, the writer thread simply returns
immediately, without performing any work on the database file.




<p>Assuming the LSM_CONFIG_AUTOCHECKPOINT parameter is set to a value greater
than zero, after performing database work, the library automatically checks
how many bytes of raw data have been written to the database file since the
last checkpoint (by any client, not just by the current client). If this
value is greater than the value of the LSM_CONFIG_AUTOCHECKPOINT parameter,
a checkpoint is attempted. It is not an error if the attempt fails because the
CHECKPOINTER lock cannot be obtained.

<h3 id=explicit_work_and_checkpoint_scheduling>6.2.3. Explicit Work and Checkpoint Scheduling</h3>



<p>The alternative to automatic scheduling of work and checkpoint operations
is to explicitly schedule them. Possibly in a background thread or dedicated
application process. In order to disable automatic work, a client must set
the LSM_CONFIG_AUTOWORK parameter to zero. This parameter is a property of
a database connection, not of a database itself, so it must be cleared
separately by all processes that may write to the database. Otherwise, they
may attempt automatic database work or checkpoints.

<verbatim>
................................................................................
  int iVal = 0;
  lsm_config(db, LSM_CONFIG_AUTOWORK, &iVal);
</verbatim>

<p>The lsm_work() function is used to explicitly perform work on the database:

<verbatim>
  int lsm_work(lsm_db *db, int nMerge, int nByte, int *pnWrite);
</verbatim>

<p>Parameter nByte is passed a limit on the number of bytes of data that 
should be written to the database file before the call returns. It is a 
hint only, the library does not honor this limit strictly.

<p>If the database has an old in-memory tree when lsm_work() is called, it is
flushed to disk. If this means that more than nByte bytes of data is written
to the database file, no further work is performed. Otherwise, the number
of bytes written is subtracted from nByte before proceeding.

<p>If parameter nMerge is greater than 1, then the library searches for 
nMerge or more segments of the same age within the database file and performs
up to nByte bytes of work to merge them together. If the merge is completed
before the nByte limit is exceeded, the library searches for another set of
nMerge or more segments to work on, and so on. If at any point no such set of
nMerge segments can be found, the call returns without performing any 
further work.

<p>Calling lsm_work() with the nMerge argument set to 1 is used to "optimize"
the database (see below). Passing a value of zero or less for the nMerge
parameter is an error.

<p>In any case, before returning the value of *pnWrite is set to the actual
number of bytes written to the database file.

<p>The example code below might be executed in a background thread or process
in order to perform database work and checkpointing. In this case all other
clients should set the LSM_CONFIG_AUTOWORK parameter to zero.

<verbatim>
  int rc;
................................................................................
    ** time, things may have changed (the other process may have relinquished
    ** the WORKER lock, or an in-memory tree may have been marked as old).
    */
    if( nWrite==0 ) sleep(1);
  }
</verbatim>

<p>Checkpoints can also be requested explicitly, using the lsm_checkpoint()
API:

















<verbatim>
  int lsm_checkpoint(lsm_db *db, int *pnCkpt);
</verbatim>

<p>If no work has been performed on the database since the most recent
checkpoint (implying that the snapshot has not changed and there is no need
to write it into the database file), lsm_checkpoint() sets *pnCkpt to zero
and returns immediately. Otherwise, it checkpoints the database and sets
*pnCkpt to the number of bytes written to the database file since the
previous checkpoint.


<p>The number of bytes written to the database since the most recent checkpoint
can also be using the <a href=lsmapi.wiki#lsm_info>lsm_info()</a> API 
function. As follows:

<verbatim>
  int nCkpt;
  rc = lsm_info(db, LSM_INFO_CHECKPOINT_SIZE, &nCkpt);
</verbatim>






<verbatim>
  int nOld, nLive;
  rc = lsm_info(db, LSM_INFO_TREE_SIZE, &nOld, &nLive);
</verbatim>

<h3 id=compulsary_work_and_checkpoint_scheduling>6.2.4. Compulsary Work and Checkpoint Scheduling</h3>

<p>Apart from the scenarios described above, there are two scenarios where
database work or checkpointing may be performed automatically, regardless of
the value of the LSM_CONFIG_AUTOWORK parameter.


<ul>
  <li> When closing a database connection, and 
  <li> When the number of segments with a common age in the database file grows


       unacceptably high.
</ul>

<p>Whenever an lsm_close() call would mean that the total number of 
connections to a database drops to zero, the connection checks if the 
in-memory tree is empty. If not, it is flushed to disk. Both the live and 
old in-memory trees are flushed to disk in this case. It also checks if the
database file has been modified since the most recent checkpoint was 
performed. If so, it also performs a checkpoint. And, assuming no error
has occurred, deletes the log file.

<p>Additionally, whenever a worker wishes to flush an in-memory tree to a new
age=1 segment, it must first ensure that there are less than N existing age=1
segments, where N is the value that the LSM_CONFIG_AUTOMERGE parameter is
set to. If there are already N or more age=1 segments, they must be merged
into an age=2 segment before a new age=1 segment can be created within the
database file. Similar rules apply to segments of other ages - it is not
possible to create a new age=I segment if there are already N segments with
age=I in the database file. This has two implications:

<ul>
  <li> The database is prevented from accumulating too many segments,
       regardless of whether or not auto-work is enabled or how infrequently
       lsm_work() is called, and

  <li> If auto-work is disabled and lsm_work() is not called frequently enough,
       it is possible that flushing an in-memory tree may required writing a
       tremendous amount of data to disk (possibly even rewriting the entire
       database file).
</ul>

<p>Finally, regardless of age, a database is limited to a maximum of 64
segments in total. If an attempt is made to flush an in-memory tree to disk
when the database already contains 64 segments, two or more existing segments
must be merged together before the new segment can be created.

<h2 id=database_file_optimization>6.3. Database File Optimization</h2>

<p>Database optimization transforms the contents of database file so that
the following are true:

<ul>
  <li> <p>All database content is stored in a single 
       <a href=#architectural_overview>segment</a>. This makes the
................................................................................
       to be visted when searching the database.

  <li> <p>The database file contains no (or as little as possible) free space.
       In other words, it is no larger than required to contain the single
       segment.
</ul>

<p><i> Should we add a convenience function lsm_optimize() that does not 
return until the database is completely optimized? One that more or less does
the same as the example code below and deals with the AUTOCHECKPOINT issue?
This would help with this user manual if nothing else, as it means a method
for database optimization can be presented without depending on the previous
section.

</i>

<p>In order to optimize the database, lsm_work() should be called repeatedly
with the nMerge argument set to 1 until it returns without writing any data

to the database file. For example:

<verbatim>
  int nWrite;
  int rc;
  do {
    rc = lsm_work(db, 1, 2*1024*1024, &nWrite);
  }while( rc==LSM_OK && nWrite>0 );
</verbatim>

<p>When optimizing the database as above, either the LSM_CONFIG_AUTOCHECKPOINT
parameter should be set to a non-zero value or lsm_checkpoint() should be
called periodically. Otherwise, no checkpoints will be performed, preventing
the library from reusing any space occupied by old segments even after their
content has been merged into the new segment. The result - a database file that
is optimized, except that it is up to twice as large as it otherwise would be.











>
>
>
>
>












|
|
|
|
|
|
|





|
|
|







 







|

|
|







 







|
|
|
|

|






|
|
|












|
|






<
<
|









|
|





|
|
|
|
|
<







 







<
|
|
|







 







|

|
|
|







 







|
|
|
|
|
|
|
|
>
>
>



>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
|

|




|
|







 







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







 







|
|
|
|
|
|




>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
|



|
|
|
|
|









<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<


<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
|

<
<
<
<
<
>

<
>
>
>

<
<
<
<
<

<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<






<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
>
|
<
<
<
<
<
<
<
>
>
>
>
>

<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
<
|
|
|
|
|

|
|


|

|
|
|
|
|
|
|
|

<
<
<
<
<
<
<
>
>

<
<
>
|
<
<
<
>
>
>

<
<
<
<
<
<
<

<

>
>

|







 







|


|




|

|



|
|









|







 







|
<
>
>
>
>
>
>

>
>
>
>
>
>
>
>
>
>

|


|
<
|
|
|
<

>
|
|
<






>
>
>
>
>





|

<
|
|
>



|
>
>








|
|


|
|
|


|
|

|
|
|
<
<
<
|
|
<
<






|







 







<
<
<
<
<
<
<
<
<
|
|
>
|


<
<
<
|
<










>
>
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
..
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
..
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120


121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142

143
144
145
146
147
148
149
...
190
191
192
193
194
195
196

197
198
199
200
201
202
203
204
205
206
...
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
...
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
...
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
...
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039











































1040
1041



























1042
1043





1044
1045

1046
1047
1048
1049





1050









































































































































1051
1052
1053
1054
1055
1056


















1057
1058







1059
1060
1061
1062
1063
1064
















1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085







1086
1087
1088


1089
1090



1091
1092
1093
1094







1095

1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
....
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
....
1186
1187
1188
1189
1190
1191
1192
1193

1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215

1216
1217
1218

1219
1220
1221
1222

1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240

1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272



1273
1274


1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
....
1291
1292
1293
1294
1295
1296
1297









1298
1299
1300
1301
1302
1303



1304

1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
<nowiki>

<h2>Table of Contents</h2>










<div id=start_of_toc></div>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#introduction_to_lsm style=text-decoration:none>1. Introduction to LSM</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#using_lsm_in_applications style=text-decoration:none>2. Using LSM in Applications </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#basic_usage style=text-decoration:none>3. Basic Usage</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#opening_and_closing_database_connections style=text-decoration:none>3.1. Opening and Closing Database Connections </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#writing_to_a_database style=text-decoration:none>3.2. Writing to a Database </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#reading_from_a_database style=text-decoration:none>3.3. Reading from a Database </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#database_transactions_and_mvcc style=text-decoration:none>3.4. Database Transactions and MVCC </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#data_durability style=text-decoration:none>4. Data Durability </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#compressed_and_encrypted_databases style=text-decoration:none>5. Compressed and Encrypted Databases </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#performance_tuning style=text-decoration:none>6. Performance Tuning</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#overview_of_lsm_architecture style=text-decoration:none>6.1. Overview of LSM Architecture</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#performance_related_configuration_options style=text-decoration:none>6.2. Performance Related Configuration Options </a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#work_and_checkpoint_scheduling style=text-decoration:none>6.3. Work and Checkpoint Scheduling</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#automatic_scheduling style=text-decoration:none>6.3.1. Automatic Scheduling</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#explicit_scheduling style=text-decoration:none>6.3.2. Explicit Scheduling</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#compulsary_work_and_checkpoints style=text-decoration:none>6.3.3. Compulsary Work and Checkpoints</a><br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<a href=#database_file_optimization style=text-decoration:none>6.4. Database File Optimization</a><br>

<div id=end_of_toc></div>

<h2>Overview</h2>

<p>This document describes the LSM embedded database library and use thereof.
It is part user-manual and part tutorial. It is intended to complement the 
<a href=lsmapi.wiki>LSM API reference manual</a>.

<p>The <a href=#introduction_to_lsm>first section</a> of this document contains
a description of the LSM library and its features. 
<a href=#using_lsm_in_applications>Section 2</a> describes how to use LSM from
within a C or C++ application (how to compile and link LSM, what to #include
etc.). The <a href=#basic_usage>third section</a> describes the essential APIs
that applications use to open and close database connections, and to read from
................................................................................

<p>The three sections described above contain all the information required to
create applications that use LSM. The remaining sections discuss more
specialized topics. <a href=#data_durability>Section 4</a> discusses the
configuration parameter that influences transaction durability (the guarantees
offered with respect to recently committed transactions if a power failure 
occurs). <a href=#compressed_and_encrypted_databases>Section 5</a> explains
the interface provided by LSM that allows external data compression and/or
encryption functions to be used to create compressed and/or encrypted
databases. <a href=#performance_tuning>Section 6</a> deals with performance 
tuning.

<h1 id=introduction_to_lsm>1. Introduction to LSM</h1>

<p>LSM is an embedded database library for key-value data, roughly similar
in scope to
<a href="http://www.oracle.com/technetwork/products/berkeleydb/overview/index.html">Berkeley DB</a>, 
<a href="http://code.google.com/p/leveldb/">LevelDB</a> or
................................................................................
  <li> Iterating through a range of database keys (either forwards or
       backwards).
</ul>

<p>Other salient features are:

<ul>
  <li><p>A <b>single-writer/multiple-reader MVCC</b> based transactional
      concurrency model. SQL style nested sub-transactions are supported. 
      Clients may concurrently access a single LSM database from within a 
      single process or multiple application processes. 

  <li><p>An entire database is stored in a <b>single file on disk</b>. 

  <li><p>Data <b>durability in the face of application or power failure</b>.
      LSM may optionally use a write-ahead log file when writing to the
      database to ensure committed transactions are not lost if an application
      or power failure occurs.

  <li>An API that <b>allows external data compression and/or encryption 
      routines to be used </b> to create and access compressed and/or 
      encrypted databases.
</ul>
    

<p>Many database systems that support range queries, including <a
href=http://www.sqlite.org>SQLite 3</a>, Berkeley DB and Kyoto Cabinet, are
based on one of many variants of the 
<a href="http://en.wikipedia.org/wiki/B-tree">b-tree data structure</a>.
B-trees are attractive because a b-tree structure minimizes the number of disk
sectors that must be read from disk when searching the database for a specific
key. However, b-tree implementations usually suffer from poor write
localization - updating the contents of a b-tree often involves modifying the
contents of nodes scattered throughout the database file. If the database is
stored on a spinning disk (HDD), then the disk heads must be moved between
writing non-contiguous sectors, which is extremely slow. If the database is
stored on solid state storage (SDD) a similar phenomena is encountered due to
the large erase-block sizes. In general, writing to a series of contiguous disk
sectors is orders of magnitude faster than updating to the same number of disk
sectors scattered randomly throughout a large file. Additionally, b-tree
structures are prone to fragmentation, reducing the speed of range queries.



<p><i>TODO: fix the link in the next paragraph to point to a description
of the log-structured-merge tree within lsm.wiki (or its successor).</i>

<p>LSM uses a <a href=lsm.wiki>different data structure</a> that makes the
following performance tradeoffs relative to a b-tree:

<ul>
  <li> A very large percentage of the disk sectors modified when writing to
       the database are contiguous.
       Additionally, in many cases the total number of sectors written
       to disk is reduced. This makes writing to an LSM database faster 
       than the equivalent b-tree.

  <li> LSM databases do not suffer from fragmentation to the same degree
       as b-trees. This means that the performance of large range queries 
       does not degrade as the database is updated as it may with a b-tree.

  <li> Under some circumstances searching an LSM database for a given key will
       involve examining more disk sectors than it would with a b-tree. In
       terms of disk sectors accessed when searching a database of size N, 
       both b-trees and LSM provide O(log(N)) efficiency, but the base of 
       the logarithm is generally larger for a b-tree than for LSM.

</ul>

<p>In other words, writing to an LSM database should be very fast and scanning
through large ranges of keys should also perform well, but searching the
database for specific keys may be slightly slower than when using a b-tree
based system. Additionally, avoiding random writes in favour of largely
contiguous updates (as LSM does) can significantly reduce the wear on SSD or
................................................................................
a pointer to a <a href=lsmapi.wiki#lsm_env>database environment object</a>
or NULL. Almost all applications should pass NULL. A database environment
object allows the application to supply custom implementations of the various
operating system calls that LSM uses to read and write files, allocate heap 
memory, and coordinate between multiple application threads and processes.
This is normally only required if LSM is being used on a platform that is not
supported by default. Passing NULL instructs the library to use the default

implementations of all these things. The second argument to lsm_new() is an
output variable. Assuming the call is successful, *pDb is set to point to the
new database handle before returning.

<p>The first argument passed to lsm_open() must be an existing database 
handle. The second is the name of the database file to connect to. Once
lsm_open() has been successfully called on a database handle, it can not be
called again on the same handle. Attempting to do so is an LSM_MISUSE error.

<p>For example, to create a new handle and connect it to database "test.db"
................................................................................

<verbatim>
  rc = lsm_close(db);
</verbatim>

<p>It is important that lsm_close() is called to close all database handles
created with lsm_new(), particularly if the connection has written to the
database. If an application writes to the database and then exits without
closing its database connection, then subsequent clients may have to run
"database recovery" when they open the database, slowing down the lsm_open() 
call. Additionally, not matching each successful lsm_new() call with a call 
to lsm_close() is a resource leak.

<p>Counter-intuitively, an lsm_close() call may fail. In this case the database
handle is not closed, so if the application exits it invites the "database
recovery" performance problem mentioned above. The usual reason for an
lsm_close() call failing is that the database handle has been used to create
<a href=lsmapi.wiki#lsm_csr_open>database cursors</a> that have not been 
closed. Unless all database cursors are closed before lsm_close() is called,
................................................................................
</i>


<h1 id=performance_tuning>6. Performance Tuning</h1>

<p> This section describes the various measures that can be taken in order to
fine-tune LSM in order to improve performance in specific circumstances.
Sub-section 6.1 contains a high-level overview of the 
<a href=#overview_of_lsm_architecture>system architecture</a>
intended to help in understanding the various performance tradeoffs and
optimizations available to LSM applications. Sub-section 6.2 identifies the 
<a href=#performance_related_configuration_options> configuration parameters</a>
that can be used to influence database performance. 
Sub-section 6.3 discusses options and methods for scheduling the time-consuming
processes of actually 
<a href=#work_and_checkpoint_scheduling>writing and syncing the database file
</a>to disk. Finally, 6.4 introduces "<a
href=#database_file_optimization>database optimization</a>"
- the process of reorganizing a database file internally so that it is as small
as possible and optimized for search queries.

<h2 id=overview_of_lsm_architecture>6.1. Overview of LSM Architecture</h2>

<p>The following steps describe the journey taken by data written to the 
   database from the application to the database file on disk:

<ol>
  <li> <p>
       When an application writes to an LSM database, the new data is first
       written to a log file and stored in an in-memory tree structure. The
       log file is used for database recovery - if an application crash or
       power failure occurs and the contents of the in-memory tree are lost,
       data is recovered by reading the log file.

  <li> <p>
       Once sufficient data has been accumulated in an in-memory tree (by
       default "sufficient data" means 1MB, including data structure 
       overhead), it is marked as "old" and a new "live" in-memory tree 
       created. An old in-memory tree is immutable - new data is always
       inserted into the live tree. There may be at most one old tree
       in memory at any time.

  <li> <p>
       The contents of an old in-memory tree may be written into the 
       database file at any point. Once its contents have been written (or
       "flushed") to the database file, the in-memory tree may be discarded.
       Flushing an in-memory tree to the database file creates a new database
       "segment". A database segment is an immutable b-tree structure stored
       within the database file. A single database file may contain up to 64 
       segments.

  <li> <p>
       At any point, two or more existing segments within the database may
       be merged together into a single segment. Once their contents has
       been merged into the new segment, the original segments may be 
       discarded.

  <li> <p>
       After the set of segments in a database file has been modified (either
       by flushing an in-memory tree to disk or by merging existing segments
       together), the changes may be made persistent by "checkpointing" the 
       database. Checkpointing involves syncing the contents of the database
       file to disk and updating the database file header.
</ol>

<p>Steps 3 and 4 above are known as "working" on the database. Step 5 is
refered to as "checkpointing". By default, database connections perform work
and checkpoint operations periodically from within calls to API functions
<code>lsm_insert</code>, <code>lsm_delete</code>, <code>lsm_delete_range</code>
and <code>lsm_commit</code> (i.e. functions that write to the database).
Alternatively, work and checkpoint operations may be performed on demand using
the <code>lsm_work</code> and <code>lsm_checkpoint</code> APIs. By opening a
second database connection, these operations may be moved to a background
thread or process. 

<p>Optimizing database write throughput and responsiveness is done by
configuring and scheduling work and checkpoint operations, and by configuring 
a few other parameters, as described in the following two sections.

<p>The speed of database read operations is largely determined by the number
of segments in the database file. So optimizing read operations is also linked 
to the configuring and scheduling of database write operations, as these
policies determine the number of segments that are present in the database
file at any time.

<p>Any data written to the database file since the last checkpoint may be
lost if a power or application failure occurs. As a result of this, regular
database checkpoints are required to ensure that unused space within the log
file and database file can be reused in a timely fashion. Specifically:

<ul>
  <li><p>Space within the log file cannot be recycled until the corresponding
         data has been written into a database segment and a checkpoint
         performed.

  <li><p>When two or more existing segments are merged into a new segment
         within the database file, the space occupied by the original segments
         may not be recycled until after a checkpoint has been performed.
</ul>

<p>
In other words, without checkpoints the system will function, but both the log and database files will grow indefinitely as the database is modified (even if the size of the dataset remains constant). Additionally, if a crash or power failure occurs, the next client to open the database file has to process all data written to the log file since the most recent checkpoint. If checkpoints are performed infrequently, this can be a time consuming exercise. 

<p>If a connection attempts to open a write transaction on the database when
another connection already has an open write transaction, the attempt fails
and LSM_BUSY is returned to the caller. This is because to write to the
database, the connection must first obtain the WRITER lock - and at most 
one connection may simultaneously hold the WRITER lock. As well as the WRITER
lock, there are two other exclusive locks that may be obtained on the database
- the WORKER and CHECKPOINTER locks. These are used, not surprisingly, to
ensure that at most one connection attempts to work on or checkpoint the
database at a time. More specifically, the roles of the three locks are:

<table valign=top>
<tr><td valign=top>WRITER<td style="width:3ex"><td>
The WRITER lock is required to modify the contents of the in-memory tree.
Including marking an in-memory tree as "old" and starting a new live tree.
It is also required to write to the log file.

<tr><td valign=top>WORKER<td><td>
The WORKER lock is required to write to segments within the database file.
Either when merging two or more existing segments within the database, or
when flushing an in-memory tree to disk to create a new segment.

<tr><td valign=top>CHECKPOINTER<td><td>
The CHECKPOINTER lock is required to write to the database file header.
</table>

<p>The three locks are independent. It is possible to simultaneously have one
client writing to the database, another working on the database file and a
third performing a checkpoint operation.


<h2 id=performance_related_configuration_options>6.2. Performance Related Configuration Options </h2>

<p>The options in this section are all set to integer values. They may be 
set and queried using the <a href=lsmapi.wiki#lsm_config>lsm_config()</a>
function. To set an option to a value, lsm_config() is used as follows:

<verbatim>
  /* Set the LSM_CONFIG_AUTOFLUSH option to 1MB (1024 KB) */
  int iVal = 1024;
  rc = lsm_config(db, LSM_CONFIG_AUTOFLUSH, &iVal);
</verbatim>

<p>In order to query the current value of an option, the initial value of
the parameter (iVal in the example code above) should be set to a negative
value. Or any other value that happens to be out of range for the parameter -
negative values just happen to be out of range for all integer lsm_config()
................................................................................
<verbatim>
  /* Set iVal to the current value of LSM_CONFIG_AUTOFLUSH */
  int iVal = -1;
  rc = lsm_config(db, LSM_CONFIG_AUTOFLUSH, &iVal);
</verbatim>

<dl>
  <dt> <a href=lsmapi.wiki#LSM_CONFIG_AUTOCHECKPOINT>LSM_CONFIG_AUTOCHECKPOINT</a>
  <dd> <p style=margin-top:0>
    This option determines how often the database is checkpointed (synced to
    disk). A checkpoint is performed after N KB (approximately) have been
    written to the database file, where N is the value of this option. 
    Increasing this value (say to 4MB or even 8MB) may improve overall write
    throughput. 

    <p>The default value is 2048 (2MB).

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_AUTOFLUSH>LSM_CONFIG_AUTOFLUSH</a>
  <dd> <p style=margin-top:0>
    This option determines how much data in KB is allowed to accumulate in a
    live in-memory tree before it is marked as "old" (and made eligible to be
    flushed through to the database file). Increasing this value may improve 
    overall write throughput. Decreasing it reduces memory usage.

    <p>The default value is 1024 (1MB).

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_AUTOMERGE>LSM_CONFIG_AUTOMERGE</a>
  <dd> <p style=margin-top:0>
    <p>If auto-work (the LSM_CONFIG_AUTOWORK option below) is enabled, then 
    this option is set to the number of segments that the library attempts 
    to merge simultaneously. Increasing this value may reduce the total
    amount of data written to the database file. Decreasing it increases
    the amount of data written to the file, but also decreases the average
    number of segments present in the file, which can improve the performance
    of database read operations.

    <p><span style=color:red>If auto-work is not enabled...</span>

    <p>The default value is 4. This option must be set to a value between
    2 and 8, inclusive.

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_AUTOWORK>LSM_CONFIG_AUTOWORK</a>
  <dd> <p style=margin-top:0>
    <p>This option may be set to either 1 (true) or 0 (false). If it is set
    to true, then work and checkpoint operations are automatically scheduled
    within calls to lsm_insert(), lsm_delete(), lsm_delete_range() and
    lsm_commit(). Otherwise, if it is set to false, these operations must
    be explicitly invoked by the application. See <span style=color:red>some
    link here</span> for details.

    <p>The default value is 1.

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_MMAP>LSM_CONFIG_MMAP</a>
  <dd> <p style=margin-top:0>
    If LSM is running on a system with a 64-bit address space, this option
    may be set to either 1 (true) or 0 (false). On a 32-bit platform, it is
    always set to 0.
    <p> If it is set to true, the entire database file is memory mapped. Or, if
    it is false, data is accessed using ordinary OS file read and write
................................................................................
    examined. 
    <p>This option can only be set before lsm_open() is called on the database
    connection.
    <p>The default value is 1 (true) on a 64-bit platform, and 0 otherwise.

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_MULTIPLE_PROCESSES>LSM_CONFIG_MULTIPLE_PROCESSES</a>
  <dd> <p style=margin-top:0>
    This option may also be set to either 1 (true) or 0 (false).  The default
    value is 1 (true). If it is set to false, then the library assumes that all
    database clients are located within the same process (have access to the
    same memory space). Assuming this means the library can avoid using OS file
    locking primitives to lock the database file, which speeds up opening and
    closing read and write transactions. 

    <p>This option can only be set before lsm_open() is called on the database
    connection.

    <p>If this option is set to false and there is already a connection to the
    database from another process when lsm_open() is called, the lsm_open()
    call fails with error code LSM_BUSY. <span style=color:red>todo: It 
    doesn't actually do this yet. But it should...</span>

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_SAFETY>LSM_CONFIG_SAFETY</a>
  <dd> <p style=margin-top:0>
    The effect of this option on <a href=#data_durability>data durability</a>
    is described above.

    <p>From a performance point of view, this option determines how often the
    library pauses to wait for data written to the file-system to be stored
    on the persistent media (e.g. hard disk or solid-state memory). This is
    also known as "syncing" data to disk. Since this is orders of magnitude
    slower than simply copying data into operating system buffers, the value
    of this option has a large effect on write performance.
  
    <p>If LSM_CONFIG_SAFETY is set to 2 (FULL), then the library syncs the
    data written to the log file to disk whenever a transaction is committed.
    Or, if LSM_CONFIG_SAFETY is set to 1 (NORMAL), then data is only synced
    to disk when a checkpoint is performed (see above). Finally, if it is set
    to 0 (OFF), then no data is ever synced to disk.

    <p>The default value is 1 (NORMAL).

  <dt> <a href=lsmapi.wiki#LSM_CONFIG_USE_LOG>LSM_CONFIG_USE_LOG</a>
  <dd> <p style=margin-top:0>
    This is another option that may be set to either 1 (true) or 0 (false). 
    The default value is 1 (true). If it is set to false, then the library 
    does not write data into the database log file. This makes writing faster,
    but also means that if an application crash or power failure occurs, it is
    very likely that any recently committed transactions will be lost.

    <p>If this option is set to true, then an application crash cannot cause
    data loss. Whether or not data loss may occur in the event of a power
    failure depends on the value of the <a href=#data_durability>
    LSM_CONFIG_SAFETY</a> parameter.

    <p>This option can only be set if the connection does not currently have
    an open write transaction.












































</dl>




























<h2 id=work_and_checkpoint_scheduling>6.3. Work and Checkpoint Scheduling</h2>






<h3 id=automatic_scheduling>6.3.1. Automatic Scheduling</h3>


<p>This section describes how work and checkpoint operations are scheduled 
if the boolean LSM_CONFIG_AUTOWORK parameter is set to true. Automatic
work operations may occur within calls to any of the following API functions:






<ul>









































































































































  <li>lsm_insert()
  <li>lsm_delete()
  <li>lsm_delete_range()
  <li>lsm_commit()
</ul>



















<p>Each time a transaction is committed in auto-work mode, the library checks
to see if there exists an "old" in-memory tree (see the LSM_CONFIG_AUTOFLUSH







option above). If so, it attempts to flush it to disk immediately. Unlike
merges of existing segments, the entire in-memory tree must be flushed to disk
before control is returned to the user. It is not possible to incrementally
flush an in-memory tree in the same ways as it is possible to incrementally
merge existing database segments together. 

















<p>Each segment in the database file is assigned an "age" - an integer zero 
or greater indicating how many times the data in the segment has been merged. 
A segment created by flushing the in-memory tree to disk is assigned an age 
of 1. When two or more segments with age=1 are merged together to create a 
larger segment, it is assigned an age of 2. And so on. 

<p>Assuming auto-work is enabled, the library periodically checks the state of
the database file to see if there exist N or more segments with the same age
value A, where N is the value assigned to the LSM_CONFIG_AUTOMERGE parameter.
If so, work is done to merge all such segments with age=A into a new, larger
segment assigned age=A+1. At present, "periodically" as used above means
roughly once for every 32KB of data (including overhead) written to the
in-memory tree. The merge operation is not necessarily completed within a
single call to a write API (this would result in blocking the writer thread for
too long in many cases - in large databases segments may grow to be many GB in
size). Currently, the amount of data written by a single auto-work operation is
roughly 32KB multiplied by the number of segments in the database file. This
formula may change - the point is that the library attempts to limit the amount
of data written in order to avoid blocking the writer thread for too long
within a single API call. 








<p>Checkpoint operations are scheduled based on the value assigned to the
LSM_CONFIG_AUTOCHECKPOINT configuration parameter.



<p>In order to automatically perform work and checkpoint operations, the 
client must obtain the WORKER and CHECKPOINTER locks, respectively. If an



attempt to obtain either of these locks fails (because some other client is
already holding them), it is not an error, the scheduled work or checkpoint
is simply not performed.











<h3 id=explicit_scheduling>6.3.2. Explicit Scheduling</h3>

<p>The alternative to automatic scheduling of work and checkpoint operations
is to explicitly schedule them - possibly in a background thread or dedicated
application process. In order to disable automatic work, a client must set
the LSM_CONFIG_AUTOWORK parameter to zero. This parameter is a property of
a database connection, not of a database itself, so it must be cleared
separately by all processes that may write to the database. Otherwise, they
may attempt automatic database work or checkpoints.

<verbatim>
................................................................................
  int iVal = 0;
  lsm_config(db, LSM_CONFIG_AUTOWORK, &iVal);
</verbatim>

<p>The lsm_work() function is used to explicitly perform work on the database:

<verbatim>
  int lsm_work(lsm_db *db, int nMerge, int nKB, int *pnWrite);
</verbatim>

<p>Parameter nKB is passed a limit on the number of KB of data that 
should be written to the database file before the call returns. It is a 
hint only, the library does not honor this limit strictly.

<p>If the database has an old in-memory tree when lsm_work() is called, it is
flushed to disk. If this means that more than nKB KB of data is written
to the database file, no further work is performed. Otherwise, the number
of KB written is subtracted from nKB before proceeding.

<p>If parameter nMerge is greater than 1, then the library searches for 
nMerge or more segments of the same age within the database file and performs
up to nKB KB of work to merge them together. If the merge is completed
before the nKB limit is exceeded, the library searches for another set of
nMerge or more segments to work on, and so on. If at any point no such set of
nMerge segments can be found, the call returns without performing any 
further work.

<p>Calling lsm_work() with the nMerge argument set to 1 is used to "optimize"
the database (see below). Passing a value of zero or less for the nMerge
parameter is an error.

<p>In any case, before returning the value of *pnWrite is set to the actual
number of KB written to the database file.

<p>The example code below might be executed in a background thread or process
in order to perform database work and checkpointing. In this case all other
clients should set the LSM_CONFIG_AUTOWORK parameter to zero.

<verbatim>
  int rc;
................................................................................
    ** time, things may have changed (the other process may have relinquished
    ** the WORKER lock, or an in-memory tree may have been marked as old).
    */
    if( nWrite==0 ) sleep(1);
  }
</verbatim>

<p>The mechanism associated with the LSM_CONFIG_AUTOCHECKPOINT configuration

parameter applies to data written by both automatically scheduled work and
work performed by calls to the lsm_work() function. The amount of
uncheckpointed data that has been written into the database file is a property
of the database file, not a single connection, so checkpoints occur at the
configured interval even if multiple connections are used to work on the
database.

<p>Alternatively, checkpoint operations may be scheduled separately. If the
LSM_CONFIG_AUTOCHECKPOINT parameter is set to zero, then a connection never
performs a database checkpoint, regardless of how much data it or any other 
connection writes into the database file. As with LSM_CONFIG_AUTOWORK, this
parameter must be zeroed for all connections that may perform work on the
database. Otherwise, they may perform a checkpoint operation.

<p>The <a href=lsmapi.wiki#lsm_checkpoint>lsm_checkpoint()</a> API is used
to expicitly request a checkpoint.

<verbatim>
  int lsm_checkpoint(lsm_db *db, int *pnKB);
</verbatim>

<p>If no work has been performed on the database since the previous

checkpoint, lsm_checkpoint() sets *pnKB to zero and returns immediately. 
Otherwise, it checkpoints the database and sets *pnKB to the number of KB of
data written to the database file since the previous checkpoint.


<p>A database may be queried for the number of KB written to the database 
since the most recent checkpoint using the 
<a href=lsmapi.wiki#lsm_info>lsm_info()</a> API function. As follows:


<verbatim>
  int nCkpt;
  rc = lsm_info(db, LSM_INFO_CHECKPOINT_SIZE, &nCkpt);
</verbatim>

<p>It may also be queried for the size of the in-memory tree or trees. The
following block of code sets variable nLive to the size of the current 
live in-memory tree in KB, and nOld to the size of the old in-memory tree in 
KB (or 0 if there is no old in-memory tree).

<verbatim>
  int nOld, nLive;
  rc = lsm_info(db, LSM_INFO_TREE_SIZE, &nOld, &nLive);
</verbatim>

<h3 id=compulsary_work_and_checkpoints>6.3.3. Compulsary Work and Checkpoints</h3>


<p>There are three scenarios where database work or checkpointing may be
performed automatically, regardless of the value of the LSM_CONFIG_AUTOWORK
parameter.

<ul>
  <li> When closing a database connection, and 
  <li> When the number of segments with a common age in the database 
       file grows unacceptably high.
  <li> When the total number of segments in the database file grows 
       unacceptably high.
</ul>

<p>Whenever an lsm_close() call would mean that the total number of 
connections to a database drops to zero, the connection checks if the 
in-memory tree is empty. If not, it is flushed to disk. Both the live and 
old in-memory trees are flushed to disk in this case. It also checks if the
database file has been modified since the most recent checkpoint was 
performed. If so, it also performs a checkpoint. Finally, assuming no error
has occurred, it deletes the log file.

<p>Additionally, whenever a worker wishes to flush an in-memory tree to a new
age=1 segment, it must first ensure that there are less than (N+1) existing 
age=1 segments, where N is the value that the LSM_CONFIG_AUTOMERGE parameter is
set to. If there are already (N+1) or more age=1 segments, they must be merged
into an age=2 segment before a new age=1 segment can be created within the
database file. Similar rules apply to segments of other ages - it is not
possible to create a new age=I segment if there are already (N+1) segments 
with age=I in the database file. This has two implications:

<p>This scenario should never come about if all connections that write to the
database have auto-work enabled. It only occurs if auto-work is disabled and
the lsm_work() function is called too infrequently. In this case it is possible



that flushing an in-memory tree may require writing a tremendous amount of
data to disk (possibly even rewriting the entire database file).



<p>Finally, regardless of age, a database is limited to a maximum of 64
segments in total. If an attempt is made to flush an in-memory tree to disk
when the database already contains 64 segments, two or more existing segments
must be merged together before the new segment can be created.

<h2 id=database_file_optimization>6.4. Database File Optimization</h2>

<p>Database optimization transforms the contents of database file so that
the following are true:

<ul>
  <li> <p>All database content is stored in a single 
       <a href=#architectural_overview>segment</a>. This makes the
................................................................................
       to be visted when searching the database.

  <li> <p>The database file contains no (or as little as possible) free space.
       In other words, it is no larger than required to contain the single
       segment.
</ul>










<p>In order to optimize the database, lsm_work() should be called with 
the nMerge argument set to 1 and the third parameter set to a negative value
(interpreted as - keep working until there is no more work to do). For 
example:

<verbatim>



  rc = lsm_work(db, 1, -1, 0);

</verbatim>

<p>When optimizing the database as above, either the LSM_CONFIG_AUTOCHECKPOINT
parameter should be set to a non-zero value or lsm_checkpoint() should be
called periodically. Otherwise, no checkpoints will be performed, preventing
the library from reusing any space occupied by old segments even after their
content has been merged into the new segment. The result - a database file that
is optimized, except that it is up to twice as large as it otherwise would be.