/ Check-in [09065976]
Login
SQLite training in Houston TX on 2019-11-05 (details)
Part of the 2019 Tcl Conference

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Fix issues and clarify the operation of pager_playback_one_page(). A block comment in pager.c identifies 13 invariants on the pager subsystem. Ticket [9d68c883132c8].
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 0906597698b697ab2993a460f257e326cb58e475
User & Date: drh 2010-04-10 17:52:58
References
2010-06-26
16:30 Ticket [d11f09d3] Cache corruption caused by ROLLBACK with synchronous=NORMAL status still Open with 1 other change artifact: 16a03c75 user: dan
16:17 Ticket [d11f09d3]: 3 changes artifact: 1f6fb18e user: dan
Context
2010-04-12
14:51
Reset the simulated device in the test harness to its default configuration whenever it is restarted. check-in: 562d20e6 user: drh tags: trunk
2010-04-10
17:52
Fix issues and clarify the operation of pager_playback_one_page(). A block comment in pager.c identifies 13 invariants on the pager subsystem. Ticket [9d68c883132c8]. check-in: 09065976 user: drh tags: trunk
2010-04-09
23:05
Add a test case for the OOM-fault corruption issue. Ticket [9d68c883132c8]. check-in: 0a64a937 user: drh tags: trunk
Changes
Hide Diffs Unified Diffs Ignore Whitespace Patch

Changes to src/pager.c.

17
18
19
20
21
22
23

















































































24
25
26
27
28
29
30
...
193
194
195
196
197
198
199
200

201
202
203
204
205
206
207
...
219
220
221
222
223
224
225

226


227
228
229
230
231
232
233
...
279
280
281
282
283
284
285

286
287
288
289
290
291
292
...
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
...
821
822
823
824
825
826
827

828
829
830
831
832
833
834
...
979
980
981
982
983
984
985

986
987
988
989
990
991
992
....
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426

1427
1428
1429
1430
1431
1432
1433
....
1503
1504
1505
1506
1507
1508
1509





1510
1511
1512
1513

1514
1515
1516
1517
1518
1519
1520
....
1558
1559
1560
1561
1562
1563
1564
1565

1566
1567
1568
1569
1570
1571
1572
....
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
....
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
....
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
....
2122
2123
2124
2125
2126
2127
2128
2129
2130
2131
2132
2133
2134
2135
2136
....
2152
2153
2154
2155
2156
2157
2158
2159
2160
2161
2162
2163
2164
2165
2166
....
2167
2168
2169
2170
2171
2172
2173
2174
2175
2176
2177
2178
2179
2180
2181
....
2724
2725
2726
2727
2728
2729
2730
2731
2732
2733
2734
2735
2736
2737
2738
....
2796
2797
2798
2799
2800
2801
2802

2803
2804
2805
2806
2807
2808
2809
....
3658
3659
3660
3661
3662
3663
3664
3665

3666
3667
3668
3669


3670
3671
3672
3673
3674



3675
3676







3677

3678
3679
3680
3681
3682
3683
3684
....
3776
3777
3778
3779
3780
3781
3782
3783
3784
3785
3786
3787
3788
3789
3790
....
4208
4209
4210
4211
4212
4213
4214


4215
4216
4217
4218
4219
4220
4221
** locking to prevent two processes from writing the same database
** file simultaneously, or one process from reading the database while
** another is writing.
*/
#ifndef SQLITE_OMIT_DISKIO
#include "sqliteInt.h"


















































































/*
** Macros for troubleshooting.  Normally turned off
*/
#if 0
int sqlite3PagerTrace=1;  /* True to enable tracing */
#define sqlite3DebugPrintf printf
#define PAGERTRACE(X)     if( sqlite3PagerTrace ){ sqlite3DebugPrintf X; }
................................................................................
**   It is cleared at the end of each transaction.
**
**   It is used when committing or otherwise ending a transaction. If
**   the dbModified flag is clear then less work has to be done.
**
** journalStarted
**
**   This flag is set whenever the the main journal is synced. 

**
**   The point of this flag is that it must be set after the 
**   first journal header in a journal file has been synced to disk.
**   After this has happened, new pages appended to the database 
**   do not need the PGHDR_NEED_SYNC flag set, as they do not need
**   to wait for a journal sync before they can be written out to
**   the database file (see function pager_write()).
................................................................................
**   may attempt to commit the transaction again later (calling
**   CommitPhaseOne() again). This flag is used to ensure that the 
**   master journal name is only written to the journal file the first
**   time CommitPhaseOne() is called.
**
** doNotSync
**

**   This variable is set and cleared by sqlite3PagerWrite().


**
** needSync
**
**   TODO: It might be easier to set this variable in writeJournalHdr()
**   and writeMasterJournal() only. Change its meaning to "unsynced data
**   has been written to the journal".
**
................................................................................
  u32 nSubRec;                /* Number of records written to sub-journal */
  Bitvec *pInJournal;         /* One bit for each page in the database file */
  sqlite3_file *fd;           /* File descriptor for database */
  sqlite3_file *jfd;          /* File descriptor for main journal */
  sqlite3_file *sjfd;         /* File descriptor for sub-journal */
  i64 journalOff;             /* Current write offset in the journal file */
  i64 journalHdr;             /* Byte offset to previous journal header */

  PagerSavepoint *aSavepoint; /* Array of active savepoints */
  int nSavepoint;             /* Number of elements in aSavepoint[] */
  char dbFileVers[16];        /* Changes whenever database file changes */
  u32 sectorSize;             /* Assumed sector size during rollback */

  u16 nExtra;                 /* Add this many bytes to each in-memory page */
  i16 nReserve;               /* Number of unused bytes at end of each page */
................................................................................
#ifdef SQLITE_HAS_CODEC
  void *(*xCodec)(void*,void*,Pgno,int); /* Routine for en/decoding data */
  void (*xCodecSizeChng)(void*,int,int); /* Notify of page size changes */
  void (*xCodecFree)(void*);             /* Destructor for the codec */
  void *pCodec;               /* First argument to xCodec... methods */
#endif
  char *pTmpSpace;            /* Pager.pageSize bytes of space for tmp use */
  i64 journalSizeLimit;       /* Size limit for persistent journal files */
  PCache *pPCache;            /* Pointer to page cache object */
  sqlite3_backup *pBackup;    /* Pointer to list of ongoing backup processes */
};

/*
** The following global variables hold counters used for
** testing purposes only.  These variables do not exist in
................................................................................
  ** database page size. Since the zHeader buffer is only Pager.pageSize
  ** bytes in size, more than one call to sqlite3OsWrite() may be required
  ** to populate the entire journal header sector.
  */ 
  for(nWrite=0; rc==SQLITE_OK&&nWrite<JOURNAL_HDR_SZ(pPager); nWrite+=nHeader){
    IOTRACE(("JHDR %p %lld %d\n", pPager, pPager->journalHdr, nHeader))
    rc = sqlite3OsWrite(pPager->jfd, zHeader, nHeader, pPager->journalOff);

    pPager->journalOff += nHeader;
  }

  return rc;
}

/*
................................................................................
   || pPager->journalMode==PAGER_JOURNALMODE_MEMORY 
   || pPager->journalMode==PAGER_JOURNALMODE_OFF 
  ){
    return SQLITE_OK;
  }
  pPager->setMaster = 1;
  assert( isOpen(pPager->jfd) );


  /* Calculate the length in bytes and the checksum of zMaster */
  for(nMaster=0; zMaster[nMaster]; nMaster++){
    cksum += zMaster[nMaster];
  }

  /* If in full-sync mode, advance to the next disk sector before writing
................................................................................
**
** If this is a savepoint rollback, then memory may have to be dynamically
** allocated by this function. If this is the case and an allocation fails,
** SQLITE_NOMEM is returned.
*/
static int pager_playback_one_page(
  Pager *pPager,                /* The pager being played back */
  int isMainJrnl,               /* 1 -> main journal. 0 -> sub-journal. */
  int isUnsync,                 /* True if reading from unsynced main journal */
  i64 *pOffset,                 /* Offset of record to playback */
  int isSavepnt,                /* True for a savepoint rollback */
  Bitvec *pDone                 /* Bitvec of pages already played back */
){
  int rc;
  PgHdr *pPg;                   /* An existing page in the cache */
  Pgno pgno;                    /* The page number of a page in journal */
  u32 cksum;                    /* Checksum used for sanity checking */
  char *aData;                  /* Temporary storage for the page */
  sqlite3_file *jfd;            /* The file descriptor for the journal file */


  assert( (isMainJrnl&~1)==0 );      /* isMainJrnl is 0 or 1 */
  assert( (isSavepnt&~1)==0 );       /* isSavepnt is 0 or 1 */
  assert( isMainJrnl || pDone );     /* pDone always used on sub-journals */
  assert( isSavepnt || pDone==0 );   /* pDone never used on non-savepoint */

  aData = pPager->pTmpSpace;
................................................................................
  */
  pPg = pager_lookup(pPager, pgno);
  assert( pPg || !MEMDB );
  PAGERTRACE(("PLAYBACK %d page %d hash(%08x) %s\n",
           PAGERID(pPager), pgno, pager_datahash(pPager->pageSize, (u8*)aData),
           (isMainJrnl?"main-journal":"sub-journal")
  ));





  if( (pPager->state>=PAGER_EXCLUSIVE)
   && (!isSavepnt || pPg==0 || 0==(pPg->flags&PGHDR_NEED_SYNC))
   && isOpen(pPager->fd)
   && !isUnsync

  ){
    i64 ofst = (pgno-1)*(i64)pPager->pageSize;
    testcase( !isSavepnt && pPg!=0 && (pPg->flags&PGHDR_NEED_SYNC)!=0 );
    rc = sqlite3OsWrite(pPager->fd, (u8*)aData, pPager->pageSize, ofst);
    if( pgno>pPager->dbFileSize ){
      pPager->dbFileSize = pgno;
    }
................................................................................
    pData = pPg->pData;
    memcpy(pData, (u8*)aData, pPager->pageSize);
    pPager->xReiniter(pPg);
    if( isMainJrnl && (!isSavepnt || *pOffset<=pPager->journalHdr) ){
      /* If the contents of this page were just restored from the main 
      ** journal file, then its content must be as they were when the 
      ** transaction was first opened. In this case we can mark the page
      ** as clean, since there will be no need to write it out to the.

      **
      ** There is one exception to this rule. If the page is being rolled
      ** back as part of a savepoint (or statement) rollback from an 
      ** unsynced portion of the main journal file, then it is not safe
      ** to mark the page as clean. This is because marking the page as
      ** clean will clear the PGHDR_NEED_SYNC flag. Since the page is
      ** already in the journal file (recorded in Pager.pInJournal) and
................................................................................
  needPagerReset = isHot;

  /* This loop terminates either when a readJournalHdr() or 
  ** pager_playback_one_page() call returns SQLITE_DONE or an IO error 
  ** occurs. 
  */
  while( 1 ){
    int isUnsync = 0;

    /* Read the next journal header from the journal file.  If there are
    ** not enough bytes left in the journal file for a complete header, or
    ** it is corrupted, then a process must of failed while writing it.
    ** This indicates nothing more needs to be rolled back.
    */
    rc = readJournalHdr(pPager, isHot, szJ, &nRec, &mxPg);
    if( rc!=SQLITE_OK ){ 
................................................................................
    ** the journal, it means that the journal might contain additional
    ** pages that need to be rolled back and that the number of pages 
    ** should be computed based on the journal file size.
    */
    if( nRec==0 && !isHot &&
        pPager->journalHdr+JOURNAL_HDR_SZ(pPager)==pPager->journalOff ){
      nRec = (int)((szJ - pPager->journalOff) / JOURNAL_PG_SZ(pPager));
      isUnsync = 1;
    }

    /* If this is the first header read from the journal, truncate the
    ** database file back to its original size.
    */
    if( pPager->journalOff==JOURNAL_HDR_SZ(pPager) ){
      rc = pager_truncate(pPager, mxPg);
................................................................................
    ** database file and/or page cache.
    */
    for(u=0; u<nRec; u++){
      if( needPagerReset ){
        pager_reset(pPager);
        needPagerReset = 0;
      }
      rc = pager_playback_one_page(pPager,1,isUnsync,&pPager->journalOff,0,0);
      if( rc!=SQLITE_OK ){
        if( rc==SQLITE_DONE ){
          rc = SQLITE_OK;
          pPager->journalOff = szJ;
          break;
        }else if( rc==SQLITE_IOERR_SHORT_READ ){
          /* If the journal has been truncated, simply stop reading and
................................................................................
  ** will be skipped automatically.  Pages are added to pDone as they
  ** are played back.
  */
  if( pSavepoint ){
    iHdrOff = pSavepoint->iHdrOffset ? pSavepoint->iHdrOffset : szJ;
    pPager->journalOff = pSavepoint->iOffset;
    while( rc==SQLITE_OK && pPager->journalOff<iHdrOff ){
      rc = pager_playback_one_page(pPager, 1, 0, &pPager->journalOff, 1, pDone);
    }
    assert( rc!=SQLITE_DONE );
  }else{
    pPager->journalOff = 0;
  }

  /* Continue rolling back records out of the main journal starting at
................................................................................
    */
    if( nJRec==0 
     && pPager->journalHdr+JOURNAL_HDR_SZ(pPager)==pPager->journalOff
    ){
      nJRec = (u32)((szJ - pPager->journalOff)/JOURNAL_PG_SZ(pPager));
    }
    for(ii=0; rc==SQLITE_OK && ii<nJRec && pPager->journalOff<szJ; ii++){
      rc = pager_playback_one_page(pPager, 1, 0, &pPager->journalOff, 1, pDone);
    }
    assert( rc!=SQLITE_DONE );
  }
  assert( rc!=SQLITE_OK || pPager->journalOff==szJ );

  /* Finally,  rollback pages from the sub-journal.  Page that were
  ** previously rolled back out of the main journal (and are hence in pDone)
................................................................................
  ** will be skipped.  Out-of-range pages are also skipped.
  */
  if( pSavepoint ){
    u32 ii;            /* Loop counter */
    i64 offset = pSavepoint->iSubRec*(4+pPager->pageSize);
    for(ii=pSavepoint->iSubRec; rc==SQLITE_OK && ii<pPager->nSubRec; ii++){
      assert( offset==ii*(4+pPager->pageSize) );
      rc = pager_playback_one_page(pPager, 0, 0, &offset, 1, pDone);
    }
    assert( rc!=SQLITE_DONE );
  }

  sqlite3BitvecDestroy(pDone);
  if( rc==SQLITE_OK ){
    pPager->journalOff = szJ;
................................................................................

      if( 0==(iDc&SQLITE_IOCAP_SAFE_APPEND) ){
        /* This block deals with an obscure problem. If the last connection
        ** that wrote to this database was operating in persistent-journal
        ** mode, then the journal file may at this point actually be larger
        ** than Pager.journalOff bytes. If the next thing in the journal
        ** file happens to be a journal-header (written as part of the
        ** previous connections transaction), and a crash or power-failure 
        ** occurs after nRec is updated but before this connection writes 
        ** anything else to the journal file (or commits/rolls back its 
        ** transaction), then SQLite may become confused when doing the 
        ** hot-journal rollback following recovery. It may roll back all
        ** of this connections data, then proceed to rolling back the old,
        ** out-of-date data that follows it. Database corruption.
        **
................................................................................
    }

    /* The journal file was just successfully synced. Set Pager.needSync 
    ** to zero and clear the PGHDR_NEED_SYNC flag on all pagess.
    */
    pPager->needSync = 0;
    pPager->journalStarted = 1;

    sqlite3PcacheClearSyncFlags(pPager->pPCache);
  }

  return SQLITE_OK;
}

/*
................................................................................
          }
        }
      }
      if( rc!=SQLITE_OK ){
        goto failed;
      }

      /* TODO: Why are these cleared here? Is it necessary? */

      pPager->journalStarted = 0;
      pPager->journalOff = 0;
      pPager->setMaster = 0;
      pPager->journalHdr = 0;


 
      /* Playback and delete the journal.  Drop the database write
      ** lock and reacquire the read lock. Purge the cache before
      ** playing back the hot-journal so that we don't end up with
      ** an inconsistent cache.



      */
      if( isOpen(pPager->jfd) ){







        rc = pager_playback(pPager, 1);

        if( rc!=SQLITE_OK ){
          rc = pager_error(pPager, rc);
          goto failed;
        }
      }
      assert( (pPager->state==PAGER_SHARED)
           || (pPager->exclusiveMode && pPager->state>PAGER_SHARED)
................................................................................
**
** If noContent is true, it means that we do not care about the contents
** of the page. This occurs in two seperate scenarios:
**
**   a) When reading a free-list leaf page from the database, and
**
**   b) When a savepoint is being rolled back and we need to load
**      a new page into the cache to populate with the data read
**      from the savepoint journal.
**
** If noContent is true, then the data returned is zeroed instead of
** being read from the database. Additionally, the bits corresponding
** to pgno in Pager.pInJournal (bitvec of pages already written to the
** journal file) and the PagerSavepoint.pInSavepoint bitvecs of any open
** savepoints are set. This means if the page is made writable at any
................................................................................
        u32 cksum;
        char *pData2;

        /* We should never write to the journal file the page that
        ** contains the database locks.  The following assert verifies
        ** that we do not. */
        assert( pPg->pgno!=PAGER_MJ_PGNO(pPager) );


        CODEC2(pPager, pData, pPg->pgno, 7, return SQLITE_NOMEM, pData2);
        cksum = pager_cksum(pPager, (u8*)pData2);
        rc = write32bits(pPager->jfd, pPager->journalOff, pPg->pgno);
        if( rc==SQLITE_OK ){
          rc = sqlite3OsWrite(pPager->jfd, pData2, pPager->pageSize,
                              pPager->journalOff + 4);
          pPager->journalOff += pPager->pageSize+4;







>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>







 







|
>







 







>
|
>
>







 







>







 







<







 







>







 







>







 







|
|
|
|
<







>







 







>
>
>
>
>

<

<
>







 







|
>







 







<
<







 







<







 







|







 







|







 







|







 







|







 







|







 







>







 







|
>




>
>




|
>
>
>


>
>
>
>
>
>
>
|
>







 







|







 







>
>







17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
...
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
...
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
...
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
...
391
392
393
394
395
396
397

398
399
400
401
402
403
404
...
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
....
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
....
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505

1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
....
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602

1603

1604
1605
1606
1607
1608
1609
1610
1611
....
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
....
1997
1998
1999
2000
2001
2002
2003


2004
2005
2006
2007
2008
2009
2010
....
2037
2038
2039
2040
2041
2042
2043

2044
2045
2046
2047
2048
2049
2050
....
2058
2059
2060
2061
2062
2063
2064
2065
2066
2067
2068
2069
2070
2071
2072
....
2211
2212
2213
2214
2215
2216
2217
2218
2219
2220
2221
2222
2223
2224
2225
....
2241
2242
2243
2244
2245
2246
2247
2248
2249
2250
2251
2252
2253
2254
2255
....
2256
2257
2258
2259
2260
2261
2262
2263
2264
2265
2266
2267
2268
2269
2270
....
2813
2814
2815
2816
2817
2818
2819
2820
2821
2822
2823
2824
2825
2826
2827
....
2885
2886
2887
2888
2889
2890
2891
2892
2893
2894
2895
2896
2897
2898
2899
....
3748
3749
3750
3751
3752
3753
3754
3755
3756
3757
3758
3759
3760
3761
3762
3763
3764
3765
3766
3767
3768
3769
3770
3771
3772
3773
3774
3775
3776
3777
3778
3779
3780
3781
3782
3783
3784
3785
3786
3787
3788
....
3880
3881
3882
3883
3884
3885
3886
3887
3888
3889
3890
3891
3892
3893
3894
....
4312
4313
4314
4315
4316
4317
4318
4319
4320
4321
4322
4323
4324
4325
4326
4327
** locking to prevent two processes from writing the same database
** file simultaneously, or one process from reading the database while
** another is writing.
*/
#ifndef SQLITE_OMIT_DISKIO
#include "sqliteInt.h"

/*
******************** NOTES ON THE DESIGN OF THE PAGER ************************
**
** Within this comment block, a page is deemed to have been synced
** automatically as soon as it is written when PRAGMA synchronous=OFF.
** Otherwise, the page is not synced until the xSync method of the VFS
** is called successfully on the file containing the page.
**
** Definition:  A page of the database file is said to be "overwriteable" if
** one or more of the following are true about the page:
** 
**     (a)  The original content of the page as it was at the beginning of
**          the transaction has been written into the rollback journal and
**          synced.
** 
**     (b)  The page was a freelist leaf page at the start of the transaction.
** 
**     (c)  The page number is greater than the largest page that existed in
**          the database file at the start of the transaction.
** 
** (1) A page of the database file is never overwritten unless one of the
**     following are true:
** 
**     (a) The page and all other pages on the same sector are overwriteable.
** 
**     (b) The atomic page write optimization is enabled, and the entire
**         transaction other than the update of the transaction sequence
**         number consists of a single page change.
** 
** (2) The content of a page written into the rollback journal exactly matches
**     both the content in the database when the rollback journal was written
**     and the content in the database at the beginning of the current
**     transaction.
** 
** (3) Writes to the database file are an integer multiple of the page size
**     in length and are aligned to a page boundary.
** 
** (4) Reads from the database file are either aligned on a page boundary and
**     an integer multiple of the page size in length or are taken from the
**     first 100 bytes of the database file.
** 
** (5) All writes to the database file are synced prior to the rollback journal
**     being deleted, truncated, or zeroed.
** 
** (6) If a master journal file is used, then all writes to the database file
**     are synced prior to the master journal being deleted.
** 
** Definition: Two databases (or the same database at two points it time)
** are said to be "logically equivalent" if they give the same answer to
** all queries.  Note in particular the the content of freelist leaf
** pages can be changed arbitarily without effecting the logical equivalence
** of the database.
** 
** (7) At any time, if any subset, including the empty set and the total set,
**     of the unsynced changes to a rollback journal are removed and the 
**     journal is rolled back, the resulting database file will be logical
**     equivalent to the database file at the beginning of the transaction.
** 
** (8) When a transaction is rolled back, the xTruncate method of the VFS
**     is called to restore the database file to the same size it was at
**     the beginning of the transaction.  (In some VFSes, the xTruncate
**     method is a no-op, but that does not change the fact the SQLite will
**     invoke it.)
** 
** (9) Whenever the database file is modified, at least one bit in the range
**     of bytes from 24 through 39 inclusive will be changed prior to releasing
**     the EXCLUSIVE lock.
**
** (10) The pattern of bits in bytes 24 through 39 shall not repeat in less
**      than one billion transactions.
**
** (11) A database file is well-formed at the beginning and at the conclusion
**      of every transaction.
**
** (12) An EXCLUSIVE lock is held on the database file when writing to
**      the database file.
**
** (13) A SHARED lock is held on the database file while reading any
**      content out of the database file.
*/

/*
** Macros for troubleshooting.  Normally turned off
*/
#if 0
int sqlite3PagerTrace=1;  /* True to enable tracing */
#define sqlite3DebugPrintf printf
#define PAGERTRACE(X)     if( sqlite3PagerTrace ){ sqlite3DebugPrintf X; }
................................................................................
**   It is cleared at the end of each transaction.
**
**   It is used when committing or otherwise ending a transaction. If
**   the dbModified flag is clear then less work has to be done.
**
** journalStarted
**
**   This flag is set whenever the the main journal is opened and
**   initialized
**
**   The point of this flag is that it must be set after the 
**   first journal header in a journal file has been synced to disk.
**   After this has happened, new pages appended to the database 
**   do not need the PGHDR_NEED_SYNC flag set, as they do not need
**   to wait for a journal sync before they can be written out to
**   the database file (see function pager_write()).
................................................................................
**   may attempt to commit the transaction again later (calling
**   CommitPhaseOne() again). This flag is used to ensure that the 
**   master journal name is only written to the journal file the first
**   time CommitPhaseOne() is called.
**
** doNotSync
**
**   When enabled, cache spills are prohibited and the journal file cannot
**   be synced.  This variable is set and cleared by sqlite3PagerWrite() 
**   in order to prevent a journal sync from happening in between the
**   journalling of two pages on the same sector.
**
** needSync
**
**   TODO: It might be easier to set this variable in writeJournalHdr()
**   and writeMasterJournal() only. Change its meaning to "unsynced data
**   has been written to the journal".
**
................................................................................
  u32 nSubRec;                /* Number of records written to sub-journal */
  Bitvec *pInJournal;         /* One bit for each page in the database file */
  sqlite3_file *fd;           /* File descriptor for database */
  sqlite3_file *jfd;          /* File descriptor for main journal */
  sqlite3_file *sjfd;         /* File descriptor for sub-journal */
  i64 journalOff;             /* Current write offset in the journal file */
  i64 journalHdr;             /* Byte offset to previous journal header */
  i64 journalSizeLimit;       /* Size limit for persistent journal files */
  PagerSavepoint *aSavepoint; /* Array of active savepoints */
  int nSavepoint;             /* Number of elements in aSavepoint[] */
  char dbFileVers[16];        /* Changes whenever database file changes */
  u32 sectorSize;             /* Assumed sector size during rollback */

  u16 nExtra;                 /* Add this many bytes to each in-memory page */
  i16 nReserve;               /* Number of unused bytes at end of each page */
................................................................................
#ifdef SQLITE_HAS_CODEC
  void *(*xCodec)(void*,void*,Pgno,int); /* Routine for en/decoding data */
  void (*xCodecSizeChng)(void*,int,int); /* Notify of page size changes */
  void (*xCodecFree)(void*);             /* Destructor for the codec */
  void *pCodec;               /* First argument to xCodec... methods */
#endif
  char *pTmpSpace;            /* Pager.pageSize bytes of space for tmp use */

  PCache *pPCache;            /* Pointer to page cache object */
  sqlite3_backup *pBackup;    /* Pointer to list of ongoing backup processes */
};

/*
** The following global variables hold counters used for
** testing purposes only.  These variables do not exist in
................................................................................
  ** database page size. Since the zHeader buffer is only Pager.pageSize
  ** bytes in size, more than one call to sqlite3OsWrite() may be required
  ** to populate the entire journal header sector.
  */ 
  for(nWrite=0; rc==SQLITE_OK&&nWrite<JOURNAL_HDR_SZ(pPager); nWrite+=nHeader){
    IOTRACE(("JHDR %p %lld %d\n", pPager, pPager->journalHdr, nHeader))
    rc = sqlite3OsWrite(pPager->jfd, zHeader, nHeader, pPager->journalOff);
    assert( pPager->journalHdr <= pPager->journalOff );
    pPager->journalOff += nHeader;
  }

  return rc;
}

/*
................................................................................
   || pPager->journalMode==PAGER_JOURNALMODE_MEMORY 
   || pPager->journalMode==PAGER_JOURNALMODE_OFF 
  ){
    return SQLITE_OK;
  }
  pPager->setMaster = 1;
  assert( isOpen(pPager->jfd) );
  assert( pPager->journalHdr <= pPager->journalOff );

  /* Calculate the length in bytes and the checksum of zMaster */
  for(nMaster=0; zMaster[nMaster]; nMaster++){
    cksum += zMaster[nMaster];
  }

  /* If in full-sync mode, advance to the next disk sector before writing
................................................................................
**
** If this is a savepoint rollback, then memory may have to be dynamically
** allocated by this function. If this is the case and an allocation fails,
** SQLITE_NOMEM is returned.
*/
static int pager_playback_one_page(
  Pager *pPager,                /* The pager being played back */
  i64 *pOffset,                 /* Offset of record to playback */
  Bitvec *pDone,                /* Bitvec of pages already played back */
  int isMainJrnl,               /* 1 -> main journal. 0 -> sub-journal. */
  int isSavepnt                 /* True for a savepoint rollback */

){
  int rc;
  PgHdr *pPg;                   /* An existing page in the cache */
  Pgno pgno;                    /* The page number of a page in journal */
  u32 cksum;                    /* Checksum used for sanity checking */
  char *aData;                  /* Temporary storage for the page */
  sqlite3_file *jfd;            /* The file descriptor for the journal file */
  int isSynced;                 /* True if journal page is synced */

  assert( (isMainJrnl&~1)==0 );      /* isMainJrnl is 0 or 1 */
  assert( (isSavepnt&~1)==0 );       /* isSavepnt is 0 or 1 */
  assert( isMainJrnl || pDone );     /* pDone always used on sub-journals */
  assert( isSavepnt || pDone==0 );   /* pDone never used on non-savepoint */

  aData = pPager->pTmpSpace;
................................................................................
  */
  pPg = pager_lookup(pPager, pgno);
  assert( pPg || !MEMDB );
  PAGERTRACE(("PLAYBACK %d page %d hash(%08x) %s\n",
           PAGERID(pPager), pgno, pager_datahash(pPager->pageSize, (u8*)aData),
           (isMainJrnl?"main-journal":"sub-journal")
  ));
  if( isMainJrnl ){
    isSynced = pPager->noSync || (*pOffset <= pPager->journalHdr);
  }else{
    isSynced = (pPg==0 || 0==(pPg->flags & PGHDR_NEED_SYNC));
  }
  if( (pPager->state>=PAGER_EXCLUSIVE)

   && isOpen(pPager->fd)

   && isSynced
  ){
    i64 ofst = (pgno-1)*(i64)pPager->pageSize;
    testcase( !isSavepnt && pPg!=0 && (pPg->flags&PGHDR_NEED_SYNC)!=0 );
    rc = sqlite3OsWrite(pPager->fd, (u8*)aData, pPager->pageSize, ofst);
    if( pgno>pPager->dbFileSize ){
      pPager->dbFileSize = pgno;
    }
................................................................................
    pData = pPg->pData;
    memcpy(pData, (u8*)aData, pPager->pageSize);
    pPager->xReiniter(pPg);
    if( isMainJrnl && (!isSavepnt || *pOffset<=pPager->journalHdr) ){
      /* If the contents of this page were just restored from the main 
      ** journal file, then its content must be as they were when the 
      ** transaction was first opened. In this case we can mark the page
      ** as clean, since there will be no need to write it out to the
      ** database.
      **
      ** There is one exception to this rule. If the page is being rolled
      ** back as part of a savepoint (or statement) rollback from an 
      ** unsynced portion of the main journal file, then it is not safe
      ** to mark the page as clean. This is because marking the page as
      ** clean will clear the PGHDR_NEED_SYNC flag. Since the page is
      ** already in the journal file (recorded in Pager.pInJournal) and
................................................................................
  needPagerReset = isHot;

  /* This loop terminates either when a readJournalHdr() or 
  ** pager_playback_one_page() call returns SQLITE_DONE or an IO error 
  ** occurs. 
  */
  while( 1 ){


    /* Read the next journal header from the journal file.  If there are
    ** not enough bytes left in the journal file for a complete header, or
    ** it is corrupted, then a process must of failed while writing it.
    ** This indicates nothing more needs to be rolled back.
    */
    rc = readJournalHdr(pPager, isHot, szJ, &nRec, &mxPg);
    if( rc!=SQLITE_OK ){ 
................................................................................
    ** the journal, it means that the journal might contain additional
    ** pages that need to be rolled back and that the number of pages 
    ** should be computed based on the journal file size.
    */
    if( nRec==0 && !isHot &&
        pPager->journalHdr+JOURNAL_HDR_SZ(pPager)==pPager->journalOff ){
      nRec = (int)((szJ - pPager->journalOff) / JOURNAL_PG_SZ(pPager));

    }

    /* If this is the first header read from the journal, truncate the
    ** database file back to its original size.
    */
    if( pPager->journalOff==JOURNAL_HDR_SZ(pPager) ){
      rc = pager_truncate(pPager, mxPg);
................................................................................
    ** database file and/or page cache.
    */
    for(u=0; u<nRec; u++){
      if( needPagerReset ){
        pager_reset(pPager);
        needPagerReset = 0;
      }
      rc = pager_playback_one_page(pPager,&pPager->journalOff,0,1,0);
      if( rc!=SQLITE_OK ){
        if( rc==SQLITE_DONE ){
          rc = SQLITE_OK;
          pPager->journalOff = szJ;
          break;
        }else if( rc==SQLITE_IOERR_SHORT_READ ){
          /* If the journal has been truncated, simply stop reading and
................................................................................
  ** will be skipped automatically.  Pages are added to pDone as they
  ** are played back.
  */
  if( pSavepoint ){
    iHdrOff = pSavepoint->iHdrOffset ? pSavepoint->iHdrOffset : szJ;
    pPager->journalOff = pSavepoint->iOffset;
    while( rc==SQLITE_OK && pPager->journalOff<iHdrOff ){
      rc = pager_playback_one_page(pPager, &pPager->journalOff, pDone, 1, 1);
    }
    assert( rc!=SQLITE_DONE );
  }else{
    pPager->journalOff = 0;
  }

  /* Continue rolling back records out of the main journal starting at
................................................................................
    */
    if( nJRec==0 
     && pPager->journalHdr+JOURNAL_HDR_SZ(pPager)==pPager->journalOff
    ){
      nJRec = (u32)((szJ - pPager->journalOff)/JOURNAL_PG_SZ(pPager));
    }
    for(ii=0; rc==SQLITE_OK && ii<nJRec && pPager->journalOff<szJ; ii++){
      rc = pager_playback_one_page(pPager, &pPager->journalOff, pDone, 1, 1);
    }
    assert( rc!=SQLITE_DONE );
  }
  assert( rc!=SQLITE_OK || pPager->journalOff==szJ );

  /* Finally,  rollback pages from the sub-journal.  Page that were
  ** previously rolled back out of the main journal (and are hence in pDone)
................................................................................
  ** will be skipped.  Out-of-range pages are also skipped.
  */
  if( pSavepoint ){
    u32 ii;            /* Loop counter */
    i64 offset = pSavepoint->iSubRec*(4+pPager->pageSize);
    for(ii=pSavepoint->iSubRec; rc==SQLITE_OK && ii<pPager->nSubRec; ii++){
      assert( offset==ii*(4+pPager->pageSize) );
      rc = pager_playback_one_page(pPager, &offset, pDone, 0, 1);
    }
    assert( rc!=SQLITE_DONE );
  }

  sqlite3BitvecDestroy(pDone);
  if( rc==SQLITE_OK ){
    pPager->journalOff = szJ;
................................................................................

      if( 0==(iDc&SQLITE_IOCAP_SAFE_APPEND) ){
        /* This block deals with an obscure problem. If the last connection
        ** that wrote to this database was operating in persistent-journal
        ** mode, then the journal file may at this point actually be larger
        ** than Pager.journalOff bytes. If the next thing in the journal
        ** file happens to be a journal-header (written as part of the
        ** previous connection's transaction), and a crash or power-failure 
        ** occurs after nRec is updated but before this connection writes 
        ** anything else to the journal file (or commits/rolls back its 
        ** transaction), then SQLite may become confused when doing the 
        ** hot-journal rollback following recovery. It may roll back all
        ** of this connections data, then proceed to rolling back the old,
        ** out-of-date data that follows it. Database corruption.
        **
................................................................................
    }

    /* The journal file was just successfully synced. Set Pager.needSync 
    ** to zero and clear the PGHDR_NEED_SYNC flag on all pagess.
    */
    pPager->needSync = 0;
    pPager->journalStarted = 1;
    pPager->journalHdr = pPager->journalOff;
    sqlite3PcacheClearSyncFlags(pPager->pPCache);
  }

  return SQLITE_OK;
}

/*
................................................................................
          }
        }
      }
      if( rc!=SQLITE_OK ){
        goto failed;
      }

      /* Reset the journal status fields to indicates that we have no
      ** rollback journal at this time. */
      pPager->journalStarted = 0;
      pPager->journalOff = 0;
      pPager->setMaster = 0;
      pPager->journalHdr = 0;

      /* Make sure the journal file has been synced to disk. */
 
      /* Playback and delete the journal.  Drop the database write
      ** lock and reacquire the read lock. Purge the cache before
      ** playing back the hot-journal so that we don't end up with
      ** an inconsistent cache.  Sync the hot journal before playing
      ** it back since the process that crashed and left the hot journal
      ** probably did not sync it and we are required to always sync
      ** the journal before playing it back.
      */
      if( isOpen(pPager->jfd) ){
        if( !pPager->noSync ){
          rc = sqlite3OsSync(pPager->jfd, SQLITE_SYNC_NORMAL);
        }
        if( rc==SQLITE_OK ){
          rc = sqlite3OsFileSize(pPager->jfd, &pPager->journalHdr);
        }
        if( rc==SQLITE_OK ){
          rc = pager_playback(pPager, 1);
        }
        if( rc!=SQLITE_OK ){
          rc = pager_error(pPager, rc);
          goto failed;
        }
      }
      assert( (pPager->state==PAGER_SHARED)
           || (pPager->exclusiveMode && pPager->state>PAGER_SHARED)
................................................................................
**
** If noContent is true, it means that we do not care about the contents
** of the page. This occurs in two seperate scenarios:
**
**   a) When reading a free-list leaf page from the database, and
**
**   b) When a savepoint is being rolled back and we need to load
**      a new page into the cache to be filled with the data read
**      from the savepoint journal.
**
** If noContent is true, then the data returned is zeroed instead of
** being read from the database. Additionally, the bits corresponding
** to pgno in Pager.pInJournal (bitvec of pages already written to the
** journal file) and the PagerSavepoint.pInSavepoint bitvecs of any open
** savepoints are set. This means if the page is made writable at any
................................................................................
        u32 cksum;
        char *pData2;

        /* We should never write to the journal file the page that
        ** contains the database locks.  The following assert verifies
        ** that we do not. */
        assert( pPg->pgno!=PAGER_MJ_PGNO(pPager) );

        assert( pPager->journalHdr <= pPager->journalOff );
        CODEC2(pPager, pData, pPg->pgno, 7, return SQLITE_NOMEM, pData2);
        cksum = pager_cksum(pPager, (u8*)pData2);
        rc = write32bits(pPager->jfd, pPager->journalOff, pPg->pgno);
        if( rc==SQLITE_OK ){
          rc = sqlite3OsWrite(pPager->jfd, pData2, pPager->pageSize,
                              pPager->journalOff + 4);
          pPager->journalOff += pPager->pageSize+4;