How exactly does corruption happen during WAL checkpoint?

(1) By esiy (tempacc423513) on 2024-02-02 08:57:47 [source]

One of the instances mentioned in the docs [1] when it comes to WAL and DB corruption is (section 3.1):

STATEMENT A:

"In WAL mode, the only time that a failed sync operation can cause database corruption is during a checkpoint operation."

The checkpointing itself is described in the docs [2] as a couple of syncs (section 4.3):
- first of WAL itself; prior to the merging the WAL records with the DB; and 
- then of the DB; prior to resetting the WAL.

However, also in the docs [3] regarding performance considerations (section 2.3), it is pointed out that despite the checkpointer attempting "as many sequential page writes" that are interspersed by "many seek operations", even when "a separate thread" is used for checkpointing (so it does not happen with the commit that got it past WAL limit), then the thread "that is doing database queries and updates will never block on a sync operation." Concluding that:

STATEMENT B:

"The downside to this configuration is that transactions are no longer durable and might rollback following a power failure or hard reset."

So, *other than* the loss of durability, what is meant by the "corruption" in STATEMENT A above and how exactly does it occur?

Thanks anyone who helps me not to go read through the code to figure this out.

[1] https://www.sqlite.org/howtocorrupt.html
[2] http://www.sqlite.org/draft/fileformat2.html#walformat
[3] https://www.sqlite.org/wal.html

(2) By Dan Kennedy (dan) on 2024-02-02 10:54:42 in reply to 1 [link] [source]

That section of the docs is talking about hardware or software that does not properly implement the sync operation - i.e. a lying system that tells SQLite that data has been stored safely on the persistent media when it is really still in a cache and will be flushed to disk later.

After SQLite checkpoints a wal file (copies all the data from the wal to the db file) it syncs the database file. And then may overwrite, truncate or delete the wal file. If at that point the OS crashes, or the power fails, and the sync operation was not implemented correctly, some of the data from the wal file might not have made it to persistent storage. And hence the db file will be found to be corrupted when the system recovers.

You really only have a problem if the power fails. That's why hardware/software with a broken sync implementation is able to exist and work under most circumstances.

Dan.

(3.2) By esiy (tempacc423513) on 2024-02-02 13:55:27 edited from 3.1 in reply to 2 [link] [source]

Thank you very much, Dan.

The case with e.g. power loss or data not flushed onto persistent storage is quite straightforward.

I better add context in which I was asking all this. I am aware this would be better handled by backup API, but it's not in scope of this question:

Suppose a file backup tool is copying out the database files on a running system. This means the directory in which the main and/or WAL file might be present is copied out in whichever order.

What is the *worst case* when it comes to the state of that "backed up" SQLite database? I am fully aware durability cannot be guaranteed, however due to the nature of the WAL format and how checkpointing happens, could this result in the DB getting corrupt other than having lost some writes?

My understanding is the worst case scenario is the main file is copied out first, then middle of checkpointing, the WAL file is copied out (or it's been deleted). Other than having just lost (some of) the WAL records and/or some of the pages in the main file having been merged while others not (which happens chronologically, so no inconsistency), what worse could happen?

Thank you.

(4) By Dan Kennedy (dan) on 2024-02-02 13:58:21 in reply to 3.0 [link] [source]

My understanding is the worst case scenario is the DB file is copied out first, then middle of checkpointing, the WAL file is copied out (or it's been deleted) out. Other than having just lost (some of) the WAL records and some of the pages in the main file having been merged while others not (which happens chronologically, so no inconsistency)?

Database file could actually be corrupted at the b-tree level.

Not all checkpoints leave the db file in such a state where it would not look corrupted even if the wal was deleted. The reason is that when checkpointing, SQLite does not copy a page from the wal to the db if there exists a newer version of that page anywhere within the wal. Including in parts of the wal file that cannot be checkpointed due to reader locks.

Dan.

(5) By esiy (tempacc423513) on 2024-02-02 15:35:44 in reply to 4 [link] [source]

Thanks a lot! Thank makes sense now, so basically:

1. Having a copy of the main (DB) file without *corresponding* (from the exact same point in time) WAL is risky in case the last checkpoint had to stop until a reader is done with a page.

2. This must mean however, that also during e.g. power loss (assume reliable fsync), in case it happened exactly after such "hot" checkpoint, when WAL was properly flushed, but the main (DB) file not, there might be the same form of corruption since the WAL-index was lost, or is there any information in the "new" WAL and the "old" main (DB) file to recover from this under all circumstances?

(6) By Dan Kennedy (dan) on 2024-02-02 16:48:08 in reply to 5 [link] [source]

This must mean however, that also during e.g. power loss (assume reliable fsync), in case it happened exactly after such "hot" checkpoint, when WAL was properly flushed, but the main (DB) file not, there might be the same form of corruption since the WAL-index was lost, or is there any information in the "new" WAL and the "old" main (DB) file to recover from this under all circumstances?

If the wal-index is lost (the *-shm file), then it is reconstructed by parsing the entire wal file from start to finish next time a client opens the db. So losing it doesn't matter.

And we can't lose the wal file itself before the db file has been successfully synced, as SQLite does not delete or truncate the wal file until after the successful sync.

I don't think there is a chance of corruption in this case then. Not if sync is working.

Dan.

(7) By esiy (tempacc423513) on 2024-02-02 17:58:15 in reply to 6 [link] [source]

Alright, one last thing if you still have patience with me.

I understand that after a WAL reset, the salts in it are updated to invalidate what was already copied into the main file. But what if this newly reset WAL file is not synced prior to a power loss, i.e. the loss occurs after the new state of the db was synced, but before the reset. Again assume fsync itself is reliable on the system. Is there any mechanism to know that what remained in the WAL *had already been checkpointed* or did I misread something about the way checkpointing works and am asking the wrong way around?

Thanks a lot again for the quick and precise responses!

(9) By Simon Slavin (slavin) on 2024-02-03 14:12:57 in reply to 7 [link] [source]

Upon opening the WAL file, SQLite will realise it does not correspond to the version of the database file it has, and ignore anything in the WAL file. (Explanation simplified for brevity.)

(10) By Dan Kennedy (dan) on 2024-02-03 20:16:12 in reply to 7 [link] [source]

The newly reset wal file is synced after writing the first sector (usually 512 bytes), which is always before the first "transaction committed" frame. So a checkpoint and wal file restart is:

Sync wal file.
Copy data from wal file to database file.
Sync db file.
Update wal file header to restart it.
Sync wal file.
Proceed with writing new transactions to wal file.

If the system fails during step (4) there are two possibilities following recovery - either (a) 1 or more bytes of the wal file header were modified, in which case the checksum will fail and SQLite will ignore the entire wal file, or (b) the entire wal file will remain as it was in step (2). In that case SQLite will need to redo the checkpoint operation, but that's harmless - it just overwrites database pages with the data they already contain.

Dan.

(12) By esiy (tempacc423513) on 2024-02-08 03:40:26 in reply to 10 [link] [source]

This is now crystal clear. Thank you!

(11.1) By esiy (tempacc423513) on 2024-02-08 03:57:31 edited from 11.0 in reply to 4 [link] [source]

I wondered if I could revisit this one as I went to check on sqlite3 specifics when it comes to its pages.

I understand that during a WAL checkpoint, pages cannot be transferred if they are past end mark of any current  reader. It's also clear that only the most recent version of an altered page is to be transferred.

1. Is the only case you meant a situation when there's a leaf page that got transferred, but interior page could not because there's a newer version of it past end mark of a reader that is not done during current invocation of the checkpointer?

2. If the above (1) is the only case, would this be detectable through any sort of checksum mechanism? I suppose not as the only one I found was per page. (Suppose missing or non-matching WAL file as well.)

3. If the above (1) is the only case, this should be recoverable by reconstructing the table b-tree structure from the leaves. And then also reconstruct the index/indices. Whilst the data itself might be stale, it should be consistent from some past time (prior to the checkpoint), correct?

4. Is there any command to run that will safely take care of (3)?

5. Is there a command to indicate an unclean shutdown (whether it resulted in corruption or not)?

Thanks a lot in advance!

(14) By Dan Kennedy (dan) on 2024-02-08 10:58:43 in reply to 11.1 [link] [source]

Is the only case you meant a situation when there's a leaf page that got transferred, but interior page could not because there's a newer version of it past end mark of a reader that is not done during current invocation of the checkpointer?

I don't think you can make any statements like that. It's possible an interior page could be transfered to the db file when its leaves are not. Or it's possible for an index page to be transfered when the corresponding table page is not, etc.

Is there a command to indicate an unclean shutdown (whether it resulted in corruption or not)?

Well, the *-wal file will still be in the file-system. Apart from that, I don't think there is any way to detect an unclean shutdown.

As far as I know, everybody who makes backups this way organizes to suspend write activity while they do it.

Dan.

(8) By Simon Slavin (slavin) on 2024-02-03 14:10:36 in reply to 3.2 [link] [source]

SQLite is designed to fail well. So if the main database and the WAL file are copied at different times, and SQLite told to open the database later, it should look at both files and figure out what combination of data will lead to an uncorrupted database. If you do something completely stupid, like provide a WAL file from a different database, SQLite should figure that out.

The #1 priority is 'uncorrupted database'. In order to do this it may sacrifice recent changes which one file knows about and the other does not.

In analysing the corrupt SQLite files I've seen over the years (one of the things I was paid to do), the corruption were caused by

non-SQLite software or scripts, intentionally or unintentionally, writing to the database file or to memory allocated to SQLite which nothing else should touch
faulty hardware or repeated power failures (which is faulty hardware !)
anti-virus software overwriting the database or journal file, or deleting one of them while SQLite had the database open, or
one single incident of a bug in SQL, fixed years ago, in which the data could be rescued using the SQLite CLI

(13) By esiy (tempacc423513) on 2024-02-08 03:48:05 in reply to 8 [link] [source]

Thank you too, Simon!

I appreciate the empirical experience, I just specifically posted this question to better understand the WAL in particular. I was under the impression it guarantees everything but durability in case of e.g. hardware failure - your case (2). I wanted to specifically understand what is meant by corruption in paragraph 3 of section 3.1. of the docs [1].

I have since gone to check some of the internals described in the pretty comprehensive docs, which are great. But the answer from Dan makes me now rethink how I understood WAL in general, which is valuable.

[1] https://www.sqlite.org/howtocorrupt.html