Help needed! Database corruption on embedded system
(1.1) Originally by RaresV (vasile.rares93) with edits by Richard Hipp (drh) on 2020-03-14 15:05:07 from 1.0 [source]
I am trying to use SQLite on an embedded system(on a PowerPC device) but I am facing a corruption error.
The VFS is created over Elm-Chan FatFS stack(used in conjunction with a USB stick) and is very similar to https://github.com/xluoly/raw-os_sqlite/blob/master/sqlite/os_raw.c
The following steps will always reproduce the corruption:
- Create a new DB
- Execute SQL: DROP TABLE IF EXISTS CARS;CREATE TABLE CARS ('id' int, 'value' int);
- Do 4011 inserts: "INSERT INTO CARS VALUES(i, 1000 - i);" , where i = [0-4010]
At the next operation on the DB the SQLite will fail with SQLITE_CORRUPT and "database disk image is corrupted" message. When doing an integrity check over the DB in DB Browser for SQLite, will result in
"*** in database main *** On tree page 68 cell 0: invalid page number 27656608 Page 100 is never used"
I ported the same VFS to Windows, using equivalent file operations and the comparison between PC and embedded databases yielded some differences on 0x20E-0x3FA and 0x8492-0x87FA. The main observation is that on the embedded DB, the regions specified earlier are filled with 0.
Any ideas on how to debug this error? I assume it is somehow caused by the filesystem or the USB drive.
Thank you, Rares
I don't think it's the zeroes that are the corrupt part of the db. Those bytes are most likely unused - and therefore not explicitly set. The embedded platform zeroed them, the win32 platform did not. No big deal.
In the second screenshot, the 4 different byte values on the 85F8 line are "00 00 00 64" on the PC (big-endian integer value 100) and on the embedded platform are "01 A6 01 A0" (big-endian value 27656608). Which are the two values in the corruption report from integrity-check.
Do you get, byte-for-byte, the same corrupt db file every time you run the test? If so, you could try to figure out when and why the 0x01A601A0 value is being written.
(3.1) Originally by RaresV (vasile.rares93) with edits by Richard Hipp (drh) on 2020-03-14 15:07:02 from 3.0 in reply to 2 [link] [source]
Yes, the corrupted DB is the same every time.
I modified the VFS to set a sw breakpoint when the offset corresponds to the corrupted zone.
It looks like the data is "generated" by
pager_write_pagelist(). Screenshot with the stack frame and data can be found in https://1drv.ms/u/s!AkNyZuYZUo8xumhWdCsHHajf3udU?e=H6GD0e
I will continue to investigate, but if you have an idea where I should look please let me know.
Can you set watchpoints in the debugger?
If so, and you can manage to set a watchpoint to trigger when those bytes are written into the buffer, then that stack trace might tell us something.
It's complex. :)
First, set a breakpoint here:
The case you want to catch is when (*pPgno == 68). The buffer that is eventually being passed via sqlite3OsWrite() to your rawWrite() function is there at:
You then want to set a watchpoint at offset 8 of that buffer. Either on a 4 byte or 1 byte variable, probably doesn't matter which.
Math check: Page size is 512 bytes. Page 68 begins at offset (512*67)==34304. Plus 8 makes 34312, or 0x8608 in hex. Which is the offset of the corrupted bytes.
A hunch: Have you set sqlite3PendingByte? It's not your locking function (whatever sqlite3OsLock() calls) writing to those offsets by any chance?
How many bytes long is the file after the 4011th INSERT ?
Have you created any other databases that long on the same platform which do not get reported as corrupt ?
Don't worry about unused data areas being filled with zeros. This is a consequence of how the operating system hands you new space. On some platforms it fills it with zeros to prevent a security vulnerability, on others it does not.
(6) By RaresV (vasile.rares93) on 2020-03-15 18:36:41 in reply to 5 [link] [source]
Due to a momentary hunch, I changed the GCC's sysroot to NewLib instead of EWL and it works.
I was able to do 10000 inserts and the DB is not corrupted.
I suspect that a system function corrupts somehow the memory. But the fix could be also a side effect of the variable placement after the sysroot was changed.
I will continue the investigation and let you know if I find any root cause for this.