Hard Reset Causing Database Corruption on iOS
(1.1) By Jim Borden (borrrden) on 2022-05-19 21:51:06 edited from 1.0 [source]
For the past few days I've been chasing down a third party report of database corruption with no success. SQLite is being used via Couchbase Lite Xamarin iOS on an iPhone 6S+ with iOS 13.5.1. I tried to follow what the third party report claimed to be doing, with no success, until on a hunch I decided to do a hard reset (hold home and power until the device abruptly powers off) to simulate perhaps a kernel panic, OS crash, or potentially faulty out of battery handling. The database corruption happened on the first try of this (though further attempts have not happened on the first try, that was just luck).
At first I thought it could be due to memory mapping, as Couchbase Lite faced a similar issue back in late 2016 on Mac OS X (which obviously shares the Darwin kernel). However, after disabling mmap_size the corruption happened again after 4 hard reset cycles. I also tried PRAGMA synchronous=FULL
but no luck with that either, 5 reset cycles caused another corruption.
To be clear, the first thing that the long running test suite does is run PRAGMA integrity_check
and so any corruption is detected pretty early on with minimal interaction from Couchbase Lite (probably simply running a few pragmas and checking the user version). The damage is extensive with dozens of lines reporting bTreeInitPage
failures. Running showdb pgidx
on the database file we received from the third party also shows dozens if not hundreds of invalid page numbers and such. So I would guess it is something off with filesystem sync. I will note that when I enabled PRAGMA fullfsync = ON
that the problem either goes away or becomes much less likely to be hit. I tried full fsync (which appears to now be available on iOS, or maybe it always was) both with and without mmap_size and both survived 10 hard reset loops.
The reason I am writing this is just to bring awareness to these findings in case they are useful, and to ask if there is something else I should try. fullfsync obviously has a very significant performance impact for an issue that has occurred very rarely (if this is not the only third party report, it is one of two that we have received in the better part of a decade).
(2) By Simon Slavin (slavin) on 2022-05-19 05:58:13 in reply to 1.0 [link] [source]
What are you (or apparently Couchbase) using to execute SQLite commands ? Are you calling the SQLite C API, or using a shim for another language, or using a full third-party SQLite library ?
(8.2) By Jim Borden (borrrden) on 2022-05-19 22:02:26 edited from 8.1 in reply to 2 [link] [source]
We use SQLiteCpp which is basically RIIA thin class wrappers around SQLite calls. I totally was not clear that I am posting on behalf of Couchbase and we received the report from a developer of an app using our library, which received a report from one of their users.
(3) By Donal Fellows (dkfellows) on 2022-05-19 08:22:21 in reply to 1.0 [link] [source]
I will note that when I enabled
PRAGMA fullfsync = ON
that the problem either goes away or becomes much less likely to be hit.
Yes. You're exactly seeing the symptoms of what happens when the OS lies about what happens in a sync; the changes to the database aren't durable, and that can result in corruption, depending on what pending changes still queued get dropped by the OS, firmware, and hardware. It's something that user-level code has very little control over.
If you want your data to survive on Apple hardware, you need that pragma and you need to take the performance hit (only down to about the same level of performance that you normally get when running on other operating systems). If you do not do that, the integrity of your data is a house built on sand. (Not all code has those durability requirements; I have an application where I turn off all syncing because I truly don't care about the data if the system resets, but I know for sure that that's a very application-specific decision.)
(4) By KIT.james (kjames3411) on 2022-05-19 13:45:36 in reply to 3 [link] [source]
To be clear, the "lie" part happens on all OSes except macOS. macOS provides the F_FULLFSYNC to cover up for this (also because it is the only OS that can have full control of all hardware used in macs; hardware that does not lie).
(7.1) By Jim Borden (borrrden) on 2022-05-19 22:20:43 edited from 7.0 in reply to 4 [link] [source]
Does that mean that all operating systems are prone to this situation? We had another report on Android a year or two ago that we never solved. We came to the conclusion that it was probably faulty SD cards that they were using but does this mean that this same situation can happen on Android and also there is nothing that can be done about it?
EDIT Assuming obedient drives, is this something that the linux syncfs call could help with for Android?
(9) By Simon Slavin (slavin) on 2022-05-19 23:36:59 in reply to 7.1 [link] [source]
You are experiencing problems which happen a lot on mobile devices, no matter which operating system they use. Mobile devices keep their permanent storage in Flash memory. SD cards are examples of Flash memory. Flash memory has a big problem: writing to it is very slow. Not just a bit slow, but so slow that users would complain your phone isn't working.
So the operating system has two options: either write everything to storage when asked to (so slow it annoys the user) or cache writes in memory until idle (but if the phone loses power or crashes, they're gone).
A good operating system detects when the phone is idle, and writes its cache to storage when the user isn't caring about the phone screen. This is more difficult than it sounds, and takes more memory and battery power than you'd think. iOS (which is a GUI over Darwin, a version of Unix) does this one way. Android does it another way. They both have problems.
So yes, choosing to set synch options mean that saving data takes an annoyingly long time, making your app seem laggy. But it's the only way to do it if your data has to be saved even though the phone may run out of power or crash.
You are already doing the hard part: working out which 'sync' commands actually do sync and which just pretend to.
(11) By KIT.james (kjames3411) on 2022-05-26 12:24:12 in reply to 7.1 [link] [source]
Yes. This is why people making serious barebones projects often use Apple hardware or any hardware that puts quality before money$$$
(5.1) By Scott Perry (numist) on 2022-05-26 23:15:25 edited from 5.0 in reply to 3 [link] [source]
borrrden: If you're using the system-provided SQLite, it would be very helpful if you attached reproducing code to a feedback report and shared the ID here. If you're not using the system-provided SQLite, I'm sure Richard would like to know more about your configuration.
(6.1) By Jim Borden (borrrden) on 2022-05-19 22:39:44 edited from 6.0 in reply to 5.0 [link] [source]
We are building our own and statically linking it, thanks for the suggestions!
EDIT Is upstream support for F_BARRIERFSYNC
planned?
(10.1) By Scott Perry (numist) on 2022-05-26 23:16:25 edited from 10.0 in reply to 6.1 [link] [source]
Yes, reducing divergence between our trees is currently a priority but there's a non-trivial amount of process required. That said, I misspoke in my original reply—barrier vs full syncing is not your problem here.