Rare Corruption using SQLite WASM

(1) By Randy (randyl) on 2023-10-23 17:20:40 [source]

I know there is another thread (https://sqlite.org/forum/forumpost/65b08a58c9b36415) discussing issues with SQLite WASM that affect a fraction of users. I want to share my experience of a similar issue, so I'm starting a new thread.

I have a Chrome extension with around 40K active users. It's been using WebSQL for the past 10 years, and last month it switched to SQLite WASM on OPFS, which involved migrating the user's data to the new storage system. Here are some of the things I've encountered:

Over half a dozen users reported that their data suddenly disappeared after the migration. In nearly every case, it's been caused by the user running a program called CCleaner, which is deleting the extension's data in OPFS. Unchecking the "Internet Cache" box in CCleaner seems to solve the problem.
When Chrome starts up, the extension runs a background job to perform some administrative queries on the database. If the user tries to open the extension while that's running, they get a "SQLITE_BUSY: sqlite3 result code 5: database is locked" error. Once the administrative queries finish, everything works as normal. From what I understand, that's the expected behavior in SQLite WASM; it's just different behavior from WebSQL.
When the extension was initially deployed, a few users got the error "sqlite3 result code 1: no such vfs: opfs" on Chrome >= 109. Rebooting seemed to fix the error. Weird!

Out of 40K users, I've gotten roughly 40 reports about database corruption. Usually it's the error "SQLITE_CORRUPT: sqlite3 result code 11: database disk image is malformed". My advice has been to reinstall the extension and import their data from a backup, but eventually the corruption happens to them again. I haven't been able to reproduce the corruption myself. So far, I haven't noticed any pattern to the errors, other than always happening on Windows (versions 8, 10, and 11). And it's happened on Chrome versions 109, 117, and 118, that I know of.

One user who is running Windows 10 21H2 LTSC claims they can semi-reliably corrupt their 400 MB database by restarting Windows while Chrome is running. They say that closing Chrome first and then restarting Windows does not trigger corruption. Don't know what version of Chrome they're using. I did try this myself on Windows 10 and 11, but didn't encounter any corruption.

The extension has 2 tables plus an FTS5 virtual table. It mostly uses the Promiser API but does use OO1 in a couple of narrowly-focused workers, and uses SQLite's "opfs" vfs.

In the other thread mentioned above, there's a post about whether WASM heap corruption could be the culprit. I don't believe that's the case for me, but I'm not entirely sure what type of code would trigger that.

If anyone has advice on how to fix or even just debug the corruption issues, please let me know. Otherwise, I just wanted to share my experience.

Thanks, Randy

(2) By Anton Dyachenko (ahtoh_) on 2023-10-23 20:14:47 in reply to 1 [link] [source]

I am not familiar with SqliteWasm and don't know limitation sof this version of sqlite but if you suspect opfs and it is not limmited by wasm you can write a trivial shim vfs on top of the opfs that logs what opfs did. Try first search if these logs are already part of opfs you might just switch them on if there are already there. Then you need to either provide a loadable extension with your shim vfs and ask users to load it or you need to create a separate build for them that does this automatically. You can give this special version to those who can relyably reproduce the corruption and then try to analysze the logs or it might be usefull to provide these logs for opfs implementors. In the later case I would ask them what kind of logs would be usefull to investigate corruptions as otherwise you will have to logs pretty much everything and this is quite a lot of data.

(4) By Stephan Beal (stephan) on 2023-10-23 20:26:10 in reply to 2 [link] [source]

Then you need to either provide a loadable extension with your shim vfs and ask users to load it or you need to create a separate build for them that does this automatically.

That part doesn't work in wasm. There has, i've read, been some work in Emscripten to enable loading of DLLs via wasm, but that capability is not part of the current wasm standard.

That said, perhaps we can add such a VFS shim to the JS code which users could enable for logging purposes. The ironic part of that would be that it would introduce yet one more layer for things to go wrong, increasing the risk (however slightly) of corruption while also assisting in tracking it down.

(3) By Stephan Beal (stephan) on 2023-10-23 20:22:42 in reply to 1 [link] [source]

Here are some of the things I've encountered:

Thank you very much for that. Regarding the final one: "no such vfs" can happen for a number of reasons, e.g. delay in loading the Worker which handles the async parts of the OPFS API or the lack of the COOP/COEP HTTP headers. That it is repaired by restarting suggests some environment-specific factor is triggering it.

The idea of WASM heap corruption mentioned in the thread your linked to was really just a shot in the dark. We have had cases of heap corruption caused by faulty test code which (mis)wrote to the WASM heap, but we're not aware of any out-of-bounds writes within the library. When and where WASM heap corruption happens, tracking it down generally involves hair-pulling sessions of guesswork, trial, error, and much questioning one's life choices :/.

One user who is running Windows 10 21H2 LTSC claims they can semi-reliably corrupt their 400 MB database by restarting Windows while Chrome is running.

That doesn't really come as a surprise, does it?

OPFS is a black-box abstraction running in a JavaScript VM in a much larger application framework, all in user-space. It should come as no surprise that that cannot realistically be anywhere near as reliable as a kernel-hosted filesystem, especially when it comes to untimely power loss or crashes in the surrounding native code.

If anyone has advice on how to fix or even just debug the corruption issues, please let me know.

So far we have no concrete insights into specific causes of corruption. Every test we've run and every bit of feedback we've gotten from others suggest that the primary causes of corruption are environment-specific factors (of which there are countless potential culprits). As the author of this particular OPFS VFS, i'll be the very last one to claim that it's 100% certain to be bug-free, but no corruption-inducing bugs are currently known in that particular JS code.

Having a whole stack of code between the public OPFS JS APIs (not to mention that there are at least three separate OPFS implementations) and the physical storage, as well as 3rd-party apps like the CCleaner you mention, leaves a whole universe of potential for bugs and misbehaviors above and beyond any i may personally have authored.

Otherwise, I just wanted to share my experience.

As unsatisfying as this is for both of us, at this point all we can do is keep our ears to the ground and collect experiences like yours to gather more info about possible causes of corruption.

(5) By Randy (randyl) on 2023-10-25 01:37:36 in reply to 3 [link] [source]

Thanks for the reply.

That doesn't really come as a surprise, does it?

No, not really, but it was the first glimmer of hope that this issue might be reproducible, so I wanted to mention it.

As unsatisfying as this is for both of us, at this point all we can do is keep our ears to the ground and collect experiences like yours to gather more info about possible causes of corruption.

If there is anything else I can do to help, or any information I can collect from users who experience corruption, please let me know.