Multi-connection access to database on USB drive

(1) By Deon Brewis (deonb) on 2022-01-27 01:14:16 [link] [source]

Currently we support storing our database files on USB drives, but over the years we've learned that in order not to corrupt it if the user surprise-removes the drive, we have to do:

PRAGMA MMAP_SIZE=0 PRAGMA locking_mode=EXCLUSIVE PRAGMA JOURNAL_MODE = WAL PRAGMA synchronous = NORMAL

This gives us a relatively reliable database, even if the user unplugs the USB drive while it's busy writing to SQLITE. It sometimes still corrupts on surprise-remove but orders of magnitude less.

However, I would really love to be able to have a second thread read from that USB drive (ACID semantics, not just mutexes). We generally would do this on a non-USB drive by opening a second connection, but the locking_mode EXCLUSIVE prevents that on USB.

I don't quite understand what the combination of EXCLUSIVE + WAL does and what causes it to be able to survive a surprise-removal. Is part of that mechanism maybe available outside of exclusive?

(2) By Keith Medcalf (kmedcalf) on 2022-01-27 04:12:33 in reply to 1 [link] [source]

What Operating System and drive settings?

(3) By Deon Brewis (deonb) on 2022-01-27 08:09:09 in reply to 2 [link] [source]

Windows and macOS.

These are just consumer-grade magnetic USB drives formatted with either ExFat, HFS+ or NTFS.

(4) By Harald Hanche-Olsen (hanche) on 2022-01-27 08:42:06 in reply to 3 [link] [source]

Worse, I have seen reports that cheap USB drives lie; You send the command to flush data to disk, and it reports back that it has done so, before actually doing it. That way lies data corruption and mayhem if the computer crashes or power goes out at the wrong moment. (Admittedly, this information is old and possibly outdated.)

I suppose solid state drives are more reliable in that respect, since there is no reason to reorder writes on them. But I wouldn’t know for sure.

(5) By Simon Slavin (slavin) on 2022-01-27 13:26:12 in reply to 3 [source]

The way Flash actually works is very slow. It takes a long time – many milliseconds – to write to one part of storage, and a SQLite update requires writing to many parts of storage. As a consequence the drive accepts many 'write' commands, responds to each one instantly as 'I have completed this', then does the writing in its own time, in whatever order it finds convenient ("out-of-order writing").

If a Flash drive waited for each write command to be complete before it responded 'that write is complete', you would find the drive unusablly slow. The fact that it takes a long time for writes to be actually complete is the reason why you have to tell the OS "eject drive" and then wait for a while before it tells you it's safe to eject the drive. The OS is waiting for the drive to tell it that all writes are really complete.

There is no way SQLite can be ACID under conditions like this, since the storage system is not only lying to it about which writes are complete, but doesn't even do the writes in the right order.

You may think you've found a pattern of writing, and some SQLite settings, which result in uncorrupted databases, but you've actually just found something that works under one scenario. Use different hardware. Or have another program writing at the same time (even to a different drive). Or have the caches start full instead of empty, or have the caches start empty instead of full. Use a faster or slower Flash drive. Do updates faster or slower. Then pull the Flash drive while it's being written to. Any of these can result in a corrupt database.

Sorry, but consumer-grade Flash drives, hardware failure (e.g. pulling the drive before the OS says its ready) and ACID don't mix. Pick any two.

(6) By Deon Brewis (deonb) on 2022-01-30 01:13:30 in reply to 5 [link] [source]

These aren't flash drives, they're magnetic. (Spinning disks).

(7) By Simon Slavin (slavin) on 2022-01-30 01:38:40 in reply to 6 [link] [source]

So you're unplugging external hard disk drives without warning the computer. Some of what I wrote applies. If they don't have their own power supplies, and get their power from the computer via the same cable they get their data from, a lot more of what I wrote applies.

But if they don't have DIP switches or jumper selections for 'write-before-acknowledge' then you are going to encounter corruption either way. Not only inside a SQLite database file, but also in the drive's file/sector formatting. Sorry.

(8) By Deon Brewis (deonb) on 2022-01-30 01:47:10 in reply to 7 [link] [source]

I understand corruption can still happen. I'm more curious about why it happens significantly less frequently in EXCLUSIVE mode compared to shared modes.

(12) By Keith Medcalf (kmedcalf) on 2022-01-30 02:45:28 in reply to 8 [link] [source]

This does not make any sense at all, unless either the locking (access arbitration) or shared-memory (for the SHM file) is broken on one of the platforms.

I cannot see how setting the locking mode to exclusive would make any difference.

Also, data loss of uncheckpointed transactions would be further minimalized by setting synchronous=full (at the price of flushing and syncing the wal at every transaction commit).

(13) By Simon Slavin (slavin) on 2022-01-30 08:20:57 in reply to 8 [link] [source]

I can't think of a reason for this. I see Keith Metcalf has commented similarly, and he knows more of the internals of SQLite than I do. Perhaps someone from the development team can think of a reason.

Your 'synchronous' setting makes no sense to me too. The logical 'synchronous' setting for unexpected hardware failure would be 'FULL'. Of course, this will slow SQLite down.

(14) By Keith Medcalf (kmedcalf) on 2022-01-30 21:08:50 in reply to 13 [link] [source]

Well, SYNCHORONOUS=FULL -> SYNCHRONOUS=NORMAL when in WAL journal_mode will remove the sync operation against the WAL file at transaction commit (thus speeding up transaction commit) and will instead do that sync at checkpoint (flush WAL prior to copying the WAL pages to the main database) instead.

This means that the database integrity is pretty much maintained "from checkpoint to checkpoint" but not "transaction to transaction". Some committed transactions since the last checkpoint may be lost, but it should not otherwise affect the capability of protecting the main database from corruption.

LOCKING_MODE=EXCLUSIVE, other than the effect it has of holding transaction locks, combined with WAL journalling, is usually meant to allow the opening of a WAL journaled file where shared-memory does not work properly (by avoiding the need to have an SHM file -- since only one connection has exclusive access, there is no need to "share" the WAL index with other processes).

(15) By Max (Maxulite) on 2022-01-31 06:45:10 in reply to 8 [link] [source]

It's interesting, but at least partially I can confirm your observation.

A little background. There was a thread 10 years ago (https://www.mail-archive.com/sqlite-users@mailinglists.sqlite.org/msg58872.html) and during the long-lasting discussion I made a test for removable media investigating what was the cause of the inconsistencies when the media is disconnected.

The test itself is simple. We hava a table

CREATE TABLE Test (id integer primary key autoincrement)

with a one row Id 0

After that the test starts to execute a series of queries (every one transactioned)

UPDATE Test SET Id=Id + 1

In the middle of the series we remove the media. Further we compare two states

The state expected from sqlite point of view that made everything is possible to ensure that n steps were written to disk, but complaining that n+1 step is failed.
Actual state of the data when we reinsert the media. So we expect (id = n) value in the base or some other (id < n) value showing that regardless of the job the sqlite was doing (FlushFileBuffers or similar) something wasn't written in time.

The tests ten years ago confirmed that it was the hardware to blame, that reported the the OS that this non-caching operation was finished when it was actually not. You can read the relevant part of the discussion. But it's not what interesting here (in relation to your original question)

I decided to repeat the tests today with Windows 10 and some bus-powered WD 1200BEV hdd disk with NTFS on it just to make sure nothing changed on the OS level. The test was with non-wal journal_mode=delete mode, locking_mode = NORMAL, with both quick removal and optimize for performance settings in the OS. Every step (update statement) was transactioned. The tests still showed that no matter was sqlite did, the id value on reinserted media was lower that expected, the difference between 50 and 200 (the speed of commits was about 200 per second).

While performing the tests, I made the reinsert part dual. In one case, I copied the db, but no journal. One would expect the db to contain the value from the last attempt (reported as failed) or the previous one (n-1). in the other variant, both main and journal were copied. In this case we expect the previous value. In any case, common sense tells that the difference between these two openings should not be more that 1 (either sqlite is ok with both versions of the main db file or it rolls back from journal in the second one, but no more than one). But in reality the difference was huge, from 100 to 300. So it looked like the main file and journal file were from different generations (if the generation here is a single step of the test).

And here the part about the locking_mode. When I switched the locking_mode to EXCLUSIVE, the difference between reopening with journal-exists and no-journal was gone. The several tests showed absolutely identical values so somehow confirming the same generation of the files.

I suspect that regardless of the mode (wal, non-wal), there's something in EXCLUSIVE mode (probably OS-related) that ensures that if sqlite works with several files (main/journal, main/wal) they stil might be too old due to faulty hardware logic, but at least they're close in time to each other.

But I'm not absolutely sure about my conclusions, so it's better to reevaluate them with some other tests.

(9) By Larry Brasfield (larrybr) on 2022-01-30 01:56:09 in reply to 3 [link] [source]

These are just consumer-grade magnetic USB drives formatted with either ExFat, HFS+ or NTFS.

(Realizing that opinions number similarly to alimentary canal exits ...)

It would be interesting to see how many corruption incidents you see with HFS and ExFAT versus NTFS. The latter is a journaling filesystem, one of a class which is well known to be more robust against interrupted modify operations. I think if you could persuade the users to use either NTFS or HFS+ ^a, they would not suffer so many corruption incidents.

a. HFS+ was a successor to HFS for which journaling and other features were added.

(10) By Deon Brewis (deonb) on 2022-01-30 02:04:06 in reply to 9 [link] [source]

The problem we have is some people want to move the USB drive between Mac and PC, so that leaves ExFAT as the only out-of-the-box supported common option.

(11) By Keith Medcalf (kmedcalf) on 2022-01-30 02:07:02 in reply to 9 [link] [source]

You cannot use both NTFS and "Optimize for Slowness (Quick Removal)" at the same time. NTFS is so highly inefficient that if delayed-write is disabled at the Operating System level then it will not permit you to initialize the disk in a format (such as NTFS) that needs all the performance help that it can get.

NB that the first sentence is inaccurate in that the Operating System will do what it can to dissuade you from putting an NTFS filesystem on a "Quick Remove" drive, but will not prevent you from putting a previously formatted NTFS filesystem into a "Quick Remove" connection. I/O will just be slower than molasses running uphill in a -40 degree Edmonton winter.