How to corrupt - windows

(1) By Bill Wade (wwade) on 2021-03-09 14:00:44 [link] [source]

Assuming that my pragmas are properly set so that sqlite calls FlushFileBuffers() at the appropriate times.

Does the Windows setting "Enable write caching on the device" specifically allow Windows to return from FlushFileBuffers() too soon? Here I'm assuming that cheap hardware didn't lie to Windows in the first place.

Postgres reliability documentation at https://www.postgresql.org/docs/13/wal-reliability.html says:

On Windows, if wal_sync_method is open_datasync (the default), write caching can be disabled by unchecking My Computer\Open\disk drive\Properties\Hardware\Properties\Policies\Enable write caching on the disk. Alternatively, set wal_sync_method to fsync or fsync_writethrough, which prevent write caching.

The Alternatively section implies that if even if write caching is enabled (at the OS level), calls to fsync (and presumably FlushFileBuffers) will send pending writes to the hardware.

I don't see anything in the FlushFileBuffers() documentation that says it is allowed to return early, but I'm no expert in that level of Windows i/o.

(2.1) By Keith Medcalf (kmedcalf) on 2021-03-09 17:49:32 edited from 2.0 in reply to 1 [source]

The answer varies by the actual hardware and the version of Windows. Assuming hardware which works correctly and Windows 10 / Server 2012 Windows has the following checky-box options for DASD devices:

Enable write caching on the device and a subselection Turn off Windows write-cache buffer flushing on the device

These options do EXACTLY and precisely what they say (which was not the case for any prior version of Windows).

So, in Windows, ALL DASD I/O, unless told otherwise when opening the file, is to the Windows Filesystem cache and not directly to the underlying hardware. That is, all I/O is a Windows cache read/write, although you (the programmer) may choose to require a file handle (when opened) to use "direct I/O" and bypass the Windows Filesystem cache. In practice, almost nothing does this. SQLite3 does not request direct I/O either -- it does I/O in the ordinary course through the Windows filesystem cache. You cannot even force this option because if you want to do direct-I/O you have to only do I/O on correct device data boundaries and SQLite3 does not do this, so even if you attempt to set direct-I/O mode it will fail due to mis-alignment of reads and writes.

Windows will "write" data from the Windows filesystem cache to the underlying hardware when it feels like it (it is basically an LRU cache with device idle flushing but no "intelligent" cache management features). The various "FlushFileBuffers" calls force Windows to write dirty cache pages associated with a specific file to the underlying physical storage device.

The Enable write caching on the device allows device writes to return success as soon as the write data has been handed off from the Windows DASD driver to the physical device irrespective of whether or not the data is actually "written" to the device (that is, it allows the "physical device" to "do its own thing" with the data outside of the purview of the Windows I/O layer). If this is disabled (unchecked) Windows attempts to require that the device driver not return success until the data is actually securely and irreversibly written to the storage media -- whether this actually happens is up to the underlying hardware and whether it is lying or telling the truth when it says the write is complete. It is a problem of that hardware and not under anyone's control (well, technically, it is under the control of the person who is buying the hardware to only buy hardware which behaves properly and to allow companies selling lying hardware to go bankrupt).

Effectively, this controls the level of supervision that Windows exercises over data written to the DASD device. If the box is unchecked, then Windows tells the device that it should not report success until success is achieved. If the box is checked, then that means that the device may report success as soon as it has the data in hand, whether or not is has actually been written.

Whether this "does anything" or not is up to the particular hardware that you have purchased. Higher-end hardware will "work properly" and low-end hardware will oftentimes "lie" in order to make the uneducated believe that they are getting Lamborghini like performance from a go-kart.

The sub-option Turn off Windows write-cache buffer flushing on the device is a further command to Windows on whether or not to pass through a FlushFileBuffer operation to the device or whether to run in "fake Lamborghini" mode by never passing though flush operations to the underlying hardware.

If the hardware is working correctly then the write-caching should be enabled (checked) and the turn off flushing disabled (unchecked) for maximum performance. (write-caching on the device is required in order for correctly working devices to do advanced things such as data-phase disconnect, command queueing, elevator seeks, etc).

Whether these checkboxes actually do anything at all is up to the individual device driver and individual hardware.

(3.1) By Keith Medcalf (kmedcalf) on 2021-03-09 18:13:29 edited from 3.0 in reply to 1 [link] [source]

The succinct reply to the original question should be that Flush operations are passed through to the underlying hardware by default, unless disabled. These flush operations are independent of the enabling of device level caching and may be controlled separately if and only if caching is enabled.

This is with the caveat that this is only guaranteed true for correctly working hardware on Windows 10 / Windows Server 2012 or later.

In other words, the original statement about device caching is incorrect and biased towards deprecated versions of Windows or is predicated on the behaviour of badly designed software.

(4) By Bill Wade (wwade) on 2021-03-09 20:19:06 in reply to 3.1 [link] [source]

Thank you Keith.

Those replies are very helpful to me.

(5.2) By Keith Medcalf (kmedcalf) on 2021-03-09 21:06:17 edited from 5.1 in reply to 4 [link] [source]

So the summary behaviour should work as follows:

program I/O is performed between the Windows Filesystem Cache and the user program unless the file is opened for DIRECT I/O
the Windows Filesystem Cache does I/O (using the pager subsystem) to and from the underlying storage device based on its own LRU algorithms
the "Enable Device Cache" controls whether or not Windows supervises "write" operations only to the device cache level (enabled) or requires the device to report actual completion to storage media (disabled)
Flush operations on files (or entire disks) force Windows to flush all associated changed blocks to the underlying device (which may or may not have device caching enabled).
Subsequent to the Flush of the Windows filesystem cache to the device, a Flush operation will be sent to the underlying hardware. If device caching is enabled, then this additional flush can be explicitly disabled. In theory, if device caching is disabled there is no need for this additional device flush because the initial write was not cached, which is why this is only available if device caching is enabled.

Generally speaking this is designed for the optimization of device level caching, or caching disk controllers (eg, high end RAID controllers with their own cache RAM).

If the device cache is disabled, all I/O (from the Windows Filesystem Cache) is supposed to only be posted (succeed) after write to the physical media is completed.

However, when device caching is enabled, then writes are only guaranteed (by Windows) to have passed the data the the underlying device/controller for it to handle. (eg, send the write block to the RAID controller cache memory for it to write). Note that all "modern" hard drives have device cache in them these days, but that this cache is generally not power-safe.

If and only if device write caching is enabled, then Windows will automatically follow-up all Windows cache flushes with a device cache flush, which should result in a supervised (ie, no return until complete) flush of the device cache to physical media. This additional flush can be disabled (for example, if the RAID controller has its own battery backup so that writes to the device cache will not be lost). In theory, if the device cache is disabled there is no need for the flush operation to be sent to the underlying device hardware and as far as I know it is not.

As far as I know, Flush operations are not sent to the device if "device caching" is turned off since writes from the Windows Filesystem cache to the device should not be cached by the device. I think the only exception to this is during shutdown when a flush is sent to all devices that report support for device flushing "just in case".

(6) By Keith Medcalf (kmedcalf) on 2021-03-09 21:10:12 in reply to 5.2 [link] [source]

Also note that this applies to "permanently attached" storage. Hot-plug removable devices (eg, USB drives) by default do not cache writes in the Windows Filesystem Cache unless explicitly enabled.

(7) By Simon Slavin (slavin) on 2021-03-10 13:24:33 in reply to 1 [link] [source]

Wanted to pick up on a statement Keith's responses didn't cover:

Here I'm assuming that cheap hardware didn't lie to Windows in the first place.

The opposite is true. Cheap storage systems have fudges in to appear fast without using expensive fast components. One of these is that they will accept a 'write' instruction and say "Okay, I've got that." to Windows before they've written the new data to storage. This lets the manufacturer quote a fast write time on the specs. If power fails at that point, the new data is lost. And if several 'writes' are buffered, they may be executed out of order. If a DBMS is relying on ACID, it will be broken.

Expensive storage subsystems, sold for use in servers, (adverts say things like "ACID, 24/7, in-order writing" do this properly. That's why they're expensive: they need additional hardware to meet the claims they quote.

In other words, cheap hardware is exactly the hardware which will lie to Windows. Truth costs money.

(8) By Bill Wade (wwade) on 2021-03-10 16:22:41 in reply to 7 [link] [source]

I don't believe my app, running on my customer's hardware, can reasonably determine if that hardware is telling lies.

My "file" is a data historian. Implemented as a 'dumb' file (periodic flushes of the data, with embedded checksums, all writes are "fixed-size-appends" but in a fixed-size circle) having a few recently "written" pages corrupted by a crash is not a terribly big deal. I likely lose more data than that during the time it takes for the machine to reboot and reestablish network communications.

Whatever mechanisms is in place to recover the data lost during the crash, can also recover a few recently-written pages that got corrupted because the OS or hardware lied to me. If the mechanism is "do nothing" it just means that a "few minute outage" loses a few more seconds of data. Some customers may be happy with that because "do nothing" has low up-front cost, and crashes are relatively uncommon.

If I replace my 'dumb' file with an SQLite database, I get many advantages, and I want some of them.

One worry I have is that a few pages of corruption make the file unusable (or more optimistically, unusable until some expensive process involving .restore can complete).

Right now I'm trying to determine what I need to do (or tell my customers to do) to make that corruption less likely, and recovery less expensive.

(9) By Keith Medcalf (kmedcalf) on 2021-03-10 17:10:18 in reply to 8 [link] [source]

How well that works depends on the definition of "periodic" and what the underlying storage device is, and the size of each "periodic" transaction.

So if you are attempting to log a data collection ofa few thousand tags (assuming engineering values here) scanned every second and each tag write is a separate transaction and the underlying storage is spinning rust, you are likely to be very unhappy. (The absolute best transaction rate you can expect on spinning rust is about 1 transaction per approximately 60 revolutions of the spinning rust, assuming no other device traffic and transaction sizes of less than a cluster).

If however you are batching the those tag writes into one transaction per second it is likely that spinning rust can keep up (unless you use a very badly designed database schema) with loads of I/O capacity to spare. I/O rate will be your limiting factor.

Any hardware that you did not buy at BestBuy/Walmart (or equivalent) is likely going to be fine for the task. (That is, do not expect to run something that needs to be reliable on a $200 bargain-basement sale computer -- such machines are optimized for price by using the least cost (and least reliable) parts from the lowest cost supplier).

Also be aware that ACID data requirements can be brutal on cheap USB/flash/ssd storage that is not designed for the task.

I've written software (in Python no less) that scans out a couple hundred engineering values per second from a Matrikon OPC server (and from Experion) and uses SQLite3 for logging without much difficulty using standard PC hardware (as in ~$1000 reasonable quality hardware). Generally the difficulty is on the data collection side, not on the logging side.