SQLite Forum

Network filesystems for sqlite DBs

(1) By Spaced Cowboy (SpacedCowboy) on 2020-09-15 23:13:35

Right, so first off, I understand the basic advice is "Don't do it". I'd like to understand the reasons *why* not to do it, and if any of those reasons could be mitigated in any way.

I'm trying to figure out if we could use an existing enormous iscsi-type shared filesystem to allow a database-per-person (which would allow easy per-person encryption for privacy) using a bunch of sqlite databases, or go with one humungous database server with everyone's data inside it running with a local disk. Both approaches have their ups and downs and to that end I wrote a small test, the guts of which are:

````
    /**************************************************************************\
    |* Loop for the number of times, reading and incrementing the count
    \**************************************************************************/
	for (int i=0; i<count; i++)
		{
		mgr->beginTransaction();
		int64_t value = mgr->intForQuery("SELECT value FROM counter");
		mgr->update("UPDATE counter set value = ?", value +1);
		mgr->commit();
		}
````

which would read and write in a loop (yes, I know this could be one statement, I wanted to make sure reads & writes in a  transaction wouldn't be compromised). The loop is constantly obtaining and releasing the lock on the database (mgr->beginTransaction() calls "BEGIN EXCLUSIVE TRANSACTION"). The idea is to introduce as much contention as possible to see if the transaction locking logic breaks down under stressful circumstances.
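In raw sqlite3 terms, the wrapper calls boil down to roughly the following (a sketch only - `mgr` is our own wrapper class, and error handling is omitted here):

````
/* Sketch of what the wrapper calls above amount to against the raw
** sqlite3 C API.  Illustrative only; error handling omitted. */
#include <sqlite3.h>
#include <stdint.h>

static void one_iteration(sqlite3 *db)
{
    sqlite3_stmt *stmt;
    int64_t value = 0;

    /* mgr->beginTransaction() */
    sqlite3_exec(db, "BEGIN EXCLUSIVE TRANSACTION", 0, 0, 0);

    /* mgr->intForQuery("SELECT value FROM counter") */
    sqlite3_prepare_v2(db, "SELECT value FROM counter", -1, &stmt, 0);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        value = sqlite3_column_int64(stmt, 0);
    sqlite3_finalize(stmt);

    /* mgr->update("UPDATE counter set value = ?", value + 1) */
    sqlite3_prepare_v2(db, "UPDATE counter SET value = ?", -1, &stmt, 0);
    sqlite3_bind_int64(stmt, 1, value + 1);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    /* mgr->commit() */
    sqlite3_exec(db, "COMMIT", 0, 0, 0);
}
````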

After all the updating, the code prints out the final value. The idea is to run more than one instance of this program in parallel and see if the last result in the database is (instances) * (count) - ie: was there any time when a database transaction wasn't honoured ? Obviously I don't care about anything other than the last value, because the intermediate values are going to vary based on which program got more access at any given time.

The results were *kind of* interesting - it did better than I expected over a network... Listing them in increasing performance:

````
Local single NVME SSD:
[@lyonesse DCtest]$ ./RUNME 100000
Connected to resource: /tmp/test.db
Connected to resource: /tmp/test.db

worker  0 value = 199465 after 2089.432570 secs
worker  1 value = 200000 after 2095.206688 secs



NFS, 1G ethernet:
[@lyonesse DCtest]$ ./RUNME 100000
Connected to resource: /archive/tmp/test.db
Connected to resource: /archive/tmp/test.db

worker  1 value = 197756 after 734.510462 secs
worker  0 value = 200000 after 742.014655 secs



iScsi, 1G ethernet:
[@lyonesse DCtest]$ ./RUNME 100000
Connected to resource: /mnt/iscsi/tmp/test.db
Connected to resource: /mnt/iscsi/tmp/test.db

worker  0 value = 175305 after 342.477881 secs
worker  1 value = 200000 after 392.721822 secs



Local 16-disk RAID:
[@lyonesse DCtest]$ ./RUNME 100000
Connected to resource: /raid/tmp/test.db
Connected to resource: /raid/tmp/test.db

worker  1 value = 177115 after 69.417077 secs
worker  0 value = 200000 after 77.233248 secs
````

Points of interest:

1) The final value of the counter after the last worker exited was always what you would hope it to be, even with NFS.

2) iScsi is a lot faster than NFS :)

3) Local (16-way with cache, Areca) RAID is a lot faster than any network, as expected.

4) A single local PCIe-based SSD is nowhere near any of the others.


The only one I'm really concerned with here is (1). The speed hit vs local RAID I'm completely comfortable with: it's an adequate trade-off for the situation, and the actual deployment would be 10G rather than 1G ethernet, so it's likely to do better than that. The situation doesn't change with multiple programs accessing the same database either, for example:

````
[@lyonesse DCtest]$ ./RUNME 100000 5
Connected to resource: /mnt/iscsi/tmp/test.db
Connected to resource: /mnt/iscsi/tmp/test.db
Connected to resource: /mnt/iscsi/tmp/test.db
Connected to resource: /mnt/iscsi/tmp/test.db
Connected to resource: /mnt/iscsi/tmp/test.db

worker  4 value = 444736 after 921.902477 secs
worker  1 value = 468154 after 968.464125 secs
worker  0 value = 485641 after 1005.294065 secs
worker  3 value = 495642 after 1025.872606 secs
worker  2 value = 500000 after 1034.514943 secs
````


Questions are:

a) Is the above the sort of test I ought to be running in order to try and figure out what areas could have issues, or is there some better way to show up any problem ? 

b) Am I just being lucky that there's no impact on the reliability over the network in the above ? If so, is there a good way to make it more challenging and flush out problems ?

c) Given that SQLite is supposed to support multiple threads/processes accessing the same file on a *local* filesystem, if I could create a VFS layer that provided the same semantics but stored to the network, would that be a more reliable route ? (Assuming I'm missing something with the above results :)

d) Or is there something more fundamental that I'm just getting lucky with, meaning the whole idea is a non-starter ? If this is the case, I'd love a pointer to somewhere it's explained, or some insight :)

A last note. I'm coming to this with an open mind - if there's a genuine reason why it can't be done, I'm very happy to know that. I just like to have something real and tangible to point to, to say "that's why"... :)

(2) By anonymous on 2020-09-16 00:16:02 in reply to 1

Check out Warren's post:
<https://www.sqlite.org/forum/forumpost/3570bdd646?t=h>

(3) By Keith Medcalf (kmedcalf) on 2020-09-16 00:18:12 in reply to 1

Firstly, iSCSI is not a filesystem but rather a block transport protocol.  Where is the filesystem located?  It does not matter whether the SCSI cable is a couple of inches or thousands of miles long, nor whether it is parallel or serial or glass fiber; nor does it really matter if the "transport" layer for the SCSI protocol is differential direct signalling, single-ended direct signalling, phase-shift or pulse-modulated signalling, or via some other transport such as ethernet, IP, IPX/SPX, DecNET, GPIB, TCP/IP, UDP/IP, postal mail or via secretaries and airplanes.  iSCSI simply transports blocks between a "storage device" and "a filesystem".

What is important is the filesystem and where it is located and that it works correctly, not the mode by which the filesystem transports blocks to and from the storage device.

Secondly, it is likely that what you are seeing with respect to "RAID" is due to the fact that the "filesystem" placed over top of the RAID controller is either (a) choosing performance over reliability by "cheating", or (b) backed by a professional-grade block-caching controller that costs quite a lot of money, where that high price is what allows "performance" while still remaining "reliable" (this is uncommon except on very high-cost professional stand-alone RAID controllers).

Thirdly, you do not specify that your NFS tests are actually distributed. That is, each one of your "testbeds" must be executing on a separate machine, and each of those must be separate from the machine "serving" the filesystem; otherwise your results are not actually testing distributed multiple-access arbitration against a remote filesystem.

To summarize, the root of the issue for *all* systems which have multiple access to a file located on a "remote" filesystem is the overhead and work that is required to maintain consistency of the "remote view" of the file (and a directory is nothing more than a file containing directory entries, and a filesystem is a directory file of other directory files) amongst multiple accessors located on different machines.  Maintaining this consistency with "full reliability" is extremely expensive and slow.  Because a "remote shared filesystem" is generally used to permit access BY ONE client to only ONE FILE at a time (not to arbitrate multiple access to the same file), most "remote filesystems" implement lots of shortcuts and are ill-prepared to deal with multiple access to the same file -- that is, they assume that the usage pattern will be single access per file, are optimized for that scenario, and are inherently unreliable (or very slow) when they must deal with arbitration of multiple access to the same file (unless that access is read-only).

Consider that when a "local filesystem" has to maintain consistency of view of a single file between multiple accessors, the "turn around" latency for doing so is relatively tiny -- it is local after all -- and often measured in nanoseconds.  But to do the same thing over a network requires "adding" a "network turnaround latency" to and from each distributed node accessing the file for each operation, to achieve the same result.  This is a "hard problem" and most solutions heavily trade "opportunism" for "reliability".
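As a back-of-envelope illustration of that latency argument (every figure below is an assumption chosen purely for illustration, not a measurement):

````
/* Back-of-envelope: how per-transaction lock/sync round trips bound the
** achievable transaction rate.  All latencies are assumed, not measured. */
#include <stdio.h>

int main(void)
{
    const double local_roundtrip  = 5e-6;   /* ~5 us: local lock/sync turnaround  */
    const double remote_roundtrip = 2e-4;   /* ~200 us: assumed 1GbE round trip   */
    const double trips_per_txn    = 5.0;    /* lock, read, write, sync, unlock    */

    printf("local  upper bound: ~%.0f txn/sec\n", 1.0 / (local_roundtrip  * trips_per_txn));
    printf("remote upper bound: ~%.0f txn/sec\n", 1.0 / (remote_roundtrip * trips_per_txn));
    return 0;
}
````

The point being that the ceiling drops by orders of magnitude before reliability has even been considered.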

This is why just about every single database product requires the filesystem to be "local" or, if remote, then "exclusive access", in order to work reliably.

(4.1) By Warren Young (wyoung) on 2020-09-16 02:12:47 edited from 4.0 in reply to 1

> I wrote a small test, the guts of which are:

Why not use one of the ones included with SQLite, so we can all play along?

* [`test/kvtest.c`](https://www.sqlite.org/src/file/test/kvtest.c) from the "[35% Faster Than The Filesystem](https://www.sqlite.org/fasterthanfs.html)" article.

* [`test/speedtest1.c`](https://www.sqlite.org/src/file/test/speedtest1.c)

> `"UPDATE counter set value = ?", value +1`

What realistic scenario does a pure overwrite test simulate?

DB applications are usually read heavy.

Even a data logging application is going to be appending pages to the DB, not continually overwriting the same DB page.

Further, your application is perversely reimplementing AUTOINCREMENT. What possible correlate to a real-world application does this model?

> The idea is to introduce as much contention as possible to see if the transaction locking logic breaks down under stressful circumstances.

Are you suggesting that [SQLite's test regime](https://sqlite.org/testing.html) might be insufficient for ensuring that transactions work as they're supposed to?

> Local single NVME SSD:...2095.206688 secs

This result is likely due to your constant overwriting of a single DB page. If SSD is a sensible choice for a given application, it's because it's read-heavy. A write-heavy DB should not be put on an SSD unless you've done the math and figured out that the [lifetime SSD endurance rating](https://blog.westerndigital.com/ssd-endurance-speeds-feeds-needs/) of the disk will suffice for your application. 1 write per second is a very different application than 10k writes per second.
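If you want to do that math, here is the shape of it (a hedged back-of-envelope; the pages written per transaction and the endurance rating below are pure assumptions for illustration):

````
/* Back-of-envelope: daily write volume of a write-heavy SQLite workload
** versus a drive's TBW endurance rating.  Every number is an assumption. */
#include <stdio.h>

int main(void)
{
    const double page_bytes   = 4096;    /* assumed SQLite page size              */
    const double pages_per_tx = 3;       /* assumed: page + journal + header      */
    const double tx_per_sec   = 10000;   /* the "10k writes per second" case      */
    const double tbw_rating   = 600e12;  /* assumed 600 TBW consumer NVMe rating  */

    double bytes_per_day = page_bytes * pages_per_tx * tx_per_sec * 86400;
    printf("~%.1f TB/day written; rated life ~%.0f days\n",
           bytes_per_day / 1e12, tbw_rating / bytes_per_day);
    return 0;
}
````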

> NFS...742.014655 secs

This should have been a huge red flag to you: NFS is 3x faster than a local filesystem!?

It doesn't pass any kind of smell test you can come up with. Contemplate SATA's 6 Gbit/sec rating, for example, which is the bottleneck they had to invent NVMe to get past. Now you're saying a 1 Gbit/sec NFS link is faster?

> iScsi

You haven't said what filesystem you're running over iSCSI. [You can't share a filesystem over iSCSI without using a clustering filesystem](https://serverfault.com/q/633556/18096), and I'm guessing you aren't. If you were, I'd expect to see the naked clustered filesystem in a separate test.

Unless your usage model is that only a single computer is accessing this DB over iSCSI, I doubt this test is telling you anything useful.

> Areca RAID is a lot faster than any network, as expected.

You say "with caching," which suggests you have [write-back caching](https://www.geeksforgeeks.org/write-through-and-write-back-in-cache/) enabled, as opposed to the safer write-thru mode. That's a highly risky cheat in any application that cares about durable transactions unless you've got battery-backed cache RAM to close the [RAID write hole](https://en.wikipedia.org/wiki/RAID#Atomicity).

Contrast your NVMe drive, which is almost certainly honoring SQLite's fsync calls. The Areca RAID is effectively not doing so, promising it's written the data to disk even when it hasn't. Anything that isn't battery backed will be lost if the system reboots unexpectedly.

> The final value of the counter after the last worker exited was always what you would hope it to be, even with NFS.

Given that fsync is a kernel syscall and NFS is implemented at the kernel level on most OSes these days, I doubt that test shows what you think it does. Until you have 2+ different OS+hardware stacks accessing the same DB, you aren't fully testing ACID over NFS.

> is there a good way to make it more challenging and flush out problems ?

What is your reason for looking into SQLite on network filesystems in the first place?

If it's just so you can put a single box's storage somewhere else, as with NAS or SAN setups, then your test may be approaching something useful, once you address the criticisms above.

If you're trying to get to a point where multiple machines are sharing storage, then your test doesn't show anything useful at all. You need to run it from multiple physical nodes against the same DB to show that.

> if I could create a VFS layer that provided the same semantics but stored to the network, would that be a more reliable route ? 

I think you'd end up reinventing an existing wheel, is what.

> I just like to have something real and tangible to point to, to say "that's why"... :)

Greybeards on the Internet saying "Don't do that" isn't good enough? :)

What, then, is the value of experience?

(5) By Warren Young (wyoung) on 2020-09-16 01:21:24 in reply to 4.0

> Greybeards on the Internet saying "Don't do that" isn't good enough? :)

Actually, I can do even better than that: [Amdahl's Law](https://en.wikipedia.org/wiki/Amdahl%27s_law).

And yes, I'm aware that [argument from authority](https://en.wikipedia.org/wiki/Argument_from_authority) is an *informal* logical fallacy. That doesn't mean authority's wrong, though. If we agree on the authorities, it's actually a pretty good guide.

(Thus COVID, cough, cough.)

(6) By Spaced Cowboy (SpacedCowboy) on 2020-09-16 15:15:19 in reply to 3

Thanks Keith, 

That's the detail level I was after. The filesystem running on the NFS server is [btrfs](https://btrfs.wiki.kernel.org/index.php/Main_Page) btw.

It occurred to me (after posting, of course) that the NFS tests ought to be run from multiple machines to test it properly; they weren't, in fact.

I'm perfectly happy to trade speed for scale. Unless it's truly glacial in nature, the speed of the filesystem isn't going to be the limiting factor. Reliability, of course, is a totally different matter.

(7) By Spaced Cowboy (SpacedCowboy) on 2020-09-16 16:11:45 in reply to 4.1

> Why not use one of the ones included with SQLite, so we can all play along?

Two reasons: (1) it didn't occur to me, and (2) I didn't know about them :)

> What realistic scenario does a pure overwrite test simulate?
> DB applications are usually read heavy.
> Even a data logging application is going to be appending pages to the DB, not continually overwriting the same DB page
> Further, your application is perversely reimplementing AUTOINCREMENT. What possible correlate to a real-world application does this model?

In this case, the database holds a lot of different types of data. Quite a bit of it is stored once and then only read later and I didn't really see that as the problem. We also have a fair amount of persistent state for a client application, state that is constantly being modified to reflect what the user is doing, and which they expect to be preserved if they move away from the native app and use the same account on the web-app. It's obviously not a simple counter, but the flags are being set and reset constantly, and the counter allows me to tell whether the end-result is what I'd expect it to be. 

It seemed like a reasonable analogue of what I thought the database might have problems with doing.

> Are you suggesting that SQLite's test regime might be insufficient for ensuring that transactions work as they're supposed to?

Well, no, under *normal* circumstances, of course not. It did seem, though, that if the database-locking was problematic over a network filesystem, this might be a test that would show that up.

> This should have been a huge red flag to you: NFS is 3x faster than a local filesystem!?

It was, in fact. Lacking any other ideas, I'd come to the same conclusion as you did that it was the write-heavy environment that was the cause. It still seems odd though.

> You haven't said what filesystem you're running over iSCSI

In this case, it's my home setup (covid and the entire world burning means I'm working at home) but if it's possible to make it work, the real system would be running on the existing [scaleio](https://en.wikipedia.org/wiki/Dell_EMC_ScaleIO) setup. iscsi is the closest thing I can get to it for testing out ideas before going through the whole "submit a proposal in triplicate, convene a committee, ..." 

To answer your question, I'm running Btrfs, which isn't a clustered filesystem AFAIK. I'm not actually sure what the "real" system uses, but it pretty much has to be a clustered filesystem for the existing usage.

> You say "with caching," which suggests you have write-back caching enabled, as opposed to the safer write-thru mode. That's a highly risky cheat in any application that cares about durable transactions unless you've got battery-backed cache RAM to close the RAID write hole.

... which it does. And the house has 3 Tesla batteries on it as well. I laugh in the face of power outages. Well, actually I don't. I quietly smile on the inside when the neighbor comes around asking why my lights are still on in a power outage, and offer to throw a cable over the fence to power his fridge.

> What is your reason for looking into SQLite on network filesystems in the first place?

From the original post:

"I'm trying to figure out if we could use an existing enormous iscsi-type shared filesystem to allow a database-per-person (which would allow easy per-person encryption for privacy) using a bunch of sqlite databases, or go with one humungous database server with everyone's data inside it running with a local disk."

The multi-tenant huge-database is the obvious solution, but it doesn't lend itself well to searches across the metadata in a per-user encrypted-data scenario. Encrypting columns of data effectively removes the ability to index things, homomorphic encryption notwithstanding, and not encrypting them leaks privacy.

Conversely, having a bunch of SQLite databases on a shared filesystem accessible to a bunch of nodes makes life a lot easier for the encryption scenario (things like [SqlCipher](https://www.zetetic.net/sqlcipher/) mean the database can be encrypted in toto but still use indexes as if it were not; in my tests on low-data-volume databases, the overhead is only ~5% or so, rising to ~15% on data-heavy tables). Managing a large number of individual databases rather than one large one has its own issues, of course.
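To make that concrete, the per-user model I have in mind looks roughly like this (a sketch assuming a SQLCipher build; the path and key handling are placeholders, and real code would derive and escape the key properly):

````
/* Sketch: one encrypted SQLite database per user (assumes a SQLCipher build).
** The path scheme and key handling below are illustrative placeholders. */
#include <sqlite3.h>
#include <stdio.h>

static int open_user_db(const char *path, const char *user_key, sqlite3 **db)
{
    if (sqlite3_open_v2(path, db,
                        SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE, 0) != SQLITE_OK)
        return -1;

    /* SQLCipher keys the database with PRAGMA key before any other statement.
    ** (Naive quoting here -- real code must escape or hex-key this.) */
    char sql[256];
    snprintf(sql, sizeof(sql), "PRAGMA key = '%s';", user_key);
    if (sqlite3_exec(*db, sql, 0, 0, 0) != SQLITE_OK)
        return -1;

    /* From here on, queries and indexes behave as on an unencrypted database. */
    return 0;
}
````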

So what I'm trying to do here is do the groundwork to see what's even feasible, after which we can move onto the proposal stage.

There are a few things we can do at ingress - one of which would be to divert all traffic for a given user account to a specific node, thus giving us single-node access to any given database - and it sounds like this might be feasible from your and Keith's responses.

The "I want it all" approach would be the multiple writer scenario, where read and write requests (the client app can open multiple simultaneous connections to the service) would come in over different load-balanced IP's, thus giving us multiple-writer requirements. Again, from reading both the responses, I need to re-run those tests from different machines (which did occur to me, but only after posting).

> Greybeards on the Internet saying "Don't do that" isn't good enough? :)

As a greybeard myself, albeit in a slightly different sphere, I wish that were true. Unfortunately, when I'm challenged about why we want to do it *this* way, it's best to have an answer somewhat more authoritative than "well, some bloke on the internet, who *sounded* like he knew what he was talking about, said X" :)

I do appreciate the time taken to answer the question, thank you :)

(8) By Warren Young (wyoung) on 2020-09-16 16:31:41 in reply to 7

> Quite a bit of it is stored once and then only read later and I didn't really see that as the problem

That will perform much better on an SSD than your test predicts.

> We also have a fair amount of persistent state for a client application, state that is constantly being modified to reflect what the user is doing

I can tell you from experience that you want to batch that up and save many state changes all at once. You might have an event that fires once a second which gathers all of the "dirty" state records and shoves them into a single transaction. Committing that transaction will carry a *lot* less overhead than changing one POD value (int, char, bool, string...) per DB hit.
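Something along these lines, for instance (a sketch only; the table, columns and in-memory dirty list are assumptions, not your schema):

````
/* Sketch: flush many small state changes in a single transaction, e.g. from
** a once-per-second timer.  Table and struct below are assumed placeholders. */
#include <sqlite3.h>

struct dirty_state { const char *key; const char *value; };

static void flush_dirty(sqlite3 *db, const struct dirty_state *dirty, int n)
{
    sqlite3_stmt *stmt;

    sqlite3_exec(db, "BEGIN IMMEDIATE", 0, 0, 0);
    sqlite3_prepare_v2(db,
        "INSERT INTO app_state(key, value) VALUES(?,?) "
        "ON CONFLICT(key) DO UPDATE SET value = excluded.value", -1, &stmt, 0);

    for (int i = 0; i < n; i++) {
        sqlite3_bind_text(stmt, 1, dirty[i].key,   -1, SQLITE_STATIC);
        sqlite3_bind_text(stmt, 2, dirty[i].value, -1, SQLITE_STATIC);
        sqlite3_step(stmt);
        sqlite3_reset(stmt);
        sqlite3_clear_bindings(stmt);
    }

    sqlite3_finalize(stmt);
    sqlite3_exec(db, "COMMIT", 0, 0, 0);   /* one fsync for the whole batch */
}
````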

> they expect to be preserved if they move away from the native app and use the same account on the web-app.

That's where my "1 second" value above comes from: long enough that you're likely to get multiple users' updates in each pass, yet short enough that there's no way a user is going to switch back and forth fast enough to notice a discrepancy.

You might push it to 5-15 seconds, depending on testing.

> if the database-locking was problematic over a network filesystem, this might be a test that would show that up.

Only if you run it from multiple nodes. Two processes running their requests through a single node's NFS implementation is going to make that single NFS stack honor locking and such for both processes. It isn't until you force coordination of locking through the NFS *server* on the other end of the network pipe that you test the actual issue here, the one [the FAQ warns against](https://sqlite.org/faq.html#q5).

(drh is one of those graybeards I was speaking of. Why are you second-guessing him on this?)

> It still seems odd though.

Informed guess: you aren't testing against NVMe on the server side of the NFS connection.

It's an apples-to-zebras test to compare NFS vs NVMe unless what you actually mean is NVMe via NFS vs local NVMe. You get sent off into the weeds when you try to reason from your observation that the apples lack stripes and that the zebras, while they *do have* stripes, do not exhibit the expected deep red and pale yellow stripe pattern.

> covid and the entire world burning means I'm working at home

There are several organizations that will happily rent you a Gluster or Ceph cluster for an hour or two.

(9) By Spaced Cowboy (SpacedCowboy) on 2020-09-16 17:07:01 in reply to 8

> That will perform much better on an SSD than your test predicts

Yep. The test wasn't really about the SSD performance though. 

> I can tell you from experience that you want to batch that up and save many state changes all at once

So there's another wrinkle. The current system runs with a REDIS database that manages all the flags for the app, and as I said, the app will open multiple connections to the service at once, which means it gets different IP addresses on the load-balancer and so different connections to the REDIS database. If the app round-robins the commands it sends, and command N sets state to X while command N+1 depends on that state, we have a problem if we're batching things.

Performance isn't something I'm overly concerned with, at least compared to reliability and correctness. Batching is a good idea in general, don't get me wrong, but specifically here I think it could be a problem. I don't get to change the way the client application works apart from under *really* difficult circumstances.

> Only if you run it from multiple nodes

Understood, and I intend to rectify that. It will be interesting to see if the results over the network are even vaguely similar.

> Informed guess: you aren't testing against NVMe on the server side of the NFS connection

Possibly. The NFS server in question is an Asustor AS6510T, with 10 x 16TB drives, 32 GB of RAM and 2TB of NVMe SSDs as cache to the array. The SSDs on the device are specifically there to cache the RAID (they don't contribute to the storage volume), but it's entirely possible it is using the hierarchy of RAM <-> SSD <-> HDD, given that it's a special-purpose device. And yes, because I'm paranoid, it does have its own UPS*.

> There are several organizations that will happily rent you a Gluster or Ceph cluster for an hour or two

The last time I tried to set up an AWS account for a quick test, it cost me ~$500 or so. I'd not noticed the various data-transfer fees, and when I *thought* I'd closed everything down, I'd missed something (can't remember what, now) so I paid a month's usage of that. These days I'm very leery of setting up things on time-and-data-based services. 

The suggestion has merit, if I can persuade someone else to foot any bills :)


[*] I have a bunch of UPSs that I had before the whole-house battery was installed. I deploy them judiciously.

(10.1) By Spaced Cowboy (SpacedCowboy) on 2020-09-17 17:12:49 edited from 10.0 in reply to 9

So, just to circle back, since I've spent the best part of today getting a clustered-filesystem up and running on my little raspberry Pi [kubernetes](https://0x0000ff.uk/imgs/pi-cluster/from-front.jpg) [cluster](https://0x0000ff.uk/imgs/pi-cluster/from-back.jpg)...

Here's how standard NFS did, running using Btrfs on the Asustor AS6510T, served out to three of the workers on the cluster:

````
@k8-w1:~/dbio$ dst/bin/dctest -u /archive/tmp/test.db -i 1 -c -n 10000
Connected to resource: /archive/tmp/test.db
worker  1 value = 29472 after 698.149864 secs

@k8-w2:~/dbio$ dst/bin/dctest -u /archive/tmp/test.db -i 2 -n 10000
Connected to resource: /archive/tmp/test.db
worker  2 value = 29759 after 703.140837 secs

@k8-w3:~/dbio$ dst/bin/dctest -u /archive/tmp/test.db -i 3 -n 10000
Connected to resource: /archive/tmp/test.db
worker  3 value = 30000 after 704.651374 secs
````

I then set up the iscsi target on the AS6510T to be an [ocfs2](https://oss.oracle.com/projects/ocfs2/) filesystem - Oracle clustering filesystem v2. I'll spare anyone reading this the tortuous details, but once it was up and running it was rock solid. Configuring the thing isn't exactly straightforward, though.

Running the same program targeting the iscsi mount-point resulted in core-dumps pretty much as soon as 2 machines were accessing the same file, which wasn't a great start. It seems that with the kernel personality, only flock() locking is supported, not posix locks, which is what SQLite uses by default. 

So I reconfigured the SQLite amalgamation with -DSQLITE_ENABLE_LOCKING_STYLE, fixed the compile-time errors with missing #includes, and tried again. This time the program ran but told me it couldn't open the database file when I tried passing the "unix-flock" vfs string to sqlite3_open_v2(). 
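For anyone following along, there are two ways of asking for a specific VFS, and both assume a build where the named VFS is actually registered (e.g. -DSQLITE_ENABLE_LOCKING_STYLE for "unix-flock"); the paths here are just my test paths:

````
/* Sketch: selecting a VFS either via the fourth argument to sqlite3_open_v2()
** or via a vfs= URI parameter.  Assumes the named VFS is registered. */
#include <sqlite3.h>

static int open_on_iscsi(sqlite3 **db)
{
    /* Option 1: name the VFS directly in the fourth argument. */
    int rc = sqlite3_open_v2("/iscsi/tmp/test.db", db,
                             SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE,
                             "unix-flock");
    if (rc == SQLITE_OK)
        return rc;

    /* Option 2: a URI filename with a vfs= query parameter. */
    return sqlite3_open_v2("file:/iscsi/tmp/test.db?vfs=unix-dotfile", db,
                           SQLITE_OPEN_READWRITE | SQLITE_OPEN_CREATE |
                           SQLITE_OPEN_URI, 0);
}
````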

A little frustrated, I tried other vfs locking implementations, and "unix-dotfile" worked, so I could get the below comparison versus NFS. I daresay it's not as efficient as flock() would have been but for my purposes it works well enough:

````
@k8-w1:~/dbio$ dst/bin/dctest -u '/iscsi/tmp/test.db?vfs=unix-dotfile' -i 1 -c -n 10000
Connected to resource: sqlite:/iscsi/tmp/test.db?vfs=unix-dotfile
worker  1 value = 10000 after 149.560367 secs

@k8-w3:~/dbio$ dst/bin/dctest -u '/iscsi/tmp/test.db?vfs=unix-dotfile' -i 3 -n 10000
Connected to resource: sqlite:/iscsi/tmp/test.db?vfs=unix-dotfile
worker  3 value = 24595 after 455.827016 secs

@k8-w2:~/dbio$ dst/bin/dctest -u '/iscsi/tmp/test.db?vfs=unix-dotfile' -i 2 -n 10000
Connected to resource: sqlite:/iscsi/tmp/test.db?vfs=unix-dotfile
worker  2 value = 30000 after 531.742586 secs
````

Points of note:

1) Both NFS and the clustered FS still maintain the locking semantics, and I still get the correct final result.

2) For NFS, there is clearly a good spread of the locking/unlocking across the various clients, because the very first worker to return is almost at the end of the count. This seems pretty much ideal, in fact.

3) Something worked a bit differently (and it could be the dot file approach) for the clustered filesystem, because the first worker got exclusive access until it had finished, and then workers 2 & 3 fought it out between them.

I did look at using the programs that Warren mentioned (kvtest.c, speedtest1.c) but they didn't seem to be what I want to look at. kvtest seemed focussed on blobs (and I don't really have those as part of the usage model) and speedtest1 looked to just be a time-test, not a correctness test. I do, however, intend to run one of the test-harnesses - th3 seems like a good candidate.

If I can figure out why "unix-flock" isn't working for me, or perhaps replace the kernel personality of ocfs2 with a userspace one that supports posix locks, I'll try the above tests again and see if there's any speed increase - it's not exactly a priority right now though.

I am still kind of encouraged that both the methods were giving me the result I'd hoped for, which presumably means the NFS locking is implemented correctly, or at least sufficiently correctly for this test. We'll see how it works on the test harness.

[edit]

So I got the unix-flock VFS to work, basically by patching the changes [here](http://sqlite.1065341.n5.nabble.com/sqlite-users-DSQLITE-ENABLE-LOCKING-STYLE-1-doesn-t-work-on-Linux-Cygwin-td70789.html) into my amalgamation source. Once patched, it compiled without error on the Pis running Ubuntu 20.04 LTS. I guess -DSQLITE_ENABLE_LOCKING_STYLE isn't normally required for Linux since most Linux filesystems will offer posix locking.

I also found [an ocfs2 lock-test](https://github.com/ceharris/ocfs2-lock-validation) command-line tool to verify that the ocfs2 filesystem did support flock() but not posix locks.

The results using flock() rather than dot-file locking were in fact speedier:

````
flock() iSCSI

@k8-w1:~/dbio$ dst/bin/dctest -u 'sqlite:/iscsi/tmp/test.db?vfs=unix-flock' -i 1 -c -n 10000
Connected to resource: sqlite:/iscsi/tmp/test.db?vfs=unix-flock
worker  1 value = 15671 after 144.488417 secs

@k8-w2:~/dbio$ dst/bin/dctest -u 'sqlite:/iscsi/tmp/test.db?vfs=unix-flock' -i 2 -n 10000
Connected to resource: sqlite:/iscsi/tmp/test.db?vfs=unix-flock
worker  2 value = 29791 after 306.061729 secs

@k8-w3:~/dbio$ dst/bin/dctest -u 'sqlite:/iscsi/tmp/test.db?vfs=unix-flock' -i 3 -n 10000
Connected to resource: sqlite:/iscsi/tmp/test.db?vfs=unix-flock
worker  3 value = 30000 after 307.713789 secs
````

... although I'm still seeing the same 'first one wins then the others fight it out' behaviour, I am still getting the correct answer at the end. 

To Warren's point earlier about common code, I noticed there's an 'mptest' subdirectory in the sqlite source code, which can be used to test out parallel access to a database. I'll be giving that a go next.

(11) By Rowan Worth (sqweek) on 2020-09-21 07:38:31 in reply to 1

> The loop is constantly obtaining and releasing the lock on the database (mgr->beginTransaction() calls "BEGIN EXCLUSIVE TRANSACTION"). The idea is to introduce as much contention as possible to see if the transaction locking logic breaks down under stressful circumstances.

Using BEGIN EXCLUSIVE bypasses half of sqlite's locking protocol - a more stressful test would be to use regular BEGIN. You also probably have to adapt your code to do that, as (a) sometimes the UPDATE will get an SQLITE_BUSY error, in which case you'd need to ROLLBACK and retry the transaction to maintain the correct counter value, and (b) sometimes the COMMIT will get an SQLITE_BUSY error, in which case you can either ROLLBACK as above or keep trying the COMMIT (which will succeed once the readers relinquish their locks).
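In sketch form it looks something like this (error handling condensed; the SELECT can also return SQLITE_BUSY under contention, which you'd handle the same way):

````
/* Sketch: one counter increment using a plain BEGIN, with the SQLITE_BUSY
** handling described above.  Error handling is condensed for brevity. */
#include <sqlite3.h>
#include <stdint.h>

static void increment_once(sqlite3 *db)
{
    for (;;) {
        sqlite3_stmt *stmt;
        int64_t value = 0;

        sqlite3_exec(db, "BEGIN", 0, 0, 0);

        sqlite3_prepare_v2(db, "SELECT value FROM counter", -1, &stmt, 0);
        if (sqlite3_step(stmt) == SQLITE_ROW)
            value = sqlite3_column_int64(stmt, 0);
        sqlite3_finalize(stmt);

        sqlite3_prepare_v2(db, "UPDATE counter SET value = ?", -1, &stmt, 0);
        sqlite3_bind_int64(stmt, 1, value + 1);
        int rc = sqlite3_step(stmt);
        sqlite3_finalize(stmt);
        if (rc == SQLITE_BUSY) {                   /* case (a): writer lock unavailable */
            sqlite3_exec(db, "ROLLBACK", 0, 0, 0);
            continue;                              /* must re-read: counter may have moved */
        }

        if (sqlite3_exec(db, "COMMIT", 0, 0, 0) == SQLITE_BUSY) {
            sqlite3_exec(db, "ROLLBACK", 0, 0, 0); /* case (b): or keep retrying COMMIT */
            continue;
        }
        return;                                    /* committed */
    }
}
````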

However note the "keep trying" strategy is affected by [a horrific bug](https://www.sqlite.org/src/tktview/810dc8038872e212361f31daf83ebf32484f3f9f) in sqlite versions 3.17.0 to... 3.32.2? (can't tell from the ticket exactly what version it is fixed in)

(12) By Spaced Cowboy (SpacedCowboy) on 2020-09-21 16:05:29 in reply to 11

This would be worth following up in the NFS case, or if I get around to configuring the clustering filesystem with a userspace personality that understands posix locks, but I don't think it has much impact on the flock() call, judging by the comments in the code:

````
************************** Begin flock Locking ********************************
**
** Use the flock() system call to do file locking.
**
** flock() locking is like dot-file locking in that the various
** fine-grain locking levels supported by SQLite are collapsed into
** a single exclusive lock.  In other words, SHARED, RESERVED, and
** PENDING locks are the same thing as an EXCLUSIVE lock.  SQLite
** still works when you do this, but concurrency is reduced since
** only a single process can be reading the database at a time.
````

I had to [patch](http://sqlite.1065341.n5.nabble.com/sqlite-users-DSQLITE-ENABLE-LOCKING-STYLE-1-doesn-t-work-on-Linux-Cygwin-td70789.html) the sqlite code to get flock() to work at all under Linux, but once enabled it did make the critical difference (crashing programs now worked). 

I'm not personally too concerned with the locking style in use, as long as it's functional and correct. The queries in use are highly unlikely to be time-consuming, so taking an EXCLUSIVE lock (while reducing concurrency) is likely to be massively outweighed by other processing and network costs in the system as a whole.

Actually, since you mention the bug, I just checked my SQLite amalgamation version number, and it's 3.31.0, so:

- It's possible the flock() call has since been patched properly, in which case disregard the above patch comments.
- I would apparently be vulnerable to that bug you mention (gulp: COMMIT not committing...).
- It seems it's well past time to re-do that amalgamation...