memory vs mmap

(1) By Paul Edwards (kerravon) on 2021-05-17 00:01:56 [link]

I was surprised to find that the ":memory:"
option of Sqlite is twice as fast than a
named file with a large pragma mmap so that
mmap() is used.

Everything should be completely in memory
either way, so it comes down to Sqlite call
overhead.

The ":memory:" database is populated using a
"restore from" to copy the disk file into
memory.

I would like the large read-only database to
be shared by multiple processes.

The main application is in Java and using
sqlite-jdbc.

The documentation says that in-memory databases
can only be shared by a single process:

https://www.sqlite.org/inmemorydb.html

What is the reason for that restriction, and
what is the technical reason that in-memory
is so much faster than mmap? And is there
any alternative to get closer to in-memory
performance while having the database shared
by multiple processes?

Thanks. Paul.

(2) By Keith Medcalf (kmedcalf) on 2021-05-17 01:12:37 in reply to 1

A ":memory:" database is EXCLUSIVELY owned by a SINGLE CONNECTION.  Of course it is blazingly fast when compared to a "shared" database which may have contention between multiple connections, and which has to perform I/O operations to extremely slow permanent storage.

There is a "continuum" of performance between an exclusive ":memory:" database and a "shared access" database which resides on permanent storage of some sort.

That is to say that a "shared access" ":memory:" database is SLOWER than an "exclusive access" ":memory:" database by the overhead imposed by having to manage the "shared" access.

Similarly, an exclusive access database living on persistent storage is "faster" that a shared access database living on persistent storage.

mmap does not speed up I/O.  It only eliminates double-buffering.

You can always put a database on a RAMDISK or /dev/shm on *nix-like OSes.

(3) By Paul Edwards (kerravon) on 2021-05-17 02:32:07 in reply to 2 [link]

Hi Keith. Thanks for your thoughts.

The use scenario is a machine with sufficient memory to hold the entire database in memory, but lots of processes accessing it, and insufficient memory if all of them expect a private copy of the read-only database.

I expected mmap() to load the entire database into virtual memory, and I didn't think it was important whether that happened immediately at startup or over time. If that is an issue, maybe what I need is an Sqlite modification that after the mmap it accesses the entire file to ensure it has all been loaded into memory upfront.

Note that the load time of copying the database from disk to memory, even if it needed to be done by every process, is not a problem. Fast startup is not required.

Are you suggesting a ramdisk would be faster, even if it goes through the same mmap() process? (ignoring the small once-off cost of loading from disk versus loading from RAM).

I think I might have stuffed up in my original message - the performance difference is running the actual long-running application, not reading the 1.6 GB database into memory.

I'm not familiar with Sqlite internals, so I assume the performance difference is being caused by the internal XFetch() or whatever calls being slower than whatever ":memory:" is using. I actually assumed that both mmap() and memory would make the same XFetch() calls so there would be no difference in speed, but that is not the case for some reason. That's what I would like to understand.

I'd also like to understand the technical restriction preventing :memory: databases from being shared across processes, at least for read-only. It's all the exact same data.

Thanks.

(4) By Keith Medcalf (kmedcalf) on 2021-05-17 02:53:11 in reply to 3 [link]

Note that a recent patch allows "memdb" databases to be shared by multiple connections (not shared-cache) in the same process.  

See <https://www.sqlite.org/src/info/533fffc4a39b01c3>

Presumably this will, after testing, be included in the next release.

To share a memory database across multiple processes, you need to put it in a shared-memory filesystem (ramdisk or /dev/shm for example).

(5) By Paul Edwards (kerravon) on 2021-05-17 03:14:58 in reply to 4 [link]

Hi Keith.

In my use case, only one copy of the database is needed by a single process, so sharing at that level doesn't help me.

But could you please elaborate on the comment about sharing a memory database across multiple processes using a ramdisk. A file on a ramdisk will have a particular name, and therefore it will be opened by that name, not :memory:.

And therefore the fastest way of accessing the ramdisk database is to use "pragma mmap", correct? All processes will point to that virtual memory address of that file on ramdisk which is already at a particular virtual address anyway.

If mmap() and xFetch() are being used anyway, surely there is only marginal benefit in moving the database from hard disk to ramdisk? For a process that is expected to run for maybe 30 minutes, accessing a 1.6 GB database. Alongside other processes doing the same thing to the exact same database, read-only.

Thanks. Paul.

(6) By Keith Medcalf (kmedcalf) on 2021-05-17 06:43:00 in reply to 5 [link]

> But could you please elaborate on the comment about sharing a memory database across multiple processes using a ramdisk. A file on a ramdisk will have a particular name, and therefore it will be opened by that name, not :memory:.

Correct.  A database on a RAMDISK is no different from a database on any other filesystem.

 > And therefore the fastest way of accessing the ramdisk database is to use "pragma mmap", correct? All processes will point to that virtual memory address of that file on ramdisk which is already at a particular virtual address anyway.

This does not follow because it assumes that it is not a standard block filesystem which is usually not the case.  There is no difference between a file on a filesystem (given the same filesystem) where the backing block store is memory, spinning rust, network accessed remote blocks of some kind, or sheets of paper stored in the Library of Congress.

 > If mmap() and xFetch() are being used anyway, surely there is only marginal benefit in moving the database from hard disk to ramdisk? For a process that is expected to run for maybe 30 minutes, accessing a 1.6 GB database. Alongside other processes doing the same thing to the exact same database, read-only.

Using MMAP rather than normal file I/O saves the difference between the time it would take to copy the block from kernel space to user space less the time it takes for the kernel to set up the page mapping.  Reading the data from the underlying filesystem still takes however long it takes.

MMAP is not necessarily any faster than normal I/O.

(7) By Paul Edwards (kerravon) on 2021-05-17 07:34:11 in reply to 6 [link]

Hi Keith.

Let me rephrase my question.

Assuming I have a file on ramdisk on a system with plenty of memory, so the file is already in physical memory (maybe a copy to /dev/null was done to ensure it was freshly-accessed). The file (database) is only 1.6 GB in size, and read-only.

I now have three ways of getting Sqlite to access that database:

1. Normal disk operations.

2. Pragma mmap so that Sqlite using virtual storage operations instead of file i/o operations.

3. Open a :memory: database and do a "restore from".

Forget about the initial overhead of doing (3). Initial overhead is not a problem. It's only the main application execution that I care about.

Would you expect (2) and (3) to produce identical runtimes, since they're both going through the exact same xFetch(), both accessing data that is, and always has been, in physical memory?

If the answer is "yes, they will be identical" then my next question is "would you expect a disk file overhead, still on (2) to add no extra time than what would be seen in the upfront overhead of (3)?".

If the answer is "yes, it will be near-identical", then that's what I thought too, but I was wrong. The long-running job (ie forget initial overhead) is taking twice as long with (2) compared to (3). (1) is 2.5 times longer than (3). I cannot explain this result.

Thanks. Paul.

(8) By Warren Young (wyoung) on 2021-05-17 10:58:59 in reply to 7 [link]

You're assuming that all memory is equally fast. RAM mapped into the local process by `malloc()` is *not necessarily* as fast as memory that was previously allocated only to the OS's buffer cache and is now `mmap()`ed into your process's memory space.

The first situation requires only a single user-to-kernel space transition per `malloc()` call at most, possibly less because `malloc()` implementations call [`brk(2)`](https://linux.die.net/man/2/brk) less often than they simply assign chunks from a previously allocated slab and return without calling into the kernel.

While it is possible that a given kernel could literally map every page belonging to a file into your process's RAM on the initial `mmap()` call, there's no actual requirement that it do so. It *could* simply take your arguments, remember your requirements, and then page chunks of that file's buffer cache space into your process as necessary. Page faults and user-to-kernel space transitions aren't free.

Even if it did map everything into your process's virtual address space as a single unbroken buffer, as you're apparently assuming, it would almost certainly do so in a fragmented way as compared to the `malloc()` based method used by SQLite's `:memory:` DB option. That's going to involve the CPU's [TLB](https://en.wikipedia.org/wiki/Translation_lookaside_buffer) mechanism, which is also very much not free to use.

You'd have to dig into your OS's implementation of `mmap()` to find out how it actually operates. Beware: you'll probably find it one of the hairiest parts of the kernel. Memory management on modern systems is *complicated*.

(9) By Paul Edwards (kerravon) on 2021-05-17 12:47:11 in reply to 8 [link]

Hi Warren.

Thanks for the technical info.

In that case, would you expect a file in ramdisk, perhaps just after a copy to /dev/null, to be near-identical speed to the malloc() because it goes through identical Sqlite code and accesses identical in-physical-RAM memory?

Thanks. Paul.

(10.1) By Stephan Beal (stephan) on 2021-05-17 12:52:27 edited from 10.0 in reply to 9 [link]

> In that case, would you expect a file in ramdisk, perhaps just after a copy to /dev/null, to be near-identical speed to the malloc() because it goes through identical Sqlite code and accesses identical in-physical-RAM memory?

Not in the slightest: a db on a RAM disk is, for sqlite3's purposes, on external storage (and sqlite has no way of knowing otherwise), so it's going through the filesystem-based API. The in-memory VFS is the fastest you're going to get, hands down, no comparison.

(12) By Paul Edwards (kerravon) on 2021-05-17 14:49:09 in reply to 10.1 [link]

Sorry, I didn't spell it out properly.

Let me try again.

Would you expect a file in ramdisk, perhaps just after a copy to /dev/null, WHEN OPENED WITH PRAGMA MMAP, to be near-identical speed to the malloc() (ie :memory: followed by "restore from") because it goes through identical Sqlite code and accesses identical in-physical-RAM memory?

Note - this is a question about Sqlite - :memory: vs mmap pointer to file ON RAMDISK. I think it is on-topic, as it is about Sqlite internals. I want to confirm the identical codepath. ie identical xFetch or whatever calls.

If Sqlite is identical in all its calls, then yes, the only thing left is to find out if there is a way to speed up mmap at the OS level (which is then off-topic and I will pursue elsewhere).

(13) By Stephan Beal (stephan) on 2021-05-17 15:18:35 in reply to 12 [link]

> Would you expect a file in ramdisk, perhaps just after a copy to /dev/null, WHEN OPENED WITH PRAGMA MMAP, to be near-identical speed to the malloc() (ie :memory: followed by "restore from") 

No...

> because it goes through identical Sqlite code and accesses identical in-physical-RAM memory?

Because that's not my expectation of the API. However...

> Note - this is a question about Sqlite - :memory: vs mmap pointer to file ON RAMDISK. I think it is on-topic, as it is about Sqlite internals. I want to confirm the identical codepath. ie identical xFetch or whatever calls.

We're going to have to wait on someone who's familiar with what sqlite *actually* does when the mmap pragma is enabled to enlighten us, but my outsider's guess is that the code paths do not become identical once mmap is enabled.

Even if the paths do become identical: as Keith pointed out, mmap does not necessarily eliminate I/O. OS-level file content caching is a "black box" feature applications cannot rely on. It might work as you intend on any given day on any given machine and it might not. That is two or three layers removed from where sqlite has any influence.

(14) By Paul Edwards (kerravon) on 2021-05-18 00:00:11 in reply to 13 [link]

I have put some printfs into sqlite 3.32.3 and found that when mmap() is being used, there are numerous calls to sqlite30sFetch.

But when :memory: is being used, there are none.

So before getting into the intricacies of kernel mmap implementations, I'd first like to get Sqlite pragma mmap to go through the :memory: code path.

After all, at that level, the logic is identical, isn't it? Regardless of how the kernel has implemented mmap(), the fact is that mmap() will return a virtual storage pointer. ie exactly the same thing that :memory: will do. Or am I missing something?

Perhaps there is an Sqlite pragma to make Sqlite using the :memory: logic, even for a mmap file? If so, that's what I'd like to try next. Any ideas?

It would seem to me that this is something that would be useful to everyone. You can potentially run your application twice as fast.

Thanks. Paul.

(15) By Scott Robison (casaderobison) on 2021-05-18 00:26:22 in reply to 14 [link]

As has been written, there are many ways the mmap can be implemented. One model might be:

Allocate one page of physical memory to the mapped file object. Any time you access a different address in the mapped file object, it will write that page back to disk and load a new page from disk to physical memory.

Another might be:

Allocate one page of physical memory for each page of the mapped file object. Do not preload any data. As access is made to each page of the mapped file, a page fault is generated that will read the data from the file to physical memory.

These will both have radically different performance profiles. The first will be thrashing that one page repeatedly (assuming one is accessing random locations within the mapped file object). The other will "quickly" allocate memory, but it will slow down reading random pages from the file as one continues to access random locations.

Ultimately, a bespoke algorithm that knows a lot about how the data will be accessed will potentially be more efficient than one that must handle generic workflows.

We don't know exactly how mmap is implemented, and it has to work regardless of page visitation order.

In the case of creating a memory database, it will allocate enough space for all the pages of the database and never need to fault a page to load it from disk (except in the case of swapping, perhaps, if too much memory has been allocated).

In no way can you ever assume that accessing malloced pages is ever going to be exactly the same as mmap, as there is a lot more at play in creating memory map objects.

(17) By Paul Edwards (kerravon) on 2021-05-18 00:34:37 in reply to 15 [link]

Hi Scott.

I'm not questioning (at the moment, anyway) how mmap() has been implemented by the OS.

All I want is to force Sqlite to use the :memory: codepath. I'm happy to provide an extra pragma to say "switch to :memory: codepath".

Is there some technical barrier to having such a thing?

The :memory: codepath already exists.

mmap() has already provided a pointer, the same as malloc() does.

All I want to do (at the moment) is marry up to the two.

If there's no existing pragma to do that, and there is no technical barrier, what would be involved in making such a change? ie approx how many lines of code?

Thanks. Paul.

(19) By Scott Robison (casaderobison) on 2021-05-18 01:14:56 in reply to 17 [link]

I don't know enough about the implementation detail, but I was trying to show you a big reason why an identical code path in SQLite accessing "simple raw pointers" could have wildly varying execution speeds due to the decisions made in implementing that underlying mapping object.

It is my experience that an mmap using application will typically create one big mapping object for an entire file, then any access to that will access the underlying memory object. However, I believe that due to the design decisions when mmap was first added to SQLite, it is only used for reading (not writing) data (which is your use case) as long as the file was under some certain size. Subsequently, I seem to recall that problems were discovered with one or more platforms mmap implementations and it was "deprecated" (that might not be the right word, but it seems right).

The page cache may perform one big allocation then access individual pages within that, but there is no requirement that the pages be continuous. A mmap object will necessarily be a contiguous chunk of the file.

So effectively you have one big mmap object that represents a sequential range of pages of the file, or you have a page cache which represents many potentially random pages in many page size chunks. This means that the method of accessing a memory DB will be of necessity very different than the code that accesses a mmap version, even though they both use pointers.

(20) By Paul Edwards (kerravon) on 2021-05-18 02:05:42 in reply to 19 [link]

Hi Scott.

When you say "page cache" - I assume you are talking about Sqlite internals rather than the underlying OS?

I have indeed seen page/pager mentioned a lot in the Sqlite C code, but I don't know what that is about.

So I guess what I am requesting is a pragma or whatever to force pages to be contiguous or whatever is required so that the efficient :memory: logic can be used.

I see code in Sqlite referring to posixIoMethods and nolockIoMethods etc. I'm not sure if this is the right place to be making a determination that we can switch to efficient :memory: IoMethods (I don't see something like that named).

BFN. Paul.

(21) By Stephan Beal (stephan) on 2021-05-18 02:13:33 in reply to 20 [link]

> So I guess what I am requesting is a pragma or whatever to force pages to be contiguous or whatever is required so that the efficient :memory: logic can be used.

Generically speaking, that would likely be impossible to do: code which needs to perform inserts would need to continually reallocate that block to add space for new records, invalidating any pointers held to it by higher-level code. Any and all pointers which referenced anything in that block would need to be accounted for and adjusted on every such realloc. You're only interested in the read-only case, but the VFS itself doesn't know that and has to have code paths for write mode as well. An mmap allocation is effectively one contiguous block, but any malloc-based non-trivial read/write data structure (a btree variant, in this case) cannot possibly be sensibly managed that way except, perhaps, in very specialized use cases which are far more limited in scope than sqlite's internals.

(22) By Paul Edwards (kerravon) on 2021-05-18 03:06:06 in reply to 21 [link]

Hi Stephan.

Thanks for the technical details regarding btree writing.

Is there some technical barrier to abandoning the VFS logic completely and moving over to the :memory: logic? I assume that is a separate set of functions.

Perhaps after a "pragma switch_to_memory" is given.

Or approach it from the other direction, such as ":memory:actually_use_this_file_with_mmap_instead"?

Thanks. Paul.

(25) By anonymous on 2021-05-18 18:14:38 in reply to 17 [link]

Does [sqlite3_deserialize](https://sqlite.org/c3ref/deserialize.html) fit your needs?

(16) By Larry Brasfield (larrybr) on 2021-05-18 00:33:33 in reply to 14 [link]

> I have put some printfs into sqlite 3.32.3 and found that when mmap() is being used, there are numerous calls to sqlite30sFetch.

> But when :memory: is being used, there are none.

> So before getting into the intricacies of kernel mmap implementations, I'd first like to get Sqlite pragma mmap to go through the :memory: code path.

The significant difference between those storage arrangement is: mmap necessarily retains backing of the mapped memory by a persistent store and must involve calls to the OS (or interrupts that reach the OS) at times, whereas the :memory: VFS is purely an application-level emulation of a (typically) OS-level file API. The latter has less work to do than the former.

I think your desire to get mmap to be as fast as :memory: is due to valuing speed over the modularity that would have to be sacrificed in order to get the speed.

(18) By Paul Edwards (kerravon) on 2021-05-18 00:41:33 in reply to 16 [link]

Hi Larry.

I do NOT want to sacrifice modularity. I am perfectly happy to keep all the existing code unchanged.

What I don't understand (and hopefully you can explain) is *at the Sqlite level*, what is the difference between a mmap pointer and :memory: pointer?

If the OS somehow manages to make mmap data slower (even mmapped from a ram disk), so be it. But I don't know why Sqlite would *need* to differentiate between two pointers and *must* go through a different codepath for mmap files.

Thanks. Paul.

(23) By Kees Nuyt (knu) on 2021-05-18 12:14:05 in reply to 18 [link]

> But I don't know why Sqlite would need to differentiate between two pointers and must go through a different codepath for mmap files.

I guess that is because a [:memory:][1] database can't  be fully [ACID][2]. When the program exits or crashes, the database is lost, so it is not `Durable`. A :memory: dataabase is so to speak just a volatile page cache without backing store. 

A file backed database (mmap or not) can be durable, with journaling and proper synchronization to permanent storage.

See also [Architecture][3]

[1]: https://www.sqlite.org/inmemorydb.html
[2]: https://www.sqlite.org/transactional.html
[3]: https://www.sqlite.org/arch.html

(24) By Simon Slavin (slavin) on 2021-05-18 13:26:56 in reply to 18 [link]

A memory-mapped file is just a file on backing store with a big cache.  Sometimes you have to access backing store.  For instance, you have to check the backing store to find out whether a file with that name already exists.  A journal file can be used, so checks need to be done about whether a file with that name exists, whether it's time to do a checkpoint, whether the journal file can be shrunk, etc..  And even if everything involved is in solid state, that's extra operations, and extra time needed for the operations.

A file in <code>:memory:</code> involves only access to memory.  It's all one big database, so you don't need to check whether a file already exists.  No journal can be maintained (because if you lose memory you will obviously lose all your data anyway).

(11) By Warren Young (wyoung) on 2021-05-17 12:56:35 in reply to 9 [link]

No, because once again your assumptions are wrong.

A RAM disk is not the same thing as a `brk(2)` block assigned persistently to the process. It's on the other side of a kernel wall, so all accesses have to go through the kernel, just as access to any other type of "disk" does. I'd therefore expect a RAM disk to be considerably slower than `malloc()`.

While it's fine to bring assumptions to the table when testing hypotheses, the system isn't going to lie to you about timing. When you benchmark the system in a given way, it tells you the truth about how long that sort of operation takes. If that falsifies your hypothesis, you have to either give up the initial assumption as disproven or you have to dig deeper.

Since digging deeper involves investigating how your OS works, it's off-topic here until you can pull that investigation back full-circle to show that SQLite is operating in an inefficient manner. Most likely, you'll find that SQLite is doing the best it can given the portable facilities available.

Until then, your results showing that different methods give different speeds is a valuable result: it tells you which method is fastest, so if that's all you wanted to know, you're done.

(26) By Max (Maxulite) on 2021-05-21 09:37:33 in reply to 1 [link]

Sorry, I'm a little late to the post, I read it, but by thoughts kept accumulating until final actions.
My assumptions that prevented me from just skipping this topic was
* Working and knowning file-mapping for ages, I have an assumption that the mechanism used heavily by modern OSes to save memory space, more or less should work without any penalty. So if a file sector is loaded into memory after the first hardware exception, there should be almost no virtual penalty for the consequence acceses
* Observing the sqlite team positive perfectionism, I don't expect any sqlite logical penalty unless it is absolutely necessary.

So I did a test (it's Windows, but I assume some minor hardware/system variations are possible with other desktop/server OSes)

<code> create table [testtable] ([id] integer primary key autoincrement, [value]) </code>

<code> insert into testtable (value) values(zeroblob(10000))</code>

The last one executed 10000 times, so we have a 100 MB database

And the test to observe the timing and reading is the cross join

<code> select count(*) from testtable, testtable </code>

I chose it because it doesn't consume dynamic memory, heavily resides on reading (since the 100 MB db doesn't fit in sqlite cache), consumes very little in CPU space and because of this is very prone to any delays at VFS reading level.

My initial tests showed more or less the same difference you mentioned. The memory query took 9 seconds, memory-mapped query took 21 seconds. For both times the cpu and total was almost the same, so more or less no waiting at the I/O level. And by the way, the full-scale file query took 2 minutes and more. Historically I have a "tweak" layer in my admin gui showing the total VFS read (xRead or xFetch) and the average speed of reading. On my machine (Windows 7, not so modern) the mem-mapped version read (fetched) 205 GB and the speed was 9GB/Sec

So something was not right. I looked at what requests came to my tweaked fetch routine, they were mostly equal to page size (4096). So naturally the next test was with the same database with the page increased to 65536. And this time the test performed much better. 10 Seconds, 1 TB total read with 117 GB/Sec fetch speed. So despite the fact that sqlite had to read more pages due to different structure mapping to bigger pages, the new x10 speed fully compensated this.

So, my version is that for default page size, the penalty is due to granularity of memory mapping view requests related to the page size. And you can probably easily overcome this with a bigger page size. I don't know though whether the improvements are possible for smaller page sizes without sacrificing the clarity of the sqlite VFS/IO level.

(27) By Paul Edwards (kerravon) on 2021-05-26 03:09:46 in reply to 26 [link]

Also sorry for the delay. Thanks Max for your test results. I'll see if we can get a test done on that. Meanwhile more tests are suggesting that the problem might be that the OS can cache local files fine, giving mmap() close to expected performance, but network drives are not being cached, and maybe we need some way of telling Linux that this file is read-only, please cache it instead of constantly checking to see if another system has updated the data, so reread it. But that's just a theory at the moment.

(28) By Paul Edwards (kerravon) on 2021-05-26 09:26:22 in reply to 27 [link]

I have test results now. Without mmap(), just normal disk access, a local hard disk is producing results just a little slower than loading into memory.

ie the overhead of read() instead of memory pointers is not dramatic, and the OS (Linux) appears to have cached the file into memory as expected.

This small performance difference, which could theoretically be completely eliminated with mmap(), is not really important to us, but maybe we will activate it anyway.

I normally don't have any dealings with networks, and was not aware that the disk was on a network, and that network disks are apparently ridiculously slow (about 6 times slower in the latest test), and probably bypassing the OS's caching ability.

I'm not sure what options exist if they insist on using network drives instead of local drives. I was thinking that we could have a softlink in /dev/shm pointing to the network drive and maybe that would encourage Linux to cache it. But I'm expecting the problem to be passed on to some network people to admit that networks are fundamentally ridiculous and that applications that require performance should not do network access.

(29) By Max (Maxulite) on 2021-05-26 12:43:10 in reply to 28 [link]

Great to hear you found the bottleneck. I admit that my test was not a very realistic in that it had a very high IO/CPU ratio. Just tried the one involving more or less the linear reading of the 100M file already in the system cache, the time of the disk db was almost as good as the equivalent one in the memory.   

An interesting side point. I found that :memory: databases indeed have some special treatment inside sqlite. My tweaking for VFS statistics never worked for memory databases. I suspected that I should just find a correct vfs. The one that jumped in the sources was "memdb", but actually it's another vfs-semantic memory database interface probably for serialization (available for finding only when sqlite is compiled with SQLITE_ENABLE_DESERIALIZE flag) and it looked like the bases opened with it works slower than native ":memory:" bases and even slower than memory-mapped disk bases (when the system already cached the data). I suppose this is an illustration of the phrase from the docs "Skipping the copy step is what makes memory mapped I/O faster". Probably ":memory:" databases are also able skip the copy step.

(30) By Paul Edwards (kerravon) on 2021-06-01 02:44:34 in reply to 29 [link]

This problem seems to have drawn to a conclusion as of today, so I will document my understanding for closure.

If network disks were massively slower than local hard disks, I don't think anyone would use them, and it would be a well-known problem.

The network team didn't believe that a local disk was faster that a network disk, and they said the /tmp test was invalid because it was virtual memory. They were only convinced when they saw a real local disk performing at the same speed as /tmp. This is to be expected, as a read-only disk file of 1.6 GB is read in just a few seconds and then should be permanently cached in memory.

The characteristics of the network disk is that in the beginning, maybe for a couple of minutes, it is about 2.5 times slower than a local disk, but over time it ends up being 20 times slower. I know of no hardware with such characteristics so I said this was an additional problem that should be investigated, on top of the failure of the network file to be cached.

I can't see how it would be possible for Linux to actually cache a network disk, because it has no way of knowing if another system updated the file. I would expect Linux needs some special configuration option to say "this network file is static, please cache it whenever it is opened".

No further investigation is going to be done, as the conclusion has been drawn that it is a problem with NFS3 and can't be changed.

I know nothing about networks, and I am not in that group, and it is way beyond my authority, so I can't do anything further either.

The solution being chosen appears to be to put the database file on local disks as required, and point to it via a softlink from the network drive.

This is a very strange situation in 2021.

(31) By Rowan Worth (sqweek) on 2021-06-01 03:47:20 in reply to 30 [link]

The fundamental issue with networked disk vs. local disk is latency. This can be accomodated in many workloads by queuing up i/o requests (see also readahead, asynchronous i/o), however any application which relies on the idiomatic read/write syscalls will suffer somewhat against a network fs especially if it is seeking around.

sqlite's workload is very seek heavy and there's no reasonable way to anticipate which pages of the database file will be needed in advance, so it is especially affected by latency.

> I would expect Linux needs some special configuration option to say "this network file is static, please cache it whenever it is opened".

Yes, many network caching options and configurations exist, each with various features/compromises. But sqlite calls fsync() regularly to ensure database consistency and this likely bypasses any advantage you might get from caching.

If the DB legitimately never changes you can tell sqlite when opening it to use [immutable mode](https://sqlite.org/uri.html#recognized_query_parameters), and with a sufficiently large [page cache](https://sqlite.org/pragma.html#pragma_cache_size) that should ensure the DB isn't read more than once. However in that situation I would also question what benefit the DB is offering over eg. a flat file stored in /dev/shm

(32) By Paul Edwards (kerravon) on 2021-06-01 04:31:31 in reply to 31 [link]

Hi Rowan. Thanks for the pointer to immutable mode, but can I just elaborate my expectations.

The 1.6 GB read-only database may be read by 10 processes simultaneously. What I'm looking for is for Linux to fully cache that file in memory, but with just one copy of it. It's not something for Sqlite to do, although maybe fsync() interferes with Linux's ability to do this?

The solution of copying the file to /dev/shm will work, and so will an ordinary local disk. But that means a more difficult deployment instead of just sticking the file on a network drive for everyone to use.

If you think that immutable mode might solve this problem, I can suggest it.

(33) By Rowan Worth (sqweek) on 2021-06-01 06:14:18 in reply to 32 [link]

Presuming that the DB is legitimately read-only (ie. it's not accidently/unintentionally being modified and thus fsync is not being called), I wouldn't expect immutable mode to make a significant performance difference.

I would certainly expect linux to keep the file contents in its kernel page cache, and I have observed that behaviour on the systems I work with when dealing with unchanging files on network storage. This is not really related to sqlite though, as you say.

It might be worth running a baseline test against an ordinary file to see if your nfs is broken in general or just when dealing with sqlite DBs, if that hasn't yet been established (I couldn't see a clear conclusion with a quick scan of the thread).

(34) By Paul Edwards (kerravon) on 2021-06-01 08:25:45 in reply to 33 [link]

Hi Rowan.

How is it technically possible for Linux to cache the contents of a network file? Doesn't it need to continually go back to the network to find out if some other system has updated the file?

Even if the file is read-only on this system, one Linux system has no way of second-guessing what other systems are doing, does it?

Thanks. Paul.

(35) By Rowan Worth (sqweek) on 2021-06-01 09:14:14 in reply to 34 [link]

```
linux client: "yo fileserver I'm opening this file because I'm interested in reading its contents"
fileserver: "roger that, you want any data in particular?"
linux client: "give me 16384 bytes starting from byte 0"
fileserver: "here you go"
linux client: "thx, btw I'm caching this"
... time passes...
fileserver: "hey just fyi another client modified this file you're interested in"
linux client: "roger that, I'll invalidate my cache"
```

I'm not an nfs protocol expert and don't know if it looks anything like this, it's just a simple example of "technically possible."

(37) By Clemens Ladisch (cladisch) on 2021-06-02 08:09:08 in reply to 35 [link]

> fileserver: "hey just fyi another client modified this file you're interested in"  
> linux client: "roger that, I'll invalidate my cache"

Windows (SMB/CIFS) calls this opportunistic locks (oplocks), NFS calls this delegations. (The server tells the caching client to drop/flush its cache *before* allowing another client to access the file.)

If the network connection is temporarily broken and the client uses the stale cache later, you get data corruption. This was a somewhat common problem with SMB; I guess recent protocol versions protect against this.

(36) By anonymous on 2021-06-01 10:46:50 in reply to 30 [link]

I have been using SQLite on network storage using Ceph replication (similar to the AWS block storage thingy). In which only one client gets access to the networked desk. And the mmap performance was exactly where you expect it to be. Reads were happening mostly from memory as the read blocks get cached (and for the record, if the SQLite connection cache is large enough to cover most of the working set then you will not see improvement with mmap, except for the case when you get frequent updates to the database from other connections, which invalidates the SQLite connection cache AFAIK)

I would expect NFS (and AWS' EFS) to exhibit a different read behaviour due to the possibility of some client changing the data that is cached by some other client.

(38.2) By Warren Young (wyoung) on 2021-06-02 11:04:51 edited from 38.1 in reply to 30 [link]

> If network disks were massively slower than local hard disks, …it would be a well-known problem.

It *is* a well-known problem. Maybe not well-known to you and yours, but I can dig up references from the dawn of network filesystem based computing in the early 1980s for this, if you require it.

> The network team didn't believe that a local disk was faster that a network disk

If you're testing matching storage on both sides of the connection, then of course the local disk will not be any faster than the actual remote *disk*, but the remote disk isn't attached to the local CPU in the same way the local disk is. There are a whole *pile* of other considerations stemming from that fact:

1. Local NVMe latency is going to be orders of magnitude faster than ping time to the remote NAS just on speed-of-light delays alone. The distance between the NVMe connector and the CPU is maybe a few cm long. The Ethernet cabling, switch fabric, internal MAC to CPU link, etc. is likely to come to tens of meters, so you're orders of magnitude off already. Ya canna change the laws o' physics, captin'! [A foot of copper is a nanosecond](https://hackaday.com/2012/02/27/visualizing-a-nanosecond/), and at today's speeds, nanoseconds matter.

    Once upon a time, speed-of-light delays only affected computer design at the extreme high end, but we've been at that point down at the low end now too, for a very long time. If you did nothing to a Raspberry Pi but double its physical size, scaling everything on that board evenly, it would *cease…to…work!*

2. The "stack" depth of a bare NVMe disk is a lot shallower than for

          Ethernet+ICMP+IP+UDP+kernel+NVMe+UDP+IP+ICMP+Ethernet

    Every packet of data to a typical remote NAS has to go clear down through the TCP/IP stack into the fileserver software, then back up that same long stack to return the result.

    And that's only for in-kernel network filesystems. If you're dealing with a userspace NAS technology like Samba, you have a kernel-to-userspace transition and probably a deeper file server code stack besides.

3. Cooperation delays aren't zero. You speak of having 10 readers, but at bare minimum that must cut per-reader bandwidth by 10 on the remote side, and it's only that low only if there is no arbitration among the readers, so that all readers can proceed in parallel, the ÷10 factor coming only from the shared disk bandwidth.

Let's get concrete. I just did a ping time test here to a nearby NAS on a quiet gigabit LAN and got an average of 350 µs over 25 packets. The read latency on a good NVMe SSD is likely to be 10 times lower than that, and with the ping test, you haven't even *considered* disk access latency yet.

If your own tests tell you there's an order of magnitude or two difference between two test cases, *believe the test result*, don't resort to argument-from-incredulity to talk yourself into believing that the test was wrong.

> The characteristics of the network disk is that in the beginning, maybe for a couple of minutes, it is about 2.5 times slower than a local disk

I'm stunned it's that low! I'd consider that wild success.

> over time it ends up being 20 times slower

*Just* a one order of magnitude speed drop? If after reading my arguments above you think that's unreasonable, go read them again. Point 1 alone shows you're experiencing an excellent result.

> it has no way of knowing if another system updated the file.

As others have shown, that's a solvable problem, but you can't solve it for free. All coordination I/O has to go between the nodes somehow, and that has a cost.

> the conclusion has been drawn that it is a problem with NFS3 and can't be changed.

NFS certainly has problems, but you can't charge all of the problems you're seeing to a network protocol alone. You can invent the bestest baddest speediest network file sharing protocol, and you'll still have the speed-of-light delays and coordination overhead among cooperating nodes.

TANSTAAFL.

(41) By Paul Edwards (kerravon) on 2021-06-04 03:35:24 in reply to 38.2 [link]

Thanks for the technical information, but I don't understand why you are expecting variable results from the same data file. ie sometimes 2.5X slower (with short runs), sometimes 20X slower (in long runs). ie compared to memory.

Basically if we use an input file with 1000 records in it, and it takes 1 minute (while accessing the read-only 1.6 GB database on a network drive), then I expect that if we process 10,000 records that it should take 10 minutes.

But that is not what we see. For a single job on a single machine, every time we process larger amounts of data it starts getting "exponentially" slower.

It is somewhat amusing that this word was used to report the problem instead of focusing on the numbers, because the developer thought it was an odd complaint, and said that yes, it will obviously increase exponentially as you give it larger input files, but when I was presented with the numbers myself, I asked him what his expectations were (like above), he did in fact expect it to SCALE, and he agreed that "exponential" was the wrong word to use.

I then told him IT WASN'T SCALING AS EXPECTED. And only then did he actually start investigating the problem and it was actually him who somehow thought to try putting the database in /tmp which he somehow knew wasn't on the network.

I had previously been telling him that disks are cached in memory, so it isn't an issue with disk speed. I didn't expect that in 2021 I would be in a commercial environment where we had uncached disks, which effectively bypasses decades of ridiculously high CPU clock speeds and ridiculous amounts of memory. I started using SMARTDRV.SYS in 1991 with MSDOS 5.0. After a fairly short number of years, the OSes started doing the caching automatically and unconditionally and without scope for user configuration, so that was the end of any disk caching interest.

Until someone said "NFS3" and someone else said "delegations".

(39) By Clemens Ladisch (cladisch) on 2021-06-02 19:02:49 in reply to 30 [link]

> it is a problem with NFS3

Correct. Delegations, which allow caching, were introduced in NFS4.

(40) By Paul Edwards (kerravon) on 2021-06-04 03:18:16 in reply to 39 [link]

Thanks for that very concise description of the most major problem, and the technical term "delegations" which I had never heard of before.