SQLite Forum

Parallel async page fetching from (read-only) db?

(1) By Dan Forsberg (dforsber) on 2021-01-01 22:09:28

Hello,

My apologies if this has been discussed (possibly many times) before. I've not been following this forum, and quick searches didn't turn up answers to my question.

In the cloud, where the compute, control, and data planes are separate, the way to make things work fast is concurrency, async IO, metadata/index caching, etc., mostly due to latency and the inherent nature of horizontal scalability (concurrency) and high/broad bandwidth. So this is the direction I'm heading and the basis of my questions. I have searched a bit about the KV store backends for SQLite3 etc., but only scratched the surface. I hope I'm not trying to get something out of SQLite3 that it is not fit for; I've lived long enough to know where it came from, and the serverless cloud is a "bit" different customer :D.

I know that SQLite3 runs a single query at a time (at least that's what I've explained to myself) and uses threads to speed up "computation", but I'm not entirely sure whether it also speeds up concurrent page fetching from disk(?). In other words, does SQLite3 try to fetch multiple pages at once, for example when fetching multiple rows, or is it serial in the sense that each page is fetched once its page id is found via the b-tree? I get this impression of serial data fetching from the Query Planning documentation.

I'm using better-sqlite3 with Node worker threads, for read-only purposes, to fetch data from a db on a network filesystem (Lambda with EFS). I can get pretty decent performance by splitting the row numbers evenly over a number of worker threads in a single process (Lambda). In other words, I get the data much faster with, say, 10 workers compared to the main thread alone for the same number of rows fetched from the SQLite3 db, though it is a bit suboptimal to run multiple SQLite3 engines. Anyway, the workload is network IO bound.
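Roughly, the pattern looks like this -- a minimal sketch, where `db.sqlite` and the `data` table are placeholders for my real schema:

```typescript
// Minimal sketch of the pattern above (CommonJS-style entry point).
import { Worker, isMainThread, parentPort, workerData } from 'node:worker_threads';
import Database from 'better-sqlite3';

const WORKERS = 10;

if (isMainThread) {
  const rowids = Array.from({ length: 10_000 }, (_, i) => i + 1);
  const chunk = Math.ceil(rowids.length / WORKERS);
  for (let i = 0; i < WORKERS; i++) {
    const w = new Worker(__filename, {
      workerData: rowids.slice(i * chunk, (i + 1) * chunk),
    });
    w.on('message', (rows) => {
      // Merge each worker's slice of rows on the main thread.
    });
  }
} else {
  // Each worker runs its own SQLite3 engine against the same file.
  const db = new Database('db.sqlite', { readonly: true, fileMustExist: true });
  const ids = workerData as number[];
  // Mind SQLITE_MAX_VARIABLE_NUMBER when building long IN lists.
  const stmt = db.prepare(
    `SELECT * FROM data WHERE rowid IN (${ids.map(() => '?').join(',')})`
  );
  parentPort!.postMessage(stmt.all(...ids));
  db.close();
}
```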

I have not started to read the code yet, so apologies for the dumb newbie questions. I have many of them, and I'm just dumping them here in a single post :). I can split them up if needed.

- Does SQLite3 cache the whole B-tree (for rowid and/or index) in memory if the cache size is big enough for a db? Or is the B-tree mixed in with the data in a way that caching it is not "automatic", i.e. the B-tree can't easily be read off the db file on its own? I'm trying to understand whether, for network-IO-bound workloads, SQLite3 can pick exactly the pages it needs from disk without multiple round trips to hit B-tree pages, and thus effectively work like a KV store (when fetching rows by rowid) where the keys are already known.

- Does SQLite3 optimise e.g. "SELECT * FROM table WHERE rowid IN (1,2,3,...,K)" queries in a way that it would be able to launch N page fetches in parallel (e.g. equal to the number of configured threads)? If not, how big a job would it be to marry SQLite3 with libuv for async IO (and thus at least partially concurrent page fetches)? Let's say you get IO handles from libuv and "materialize" them (wait on the mutex, "await the promise") only when you really need them (e.g. when sending the data back to the client or starting to execute filters).

- Finally, is it possible (or how hard would it be to make possible) for the query planner to support full (sub)query pushdown to a virtual table (extension)? If so, is there a documented data format that the query would need to return?


Cheers,
- Dan

(2) By Keith Medcalf (kmedcalf) on 2021-01-02 02:56:41 in reply to 1

> I know that SQLite3 runs a single query at a time (at least that's what I've explained to myself) and uses threads to speed up "computation"

SQLite3 is an in-line, in-process, run-on-the-calling-thread library. It does not use multiple threads to speed up anything. Everything it does is done on the single thread of the client process that is making the call into the SQLite3 library.

The only exception is that SQLite3 may, if and only if configured to do so, spin up worker threads to assist in the process of sorting. Sorting is effectively done using a partitioned merge sort: separate threads may sort partitioned subsets before they are merged back into the resulting sorted set. (For example, if there are 50,000 records to sort, each block of 1,000 records may be sorted individually to a temporary file, and after all 50 partitions are sorted they are merged back into a single result; the optional worker threads sort those partitions in parallel before the merge.)
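For reference, the knob for those sorter helpers is PRAGMA threads. A minimal sketch via better-sqlite3 (the `data` table and `payload` column are placeholders; the pragma is a no-op unless the SQLite build has SQLITE_MAX_WORKER_THREADS > 0):

```typescript
import Database from 'better-sqlite3';

const db = new Database('db.sqlite');

// Allow up to 4 helper threads for sorting within a single query. Everything
// other than the sort partitions still runs on the calling thread.
db.pragma('threads = 4');

// A large ORDER BY is the kind of statement where the partitioned
// merge sort can actually use those helper threads.
const rows = db.prepare('SELECT * FROM data ORDER BY payload').all();
```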

Other than this single, solitary, user-configurable exception, all processing by SQLite3 occurs sequentially, step after step, on the thread of the caller. This includes all the operations required to return, one after another on the caller's thread, each record of the result set as it is found.

This means that when a client makes a call into the SQLite3 library, the single thread of the caller is used to sequentially perform all the operations, which may include one or more I/O operations, until a "result row" is returned to the caller. Lather, rinse, and repeat until there are no more results.

> but I'm not entirely sure whether it also speeds up concurrent page fetching from disk(?). In other words, does SQLite3 try to fetch multiple pages at once, for example when fetching multiple rows, or is it serial in the sense that each page is fetched once its page id is found via the b-tree?

There is no parallelization whatsoever. Every operation is performed one-after-each on the thread of the caller. Pages are retrieved from persistent storage to the SQLite3 page cache when they are found to be needed. There is no prediction and no multiple fetches, no batching, and no concurrency or multithreading.

If during the execution of fetching a row of results it is determined that a page is required that is not in the SQLite3 page cache, then an I/O operation is issued on that thread to the OS to perform the I/O, for the one page that is required. What the underlying OS or Filesystem or channel controller or physical hardware chooses to do when it receives this request is up to the OS or filesystem or channel controller or physical device. The filesystem and OS may decide to read a "whole gob smack of pages" but will return only the one page requested to the SQLite3 program making the request. Whatever changed the I/O request from a single page of data to whatever gob smack was requested is in charge of dealing with the consequences of its own decision. SQLite3 wants the one page it requested, only the one page it requested, and nothing but the one page it requested.

> Does SQLite3 cache the whole B-tree (for rowid and/or index) in memory if the cache size is big enough for a db?

All pages read from persistent storage via an OS syscall are cached in the SQLite3 page cache. No page is treated differently from any other. If the SQLite3 page cache is big enough, then every page will eventually reside in the SQLite3 page cache, assuming that at least one read (request to the filesystem and OS to fetch the page from persistent storage) is done of each page, and there are no other writers that cause the cache to be invalidated. Any page not read (that is, not required as part of the process of generating a result row) will not be loaded into the SQLite3 page cache. Managing the SQLite3 page cache is handled by SQLite3. The filesystem or OS or storage channel or storage device may operate other levels of cache below this, which are not within the knowledge or control of SQLite3.
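For illustration, sizing that page cache from better-sqlite3 is a single pragma; a minimal sketch (the path is a placeholder; negative cache_size values mean KiB rather than pages, per standard SQLite):

```typescript
import Database from 'better-sqlite3';

const db = new Database('db.sqlite', { readonly: true });

// Ask for a ~256 MiB page cache. If the whole database fits, every page
// read once stays cached for the life of the connection, as described above.
db.pragma('cache_size = -262144');
```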

> Does SQLite3 optimise e.g. "SELECT * FROM table WHERE rowid IN (1,2,3,...,K)" queries in a way that it would be able to launch N page fetches in parallel

No. SQLite3 will execute the request in-line, step after step, returning the result set to the caller one row at a time. In other words, the first time you STEP that statement, SQLite3 will build the temporary set for the IN, then retrieve the first value (1), then retrieve (if not already in the SQLite3 page cache) all the pages required to generate the first result row, which is then returned to the caller. Lather, rinse, repeat for the next call (which will now look for the row with rowid == 2, retrieving the needed pages one after another if not already in the cache) in order to return the next row to the caller. This continues until there are no more result rows to return, at which point the caller is told SQLITE_DONE. There will be no parallelism and no concurrency. Each step is carried out entirely on the thread of the caller, one at a time. If you have 5000 threads each requesting one record, then they will be serialized because only ONE thread at a time can be executing inside a connection.
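In better-sqlite3 terms (the module the OP mentioned), that stepping is what .iterate() exposes; a minimal sketch, assuming a hypothetical `data` table:

```typescript
import Database from 'better-sqlite3';

const db = new Database('db.sqlite', { readonly: true });

// Each loop iteration is one synchronous step on this thread: the pages
// needed for that single row are fetched (one at a time) during the step,
// and nothing is prefetched for the rows that follow.
const stmt = db.prepare('SELECT * FROM data WHERE rowid IN (1, 2, 3)');
for (const row of stmt.iterate()) {
  console.log(row);
}
```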

As for the complexity of adding parallelism of some sort, that is the realm of server-type database products, and there are already lots of those. SQLite3 is an in-process, in-line, serverless database engine (and no, this is serverless as in not having a server, not "serverless" as in the newfangled buzzword crap that is totally meaningless, misleading, illogical, and a downright lie).

(3) By Wout Mertens (wmertens) on 2021-01-02 07:29:53 in reply to 2

> All pages read from persistent storage via an OS syscall are cached in the SQLite3 page cache.

I presume the shared memory that WAL uses isn't used for the page cache? At first glance, that seems like a nice optimization...

@dforsber:

> In other words, I get the data much faster with, say, 10 workers compared to the main thread alone for the same number of rows fetched from the SQLite3 db.

I'm also looking at better-sqlite for my https://www.npmjs.com/package/strato-db wrapper. Did you try with 1 worker vs main thread?

The problem with better-sqlite and http servers is that it isn't asynchronous, and thus it blocks the server for however long the queries run. I opened https://github.com/JoshuaWise/better-sqlite3/issues/32 for that. Now that it can be run in WebWorkers, I wonder what the effect of multiple threads is on throughput and memory use.

It seems to me that, given unlimited cores, running multiple readers at once will only make the query processing run in parallel, not the query I/O. So if it takes 0.5 ms to get all the data and 0.2 ms to process it all (in SQLite+JS), then N worker threads would reduce that to 0.5 ms + 0.2 ms/N. Which seems like a minimal gain.

However, a single worker thread would free up the main thread for HTTP handling, so that's definitely a plus for latency. 100 simultaneous requests would still have to wait their turn for results, but they would get immediate HTTP responses.

(4) By Keith Medcalf (kmedcalf) on 2021-01-02 09:30:11 in reply to 3

> I presume the shared memory that WAL uses isn't used for the page cache?

Define what you mean by "shared memory that WAL uses". A process that has a database open uses memory that is "shared" with every other process operating on the same computer (it also "shares" the usage of CPU and persistent storage with other processes).

(5) By Wout Mertens (wmertens) on 2021-01-02 10:39:02 in reply to 4

> Define what you mean by "shared memory that WAL uses"

If you use WAL journaling, it creates a .wal file and a .shm file. All processes opening that DB have access to the block of memory represented by the .shm file.

So if that shared block contained the page cache, that would optimize memory use for the whole system, but I can imagine that being difficult to orchestrate.

(6) By anonymous on 2021-01-02 12:21:07 in reply to 5

The .shm file is an index to the pages in the WAL file, used to speed up page lookup. It does not contain the page cache.

(8) By Dan Forsberg (dforsber) on 2021-01-02 13:11:23 in reply to 2

Thanks for the prompt reply. It clarifies a lot about the usage of threads inside SQLite3.

> If you have 5000 threads each requesting one record, then they will be serialized because only ONE thread at a time can be executing inside a connection.

Please correct me if I'm wrong. With nodejs worker_threads I'm creating a new SQLite3 instance/connection in each thread and pointing it at the same database. So they won't share a connection and are thus effectively separate engines working in parallel against the same database. And as I understand it, reading from the database concurrently is allowed. Say we have 1000 of these Node processes and each of them uses 10 worker_threads; then we could have 10k SQLite3 connections to the same database.

> As for the complexity of adding parallelism of some sort, that is the realm of server-type database products, and there are already lots of those. SQLite3 is an in-process, in-line, serverless database engine

I was just hoping to find an embedded row-store db (OLTP-ish) (to pair with the DuckDB column store (OLAP-ish)), but with a modern async IO architecture that fits needs geared towards heavy parallel IO and streaming. As of today, scaling CPU fast is not an issue in the cloud. So the aim is to calculate everything as fast as possible, as concurrently as possible, and with as big a burst of data as possible.

Persistent RAM is already here as well, so I think the importance of this only grows. Given the nature of the cloud and Kubernetes, FaaS will become more dominant from the PaaS consumer point of view (e.g. FaaS on Kubernetes or cloud-natively) and will raise the opportunity for serverless/embeddable dbs.

Given the nature of the straightforward page fetching, one page at a time, I'm tempted to think it could be a nice challenge to model the pages as async handles (promises). However, the real challenge is the query planner and executor, i.e. whether the executor is pipelined (multiple stages/steps) enough to allow batching. (I'm talking about possible modifications to the code, not the way it currently works.) I'm not very positive about this, given your response.

Maybe beating an old horse doesn't pay off and would be more like swimming against a heavy current. So, finding an excellent SQL query graph generator would probably be a proper starting point.

That said, I think SQL as an API will only grow in the future and I would be happy to see SQLite3 grow with it.

(9) By Warren Young (wyoung) on 2021-01-02 14:41:00 in reply to 8

Quoting the Node docs:

> Workers (threads) are useful for performing CPU-intensive JavaScript operations. They do not help much with I/O-intensive work.

Also, if each Node worker thread is implemented by an OS thread, your 10k threads are likely to cost on the order of 10 GiB of RAM (roughly 1 MiB per thread) just for the per-thread stacks.

Use some sort of worker pool instead, sized at a small multiple of the CPU core count at most. More will not help.

(10) By Dan Forsberg (dforsber) on 2021-01-02 15:27:43 in reply to 9

Node IO is efficiently async, so that holds in Node's own domain, but not with SQLite3, which blocks the (main) thread. [In my case the execution env is AWS Lambda, and the cost is based on compute time and the size of the container (memory). So the IO should happen in a big burst and complete fast, not as serialized calls processing pages one by one.]

(21) By Keith Medcalf (kmedcalf) on 2021-01-02 21:49:08 in reply to 10

I/O in SQLite3 blocks the caller's thread, not the main thread. There is no point in performing asynchronous or overlapped I/O from the perspective of the caller, since the caller cannot "make progress" until the required data has been made available -- that is, the requirement for the I/O is synchronous. It is not as if the calling thread can "go off and compute something else" while waiting for the I/O to complete.

Multithreading is merely a continuation of the spectrum of ways of achieving the concept, developed in computer operating systems decades ago, called multiprogramming, which is based on the nifty idea that while one "thing" is waiting for something to occur, some other "thing" can be doing something else, the objective being to keep all available resources 100% utilized at all times.

(11) By Wout Mertens (wmertens) on 2021-01-02 15:51:29 in reply to 9

Yes, but the Node docs also clarify that workers don't help with I/O because Node itself is already better at it.

SQLite doesn't do threads, so Node worker threads running SQLite synchronously will actually see some benefit.

A worker pool will most likely still be a good idea.

(12.2) By Warren Young (wyoung) on 2021-01-02 16:14:57 edited from 12.1 in reply to 11

> A worker pool will most likely still be a good idea.

Agreed, because there's a vast difference between an I/O worker pool running 10 threads on an 8-core box and a Node process spawning 10k OS threads all banging on the I/O subsystem, which likely hasn't got much parallelism to it in the first place.

To be fair, SQLite isn't fully I/O bound, so maybe you could raise that limit some, say to 20 or 30. But 10k? Ludicrous.

(13) By Dan Forsberg (dforsber) on 2021-01-02 17:27:03 in reply to 12.2

Each worker_thread runs its own independent Node instance (its own event loop and V8 isolate), so they run fully in parallel; only passing the results to the main thread requires coordination.

Yes, that 10k was just an example of the possibilities of Lambda: 1k Lambda invocations, each having e.g. 10 worker_threads, each having its own SQLite3 connection. In effect, that is 1k processes with 11 threads each (main + 10 workers).

(15) By Warren Young (wyoung) on 2021-01-02 17:33:20 in reply to 13

Even worse, then.

If you want 10k "threads" on a server without heroic levels of hardware, you need some sort of green threads implementation.

This is how Erlang gets to 100k "processes" (actually green threads) on modest hardware.

(16) By Dan Forsberg (dforsber) on 2021-01-02 17:40:22 in reply to 15

As a consumer of the cloud, I don't have to care about the actual implementation; I just have an independent container running my code with dedicated CPU and memory. In other words, I'm not the one trying to build the FaaS.

(17.1) By Wout Mertens (wmertens) on 2021-01-02 18:33:37 edited from 17.0 in reply to 16

SQLite requires local storage, so unless you want to pay extremely big bucks, you're better off using a pool of connection workers.

If you want to improve parallelism beyond that, it has to happen at the application layer.

FaaS and SQLite don't mix, except if your db is completely read-only and you have a mechanism for updating it and can handle the rollout lag inconsistencies.

(18) By Dan Forsberg (dforsber) on 2021-01-02 19:31:38 in reply to 17.1

We use EFS (network filesystem, like NFS) with Lambda, and yes, pool of connection workers. Not that expensive, but not yet as optimal as it could be.

Even the pool of connection workers has to be managed at the app level, as there is no common query executor/planner that spreads the tasks between multiple engine instances/connections.

[IMO the term connection is a bit misleading as the database engine and database file are separated and connection is usually used with client/server databases.]

(19) By Dan Forsberg (dforsber) on 2021-01-02 19:36:20 in reply to 17.1

FaaS and SQLite match very well with the paradigm of separating compute and storage. You pay only for storage if you don't execute any queries. Storage meaning something external to the FaaS, i.e. not the process local disk/mem.

I would think that an engine upgrade is similar to the server-side case, except that it is easier with FaaS, or more specifically wherever data and compute are separated (no need to upgrade a running cluster, no need to copy data around, etc.).

(20) By Wout Mertens (wmertens) on 2021-01-02 21:35:10 in reply to 19

But how will the FaaS SQLite instance access the storage? SQLite does not support writes on network filesystems?

(23) By Dan Forsberg (dforsber) on 2021-01-02 22:28:51 in reply to 20

In this case it is AWS Lambda with EFS, and EFS plays like a locally mounted filesystem from SQLite's perspective. Similar to NFS/Samba, it just looks like a local filesystem.

[FaaS = "Functions as a Service", a.k.a. serverless]

(24) By anonymous on 2021-01-03 12:38:41 in reply to 23

EFS might work correctly, but... you will most likely lose the ability to enable the WAL journaling mode, since the WAL-index shared memory cannot be created over NFS and the like.

Also, you are pushing the lowest level of IO in SQLite out to the network layer. Any lock you acquire/release will have to take a network round-trip. This will result in quite a bit of latency that cannot be amortized, so both latency and throughput will take a (pretty sizable) hit.

You might get away with decent performance (and scalability) for read requests, but writes will suffer greatly.
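For the strictly read-only case discussed in this thread, a conservative setup might look like the sketch below (the mount path is a placeholder). SQLite also has an immutable=1 URI parameter that skips locking entirely for files that never change; whether a given driver passes URI filenames through is worth verifying before relying on it.

```typescript
import Database from 'better-sqlite3';

// Open strictly read-only on the network mount; WAL is avoided entirely,
// since the .shm mapping is the part that breaks over NFS-like filesystems.
const db = new Database('/mnt/efs/db.sqlite', {
  readonly: true,
  fileMustExist: true,
});

// Expect a rollback-journal mode (e.g. DELETE) here; convert the file away
// from WAL before deploying it to the network filesystem.
console.log(db.pragma('journal_mode', { simple: true }));
```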

(25) By Warren Young (wyoung) on 2021-01-03 18:55:15 in reply to 24

Agreed. I suggest that the OP is better off with one of the distributed SQLite variants, the major ones being BedrockDB, rqlite, and dqlite. If configured for AP mode operation, the I/O latency savings will probably buy more speed than any of this messing about trying to squeeze data parallelism from a single-threaded stone.

(26) By Keith Medcalf (kmedcalf) on 2021-01-03 19:49:32 in reply to 25

Well, since there seems to be a preponderance of buzzword-loving these days, you could describe SQLite3's I/O model as being JIT optimized. It does I/O Just-in-Time and obeys the primary laws of optimizing I/O: (a) the fastest way to do I/O is to not do it; and (b) never do now what you can do later, because later you may discover that you didn't have to do it at all.

(28) By Dan Forsberg (dforsber) on 2021-01-04 11:32:20 in reply to 26

This law does not apply in the case of high-latency IO (e.g. over a network), and I wonder if it applies anywhere any more these days (are IO buses strictly serial, or do they allow concurrency?). It is fairly easy to imagine that if you need 10 rows from the db and you already have all the page ids for the data, the fastest way to get the data is to fire off all the page requests simultaneously -- assuming the system can take it (i.e. IOPS saturation doesn't happen). Similar to why batch processing is faster than one-by-one processing.

(31) By Warren Young (wyoung) on 2021-01-04 18:51:00 in reply to 28

Parallel requests are only a partial end-around the mathematics of Amdahl's Law. It's why nine women can't make a baby in a month, and why Fred Brooks is famous. Serialization and communication overhead collapses parallelism.

(32.1) By Keith Medcalf (kmedcalf) on 2021-01-04 20:38:36 edited from 32.0 in reply to 28

Well no. I/O is inherently serial, so there is no benefit to be obtained by issuing all the I/O requests at once. You still cannot "complete" until they are all complete.

Now then, assuming that each I/O request is for data at a different location (cylinder and head on spinning rust), the "flight director" can optimize the travel order between the different locations in order to retrieve the requested pieces in the least time. This is the 1950s-technology basis for performing queued or overlapped I/O. It requires that the actual I/O be co-processed. Data-phase disconnect does nothing if there is only one CPU handling both the main process and the I/O. This is why slow old mainframes in the dinosaur pen are so staggeringly faster than bitty-boxes: channel processors.

However, the overall task can still not be completed until all the I/O is complete, and that is still a serial process.

Plus, of course, managing the "flight director" properly also adds overhead time which may very well exceed any improvement gained.

However, in a general purpose multitasking OS overlapped I/O is achieved by having multiple processes making synchronous I/O requests (which then become queued or overlapped I/O at the "next level up"). From the point of view of a single process (or thread of execution) the overhead of pre-computing what I/O is required so that it can be issued as a queued/overlapped request is, except in very rare cases, greater than the maximum possible benefit that can be achieved thereby.

(33) By Dan Forsberg (dforsber) on 2021-01-04 21:33:36 in reply to 32.1

I guess we are speaking about two different things :). The cloud is a different world, as its abstraction layer is higher up: data is replicated to multiple locations, API/IO calls are handled by multiple instances, etc. SANs-and-whatever-things-they-nowadays-have-purposely-built, with CPUs that have multiple cores, caches, memory-mapped IO, and much more. My tests showed good results anyway.

(27) By Dan Forsberg (dforsber) on 2021-01-04 11:27:34 in reply to 25

Yes, I don't need writes, only reads. The db must be embedded in the process and the data shared over the network, so all client/server solutions are out of the question. Distribution likewise doesn't help for read-only workloads (e.g. rqlite, dqlite).

  • App-level splitting of the work across a worker pool is the way to go. Each SQLite instance then does its work at the same time as the others and hands its data back to the main thread.

  • Possible optimisations could include cache pre-warming so that the b-tree is already in memory, e.g. by scanning the rowid column.

  • Plus setting the page size big enough to fit a whole row, so that there is no need for multiple roundtrips when fetching a single row (the minimal unit of work for a SQLite3 instance). (A sketch of the pre-warming and page-size points follows.)
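A minimal sketch of the pre-warming, with placeholder path and table name (the page size, by contrast, can only be set while building the file, on an empty database or via VACUUM):

```typescript
import Database from 'better-sqlite3';

const db = new Database('db.sqlite', { readonly: true, fileMustExist: true });

// Pre-warm: scanning just the rowid walks the table b-tree and pulls its
// pages into the SQLite page cache before the latency-sensitive lookups run.
let warmed = 0;
for (const _ of db.prepare('SELECT rowid FROM data').iterate()) warmed++;

// The page size is fixed when the db file is built; change it on an empty
// database or via VACUUM in the build step, e.g.:
//   PRAGMA page_size = 65536; VACUUM;
```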

(22) By Keith Medcalf (kmedcalf) on 2021-01-02 22:24:51 in reply to 19

I'm not up on all the hipster doublespeak of the day. What is FaaS? Does that stand for Floozies-as-a-Service or F*ck-as-a-Service (a new acronym for something that has been around since time immemorial)?

(14) By Dan Forsberg (dforsber) on 2021-01-02 17:30:08 in reply to 11

SQLite3 doesn't do IO via Node but directly with OS system calls; that's why the Node docs don't apply as such. E.g. the better-sqlite3 module embeds the binary, which runs in-band on the main Node thread (event loop). That's the reason the Node event loop is blocked. Node uses libuv behind the scenes, which uses threads for blocking FS system calls (roughly speaking).

(7) By anonymous on 2021-01-02 12:31:51 in reply to 1

In SQLite3, concurrency is achieved by having multiple connections to the same database file running in different execution contexts (threads/processes). In its default configuration, SQLite3 allows either many readers to access the database at once or a single writer. If the journaling mode is set to WAL, then the writer can run in parallel with the many readers.

Furthermore, there is a specific Begin Concurrent branch that includes a new transaction mode allowing multiple writers to proceed in parallel while keeping the commit operation serialized; the different threads/processes, each with a SQLite connection, can proceed with a concurrent transaction, but only one of them will commit at a time (they will block each other). In this mode, conflicts are checked at commit time based on the set of pages accessed by each transaction, and conflicting transactions will fail, returning an error to the caller. With careful schema/query design you can greatly minimize the potential for page conflicts.

For my projects that require concurrency, I usually use the Begin Concurrent branch with retry logic in the application itself to recover from conflicts. (A sketch of that retry pattern is below.)
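A sketch of that retry logic, assuming better-sqlite3 compiled against the begin-concurrent branch (which is what provides the BEGIN CONCURRENT syntax; stock builds will reject it). The `kv` table in the usage comment is a placeholder:

```typescript
import Database from 'better-sqlite3';

const db = new Database('db.sqlite');
db.pragma('journal_mode = WAL'); // BEGIN CONCURRENT requires WAL mode

// Run `work` inside a concurrent transaction, retrying when the commit
// detects a page-level conflict with another writer.
function withConcurrentRetry(work: () => void, attempts = 5): void {
  for (let i = 0; i < attempts; i++) {
    db.exec('BEGIN CONCURRENT');
    try {
      work();
      db.exec('COMMIT'); // conflicts surface here as a thrown error
      return;
    } catch (err) {
      db.exec('ROLLBACK');
      if (i === attempts - 1) throw err; // out of retries
    }
  }
}

// Usage:
//   withConcurrentRetry(() =>
//     db.prepare('UPDATE kv SET v = v + 1 WHERE k = ?').run('hits'));
```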

(29) By anonymous on 2021-01-04 12:06:50 in reply to 7

But is the Begin Concurrent branch a live branch? I.e., does it merge trunk changes on a regular basis? Per official release?

I don't remember reading anything about that anywhere. I did read that Expensify blog post, so there are paying customer(s) using it, hinting at long-term support, but that's just a guess, not an official statement from Richard.

(30) By Warren Young (wyoung) on 2021-01-04 18:46:52 in reply to 29

In lieu of an official statement, I'll observe that the branch in question was last touched 3 weeks ago. Seems pretty "live" to me.