WASM build & multithreading

(1) By mlaw (tantaman) on 2022-10-30 13:42:03 [source]

I noticed that the WASM build is built with -DSQLITE_THREADSAFE=0 https://sqlite.org/src/file?ci=trunk&name=ext/wasm/GNUmakefile&ln=127

I was curious if:

This is a limitation of wasm and that SQLite WASM will never support concurrent access
Or a multithreaded WASM build of SQLite will eventually be supported

In the meantime I'll give updating the build options a spin and see how much trouble I get into trying to do concurrent reads from sqlite in the browser.

(2) By Stephan Beal (stephan) on 2022-10-30 14:26:38 in reply to 1 [link] [source]

This is a limitation of wasm and that SQLite WASM will never support concurrent access

That's the short and the long of it, yes. It's not currently possible, in JS/WASM, to create a db handle which is accessible via multiple threads at once. Just yesterday i removed sqlite3_interrupt() from the WASM build because JS's execution model makes it useless there.

There are folks working on threading support in wasm, but none of it is standard (or even widespread) yet. Until it is, it doesn't seem like a great use of development time to play with it. We still have a number of significant wasm-related TODOs before we get to that point (not the least of which is getting it running, at least in a minimal form, under a build chain other than Emscripten).

Or a multithreaded WASM build of SQLite will eventually be supported

We can't make any promises in that regard. We're currently targeting only browsers and focusing on the JS side, as opposed to the low-level WASM side, and JS is single-threaded. As our own wasm skills mature, we will almost certainly branch out into helping support folks using sqlite3.wasm in non-browser and non-JS environments. Wasm is, however, still relatively new to us, and the full scope of directions we "could" go with it is (A) not clear and (B) expanding all the time as the various wasm runtimes and support libraries mature.

In the meantime I'll give updating the build options a spin and see how much trouble I get into trying to do concurrent reads from sqlite in the browser.

You won't be able to get multi-threaded access to any given (sqlite3*) handle via JS because its threading and runtime model simply doesn't allow for it. That object lives in a single thread and is in no way accessible (using standard technologies, anyway) to other threads.

You can "kinda" get concurrent access via setTimeout() and such, but you might end up deadlocking yourself if you do. We haven't yet experimented with setInterval()-based access to a db, for the simple reason that no use case so far has called for it.

You can get multi-threaded access to a single db file (as opposed to db handle), but it requires loading a separate copy of sqlite3.wasm/js in as many Workers and hosting the db in OPFS, as that's the only persistent storage backend which is accessible to workers (and OPFS is not available in the main thread). As coincidence would have it, a section about that was added to the docs earlier today:

https://sqlite.org/wasm/doc/trunk/persistence.md#xthread

(3) By mlaw (tantaman) on 2022-10-30 18:21:51 in reply to 2 [link] [source]

Thanks for the thorough answers.

You won't be able to get multi-threaded access to any given (sqlite3*) handle via JS because its threading and runtime model simply doesn't allow for it. That object lives in a single thread and is in no way accessible (using standard technologies, anyway) to other threads.

I did some research on the topic in the meantime. This article on web.dev, using webassembly threads from c, gave me the impression that a WASM module in the browser can spawn new WebWorkers whenever it needs a new thread and that emscripten supports this via their pthread option(s). The workers communicate by using the same memory via a SharedArrayBuffer -- https://web.dev/webassembly-threads/#sharedarraybuffer.

Can you help fill me in on what I'm missing that precludes sqlite from making use of this feature to hand out sqlite3* handles that each run in their own thread?

I guess this is pretty similar to the last option you describe except that you wouldn't need to load a copy of the sqlite3.wasm per handle. It'd look more like the main thread asking the worker (which is running sqlite) for a new connection and the worker providing a dedicated thread for that connection. I suppose its a bit weird in that the execution model would look like:

messages are posted from main -> worker, worker shuttles those messages to the right connection, query is finally in a separate thread.

(4) By Stephan Beal (stephan) on 2022-10-30 19:36:34 in reply to 3 [link] [source]

I did some research on the topic in the meantime. This article on web.dev, using webassembly threads from c, gave me the impression that a WASM module in the browser can spawn new WebWorkers whenever it needs a new thread and that emscripten supports this via their pthread option(s).

"Yes, but..." It's not portable (doesn't work on a Raspberry Pi, for example, whereas the rest of our wasm/js stuff does), not yet standard, and i've had a handful of problems using Emscripten builds which enable threads, up to and including crashes when loading modules:

https://github.com/emscripten-core/emscripten/issues/17951

The workers communicate by using the same memory via a SharedArrayBuffer -- https://web.dev/webassembly-threads/#sharedarraybuffer.

"Yes, but..." the devil is in the details. Communicating via SAB requires storing the data, in raw byte form, in that buffer on one end of the connection, communicating (via the JS Atomics API) to the other end that there is data pending, and having the other end deserialize and process it. Even so, you can only have one thing actually running in each thread, so you cannot multi-thread a single db connection. The db connection lives in one endpoint of the SAB/Atomics construct, and everyone talking to it is running in their own threads. They can postMessage() all they want to the db-holding thread, but that db is running in a single thread, and so long as it is running code, no other thread can preempt it and all postMessage() messages will wait in a black-box queue. It may provide an illusion of multi-threading, but the db code all happens in a single thread.

The postMessage() queue has its own problems: when you start posting messages to it, it may queue up any number of them before starting to execute the first one. If, say, the 3rd message hits some sort of fatal error, you might still have 13 more messages queued up and there is literally no way to cancel them. An SAB/Atomics construct can be used to bypass postMessage() altogether, but in a more limited form, namely in that the client code becomes responsible for the encoding and decoding of all data in the SAB on either end of the connection. It is possible (but slow) to use JSON.stringify() and JSON.parse() for that purpose, but it's not possible to know how big those data are going to be in advance, whereas an SAB is sized when it's created and cannot grow.

Can you help fill me in on what I'm missing that precludes sqlite from making use of this feature to hand out sqlite3* handles that each run in their own thread?

There's no technical issue with separate instances in separate threads except that each worker is a separate thread and each worker which has to have an sqlite3* of its own must load its own, 100% independent copy of the wasm module and all associated JS code. That's many tens of megs of RAM for a single db handle, which seems... somewhat excessive. There is no way to create an sqlite3* handle in one JS thread and then transfer it and access it from another JS thread. That's simply not something JS threads can do because each thread is its own complete JS sandbox. Yes, you can pass the pointer across threads, but the pointer is just a number: without the memory and other state backing it in the wasm runtime (which is unique per sandbox), it's meaningless.

What you can't yet do in a standardized (and stable) way, to the best of my admittedly very limited knowledge, is C-style multi-threading. i have not researched the cutting-edge threading options for wasm because until they're standardized and widespread they're unlikely to be a development target for us.

It'd look more like the main thread asking the worker (which is running sqlite) for a new connection and the worker providing a dedicated thread for that connection.

A worker (or the main thread - it doesn't matter) can spawn as many workers as they want, and each worker can start up as many db connections as they want, but each worker can only run one stack of code at a time and any given db connection is only visible to the worker which spawned it. Each such worker, however, requires a completely independent copy of the wasm module and all associated JS code, running in its own memory space, and that's expensive.

messages are posted from main -> worker, worker shuttles those messages to the right connection, query is finally in a separate thread.

A query and its results can be shuttled around to your heart's content, but the db connection cannot: it is born, lives, and dies within a single JS thread and has no way to escape that. While a db connection's owning JS thread is running a query, it cannot run anything else (which is why you cannot sqlite3_interrupt() it: a request to sqlite3_interrupt(), e.g. via postMessage(), is blocked until the thread stops running other code, at which point the interrupt request is useless).

Perhaps (perhaps) all of this is irrelevant with the up-and-coming wasm pthreads support, but that's so far down the list of TODOs/To-Investigates as to be irrelevant at this point. We must crawl before we can walk, and walk before we can run.

(5) By mlaw (tantaman) on 2022-10-31 00:04:49 in reply to 4 [link] [source]

Just to clarify -- it was never my intention to try to access the same connection from many threads. That seems like a bad design choice no matter the environment, nor would I have expected that to be possible in JS.

The idea was to spawn one connection per thread/worker, allowing independent workers to query the underlying db concurrently. Similar to what you'd do when using SQLite in Python, Java, etc. The main thread would treat the workers as a pool and load balance between them as it needed to issue queries.

I think the problem for JS finally clicked for me which I'll characterize as such:

sqlite, itself, doesn't do any thread management (other than being safe to access from many threads when compiled as such). It is up to the client to create connections in different threads if that client wants concurrent access to the db.
given the constraints of JS workers, (1) just isn't possible as it is in other languages. As you said, each worker needs an independent copy of the library completely isolated from the others.

I suppose it might be possible to do this pooling in native code which proxies sqlite then compile that bundle with emscripten's pthread support. Of course taking into consideration the earlier warnings about Emscripten builds which enable threads.

on a different note -- developers writing applications against the sqlite worker interfaces will likely hit a number of issues once they start trying to use transactions. The problem being that each operation is async so there are plenty of opportunities for other code to get in there in the middle of a transaction and try to start a new one.

E.g.,

await promiser('exec', {sql: 'begin' });
await promise('exec', {sql: '...'});
await promise('exec', {sql: 'commit'});

Given each statement yields control back to the browser's event loop, a user event could be processed at any time between statements, potentially kicking off a new transaction and resulting in: cannot start a transaction within a transaction.

I've worked out a few techniques to resolve this issue if you're interested or if you have other ideas I'd love to hear.

(6) By Stephan Beal (stephan) on 2022-10-31 07:52:08 in reply to 5 [link] [source]

The idea was to spawn one connection per thread/worker, allowing independent workers to query the underlying db concurrently.

Only with the illusion of "concurrently". The thread holding the db can only execute one thing at a time, so the requests will all queue up. It might have 3 or 10 or 17 handles in its pool, but only 1 can actually do anything at a time. JS is not going to be a good platform for highly-concurrent databases.

I suppose it might be possible to do this pooling in native code which proxies sqlite then compile that bundle with emscripten's pthread support. Of course taking into consideration the earlier warnings about Emscripten builds which enable threads.

That's my current take on solutions for that sort of problem, but that direction of development/exploration is nowhere on the current radar (it's all about JS bindings to the as-is C APIs right now).

Sidebar: it's my fervent hope to get us out from under Emscripten's grip at some point, not because "Emscripten is bad" or anything of the sort, but because being locked into a single toolchain causes me to lose sleep. The hard truth is, however, that Emscripten is currently leaps and bounds ahead of all other toolchains and getting completely out from under it might not even be realistic. i've gone way out of my way to eliminate all JS code-level dependencies on Emscripten and structure the various JS files so that they can readily be rebuilt for use with other toolchains. The library has to be "bootstrapped" with a small amount of config info, primarily to feed it state like the wasm heap memory and wasm exports, but it doesn't care where those come from.

The problem being that each operation is async so there are plenty of opportunities for other code to get in there in the middle of a transaction and try to start a new one.

That is absolutely true.

I've worked out a few techniques to resolve this issue if you're interested or if you have other ideas I'd love to hear.

i am most definitely interested in that and will study that code.

In the OO API we have methods transaction() and savepoint(), both of which take a callback argument and run that callback inside of a transaction/savepoint. If the function throws, the transaction is rolled back, else it's committed. i don't see a way to reliably do such a thing in the current Worker/Promiser interfaces unless all relevant SQL code is packed into the same message (which is easy to do but may also limit the app flow considerably (such is the price of async usage)).

According to this article, the order of promise execution is well defined. Perhaps (perhaps) we can use that to ensure that transaction-grouped SQL (submitted via a single client-side promise chain) is executed in the correct order. That's something to investigate, but would still have limitations such as not being able to run nested queries during the transaction.

In the current API, transactions via workers will only work properly if the (single) user of the worker does not interleave transactioned code with non-transaction code. That seems like a very reasonable limitation to me: when we write sqlite in C we start a transaction, run its code, and end the transaction. We don't jump in and out of the transaction from a single thread. Frankly, it never occurred to me that there might be more than one listener because the worker/listener interface does not permit more than one listener per worker.

However... it would be possible if we invert the load order of the relevant JS, so something like:

app.js loads sqlite3.js into its own thread
app.js starts N workers who want to work with the db(s) via postMessage()
app.js coordinates that communication, making sure that all results go to the proper listener.

We don't yet have an API for that style of communication, probably because it hadn't come up until now. That's something which will need looking into as time allows.

(7) By anonymous on 2022-10-31 09:00:14 in reply to 6 [link] [source]

Considering all these limitations, is the effort spent on WASM really worth it?

(8) By Stephan Beal (stephan) on 2022-10-31 09:09:04 in reply to 7 [link] [source]

Considering all these limitations, is the effort spent on WASM really worth it?

These are limitations of multi-threaded use brought on by friction between how JS and C model threads. Applications which make single-thread use of a db are not hindered by this. To the best of my knowledge, most sqlite apps are single-threaded.

Folks who want to make it difficult on themselves and try to introduce complex cross-thread communication scenarios can certainly do so at their own peril. Folks who take an approach to working with a db which is better suited to the limitations of the hosting environment will find it easy to work with.