Will sqlite serialise queries from multiple threads?
(1) By anonymous on 2022-08-19 11:14:38 [link] [source]
I understand that "threads are evil". However, I'm running on Windows, where setting up multiple processes is far harder, so at least for now I'm using them. Sorry :-)
My application is running many queries in parallel, on different threads. Each query takes a non-trivial amount of time. My experiments suggest that sqlite is only running one query at a time - is that correct? The FAQ says that the engine uses mutexes to "protect essential data structures" - does that include structures needed for read-only queries, and is it implemented in a way that will serialise the whole query?
Putting this another way, is there any way that I can set up a threaded application to run multiple queries in different threads, and have those queries run in parallel? Or do I have to use multiple processes to get that behaviour?
I'm using Python, so my access to the DB engine is essentially via SQL - I can't easily call C-level APIs unless they are exposed by Python. But I am willing to dig around to try to find ways of configuring things via the Python API.
(2) By Simon Slavin (slavin) on 2022-08-19 12:55:25 in reply to 1 [link] [source]
Are all threads using the same database connection? Or does each query make its own connection?
If you execute

PRAGMA journal_mode;

on the database, what do you get? If it's not WAL, then execute

PRAGMA journal_mode=WAL;

on the database, just once (the change is saved in the file), then try running your program again. If it makes things worse, set it back to the default mode, DELETE.
You can do all the above using the SQLite command-line tool. You don't have to write a little program to do it.
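For the Python case, the same check can also be done from the sqlite3 module. A minimal sketch, assuming a file-based database (the path here is made up; substitute your real file):

```python
import os
import sqlite3
import tempfile

# Hypothetical database path -- substitute your real file.
db_path = os.path.join(tempfile.mkdtemp(), "app.db")

conn = sqlite3.connect(db_path)

# What mode is the database in now? A fresh file database reports "delete".
(mode,) = conn.execute("PRAGMA journal_mode;").fetchone()

# Switch to WAL once; the setting is persisted in the database file,
# so subsequent connections see it automatically.
(new_mode,) = conn.execute("PRAGMA journal_mode=WAL;").fetchone()
conn.close()
```

Note that WAL cannot be enabled on an in-memory database; `PRAGMA journal_mode=WAL` on `:memory:` just reports "memory".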
(4) By Paul Moore (pf_moore) on 2022-08-19 17:00:40 in reply to 2 [link] [source]
[OP here] Yes, I'm in WAL mode. I share one connection across all threads. I'll try one connection per thread, but I'm using a thread pool so that'll need a little more work to test. Thanks for the suggestion.
(6) By Keith Medcalf (kmedcalf) on 2022-08-19 18:27:27 in reply to 4 [link] [source]
Each SQLite3 connection is serially re-entrant on different threads.
That means that a connection (and everything derived from that connection) can be called on any thread - serially, one after another, no more than one thread at a time per connection.
This is because there is "state" associated with the connection, and a mutex around that state is required to ensure it is not changed by multiple threads at once. Therefore one thread is permitted to execute, and any others wait "at the door to the connection" for their turn to serially enter the connection and do their thing.
You need one connection per thread (and each thread shall only deal with its own connection and the descendants of its own connection). Isolation is between connections, not between children of a connection (such as statements). There is no isolation and no concurrency within a single connection's descendants.
(11) By Paul Moore (pf_moore) on 2022-08-20 09:19:25 in reply to 6 [link] [source]
Thanks, that helps clarify a lot.
I extended my experiment - I ran 20 threads, all running a select query that takes about 1 second. Running with a single connection takes around 26 seconds, which seems about right for fully serialised queries.
Running with 1 connection per thread, the total time dropped to around 13 seconds. That confirms that the threads are now running in parallel, but the saving isn't as good as I'd hoped. I have 16 logical processors, so while getting 1/16 of the time wasn't likely, better than 1/2 seemed like a reasonable expectation. But the bottleneck is now likely elsewhere (maybe Python's GIL - there's still a fair amount of Python code in those threads that might be getting serialised). I'll need to do more investigation.
(12) By Donal Fellows (dkfellows) on 2022-08-22 09:09:17 in reply to 11 [link] [source]
The GIL would be highly suspect in this case. I was coincidentally reading Simon Willison on this very topic at the weekend, and it appears that the GIL can be quite a problem. There is apparently a nogil fork of Python that might help you; I don't know how much additional work would be required for you to take advantage of it.
(13) By Paul Moore (pf_moore) on 2022-08-22 19:57:26 in reply to 12 [link] [source]
Sigh. I'd seen Simon's article, but forgotten about it. Thanks for the reminder. I don't really want to get into needing the nogil fork, so I think I'll look at restructuring my code to avoid the need for highly parallel queries. I can probably load the data I need into memory with a single query at the start, and then check and filter in my application instead. That might help...
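The restructuring described here - one up-front query, then checking and filtering in the application - might look something like this (the table and filter are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items(name TEXT, size INTEGER)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [("a", 1), ("b", 5), ("c", 9)])

# One query loads everything up front...
rows = conn.execute("SELECT name, size FROM items").fetchall()
conn.close()

# ...and each subsequent "check" is a plain in-memory filter,
# with no database round-trip and no connection contention.
big = [name for name, size in rows if size > 4]
```

Whether this wins depends on the data fitting comfortably in memory; the GIL still serialises the Python-level filtering, but the per-query overhead disappears.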
(3) By anonymous on 2022-08-19 13:43:47 in reply to 1 [link] [source]
There are possibly multiple reasons why your queries are running serially, or appear to be running serially. In a multi-threaded application, one of those reasons could be the locking around memory-stats collection, which is something you can disable when you compile SQLite, as the docs note:
Memory allocation statistics are enabled by default unless SQLite is compiled with SQLITE_DEFAULT_MEMSTATUS=0 in which case memory allocation statistics are disabled by default.
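SQLITE_DEFAULT_MEMSTATUS is a compile-time option, so from Python you can only inspect whether the library you are linked against was built with it - for example via PRAGMA compile_options. A sketch (in most stock builds the option simply will not be listed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each row is one compile-time option string, e.g. "THREADSAFE=1".
opts = [row[0] for row in conn.execute("PRAGMA compile_options;")]
conn.close()

# True only if the build was compiled with DEFAULT_MEMSTATUS set.
memstatus_disabled = "DEFAULT_MEMSTATUS=0" in opts
```

If DEFAULT_MEMSTATUS=0 does not appear, the memory-statistics mutex is compiled in; turning it off means rebuilding SQLite (or, in C, calling sqlite3_config(SQLITE_CONFIG_MEMSTATUS, 0) before initialization, which the stock sqlite3 module does not expose).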
(5) By anonymous on 2022-08-19 18:22:24 in reply to 1 [link] [source]
Since you're using Python, this thread is likely to be relevant.
(7) By Ryan Smith (cuz) on 2022-08-19 19:10:32 in reply to 1 [source]
I understand that "threads are evil". However....
That's just a defeatist rumour spread by lesser programmers. Pay it no mind.
It's not evil, just dangerous - much like welding: an indispensable tool for getting anything metal built, but also quite capable of hurting you. Ensure everything is properly grounded, wear your gloves and darkening goggles, and you'll find it perfectly safe.
In SQLite, threading is safe within the rules. Part of those rules is that you must use a connection per thread and either leave serialization in the hands of SQLite, or do it yourself properly - otherwise everything will explode.
I'm sure you don't need more nudging than that, and others have already touched on the important notions around using SQLite in a thread-safe way, so I will just add this: absolutely test under full load, mimicking as closely as possible the real-life concurrency and operating conditions you will face. With multi-threading setups, you cannot skimp on testing.
(8.1) By Warren Young (wyoung) on 2022-08-19 19:19:41 edited from 8.0 in reply to 7 [link] [source]
lesser programmers
You do realize you're tossing drh and a Berkeley CS professor into that bucket, right?
perfectly safe.
Nondeterminism is never "perfectly safe." It rules out vast classes of formal methods, because it removes the software from the realm of mathematics. At that point, you have whole catalogs of mismatches between human expectations and what the computer actually does.
(9) By Ryan Smith (cuz) on 2022-08-19 21:24:50 in reply to 8.1 [link] [source]
You do realize you're tossing drh and a Berkeley CS professor into that bucket, right?
No, I did not realize that, but now that I do, I'm not buying it 100%. I can't speak for the Berkeley professor, but I have it on good authority that SQLite itself uses multi-threading to accomplish internal tasks, such as sorting. How does one reconcile that with someone using the very thing that is professed to be evil?
If it truly was evil as a base truth, we'd see less and less of it, not more and more like we actually do.
I have a pet theory about it - if you wish to hear, here's my honest 2c:
I believe that the only reason programmers are averse to multi-threading is that we don't teach it at ground level. Chances are everyone reading this wrote their first lines of code in a neatly fenced-off single-process, single-threaded island, and ditto for the next piece of code, and so on and so forth until one day they were forced to consider multi-threading for some extraneous reason, perhaps parallelism or speed gains demanded of their project, perhaps using tools that require it, etc. - or perhaps that did not happen yet for the reader.
Imagine if it were just a normal thing from day one? We would understand the patterns the same way we understand loops, variables and the other most basic concepts now. That skill would in turn enhance everyone's ability to write and understand code with all the intricacies of multi-threading, to the point where you are no longer surprised by the typical (and atypical) bugs that earned it the current "evil" label.
Put another way, if everyone became more proficient at making multi-threaded things, the typical multi-threading headaches and bugs would slowly fade away.
I think the world of multi-threading would look very different if programmers were simply told, from day 1, that this is just the way things work: that code is not a single set of instructions on an island, but a group of instructions in a larger traffic system where other traffic also moves about and can bump into you if you're not careful; that you must obey semaphores (traffic lights, perhaps) and keep code re-entrant, knowing when to serialize and when not to; and that variables are somewhat state-machine-like, with a clear difference between the memory that belongs only to you and the memory you share with other threads and processes, plus the rules for touching each. We do that when teaching people to drive - they can scarcely keep a car going straight when we shove them into traffic and teach them to obey road signs, drive on only one side, and respect other traffic. We do it because it is necessary to survive on the road.
If multi-threading were necessary to survive making computer algorithms, it would be the norm. It is, however, easy to achieve 95% of all programming tasks without ever touching multiple threads - and that is the real evil.
If multi-threading were simply "the way", then everyone would just be OK with it, and well-optimized single-threaded algorithms would be a good tool in the toolbox instead of the single-minded "way-of-the-world" that most people are caught in now.
Aside rant:
I'm also annoyed thinking back on how I myself always avoided multi-threading anything, slinking back to single-threaded safety, even making excuses for why single-threading was "really the best way" - all because of the way I was bitten the first few times I tried making multi-threaded things. Those weren't particularly nasty bugs - they were the typical MT bugs that most people with any multi-threading experience would spot in a heartbeat - but it was very UN-intuitive to me and I couldn't immediately wrap my head around it, which made me react like most programmers in the same situation: call it "evil" and move on, refactoring to single-threaded ways of doing things. When finally forced to go multi-threaded, it was such a paradigm shift. Now it just saddens me that I didn't learn it much earlier. It's not crazy hard at all, just a bit foreign when you are used to thinking on an island.
I like to compare multi-threading to trigonometry. It's a bit weird when you learn it the first time, and unintuitive and very different to normal algebra, but once you grasp it, it's not only perfectly easy, but opens up worlds of easy solutions for a specific class of problem that range from very hard to impossible to figure out using base algebra.
Unfortunately (or fortunately?) near every project I am involved with is heavily multi-threading dependent, and while it's all good and dandy that I'm no longer daunted by it, new programmers that join, even seasoned ones, have a steep and extended learning curve before becoming useful to the project - and I just feel that shouldn't be the case. /Aside
We should honestly teach it in the very first part of programming curricula, and perhaps stop telling everyone that it is evil but adopt a more "you can do it!" attitude.
(10.1) By Warren Young (wyoung) on 2022-08-19 23:37:15 edited from 10.0 in reply to 9 [link] [source]
SQLite itself uses multi-threading to accomplish internal tasks, such as sorting
You can't use low-level software written by full-time experts with ridiculous amortization schedules to justify use of threads by individual code-slingers.
Threads in DBMS engines? Fine; they're run by millions, tested to a ridiculous degree, and very carefully enhanced.
Threads in end-user database applications? Not fine unless you've got the development budget to squish out all the bugs you buy by doing that. Your app probably isn't run by millions, it probably doesn't have anywhere near the test code coverage of SQLite, and it's most likely enhanced on a schedule set by your manager, according to non-technical requirements.
It's evil to spend 10-100x the necessary development cost merely because you've reached for a too-low-level development construct.
If it truly was evil as a base truth, we'd see less and less of it, not more and more like we actually do.
We're seeing more and more use of threads because Moore's "Law" - really a business rule - ran out decades ago, and we've been limping by on ever-higher piles of wacked-out abstractions that try to present every computer system — no matter how complicated — as a really really fast PDP-11, when that's nowhere near the truth and getting further from it all the time.
Thus all these speculative execution hacks, thus multi-core, thus cloud architectures.
We wouldn't even need fancy multiprocessing methods in 99% of cases if we'd stop writing performance-critical software in languages like Python, which impose a 50x slowdown on your program. It'd run fast enough on a single core.
We would understand the patterns the same way we understand loops, variables and the other most basic concepts now.
Ridiculous. If that were true:

- Your average book on multithreaded programming wouldn't be as long as K&R, which covers loops, variables, and all the other basic concepts…but not threads.
- The exceedingly common and seemingly primitive — nay, even atomic — programming language construct "++i" wouldn't contain a data race condition.
- The top search result on Amazon in the CS book category for "threads" wouldn't be "Seven Concurrency Models in Seven Weeks: When Threads Unravel."
- We wouldn't have Erlang ("processes", the actor model), Go (goroutines), .NET async/await, JavaScript promises, and on and on. The world is full of good alternatives to threads. I expect all of them use low-level OS/CPU threads underneath, but that just takes us back to the testing and amortization matter above. The companies behind these languages can afford to take that development, testing, and maintenance burden on, so their users do not have to. The costs amortize out to near zero at the edge.
Let the pros write threaded code; I shall not, unless forced.
…if programmers simply were told that multi-threading is just the way it works, from day 1…
…then we'd have an awful mess.
And if we look around, we have…an awful mess. 😛
a larger traffic system
That's an application of control theory, which can only be applied to nondeterministic systems within certain performance bounds. You can set rules like, "Two vehicles cannot occupy the same space," and "Vehicles can neither accelerate nor decelerate instantly," and "Vehicles are no more than 8½ feet wide and 60 feet long." With a suitable array of such facts, you can build a fairly accurate traffic simulation.
The problem with computer science is that we've abstracted it so far above the underlying physics that it's a special treat when any of that applies. Can two pieces of code share the same memory space? Sure, that's mmap(). Can programs start and stop "instantly"? Sure, that's multiprocessing. Can we define the "shape" of a program so as to set the amount of "lane space" it takes, to extend the analogy? Heck no; the size and shape of most programs are dynamic.
SQLite fails to match up with any of these analogies. How, then, can we model it as a traffic flow system, when the underlying mathematics are inapplicable?
Now realize that SQLite is unusually testable. It's a well-defined language of the old input-process-output school, where known inputs translate to the same fixed outputs every time, modulo fine details like rand().
Contrast a GUI app that must react to pseudorandom mouse events, uncontrollable network I/O patterns, interference from other programs via the local shared persistence medium of choice, and highly unpredictable scheduling delays. How can you possibly prove that you've tested all the code paths through it? As far as I can tell, the list of tasks toward that goal begins, "First, we need to simulate the world."
I like to compare multi-threading to trigonometry
Trigonometry is based on a small number of axioms, from which you can produce a coherent system of formal proofs, and then in turn new proofs from them.
Computer scientists — promulgators of formal methods — and software developers — pragmatic tool-users — have almost nothing to say to one another. The code slingers almost universally ignore the profs, since they rarely produce anything applicable to the developer's daily tasks.
I'm not saying formal methods have zero application. They do sneak into the pragmatic developer's toolkit from time to time, as with static analysis tools in modern compilers.
What I am saying, though, is that if you ask a software developer to sit down and write a formal proof of correctness for their latest program, chances are excellent that they won't be able to. Even those trained in formal CS will often fail, because once you add an element of nondeterminism, most of the techniques they've studied go out the window. Threads exacerbate this.
(Thus Haskell's IO monad, an attempt to wall off nondeterminism so that at least some of the program can be proven formally correct.)
adopt a more "you can do it!" attitude.
That's got echoes of "everyone should be a programmer" to it, to my ears. A beautiful thought, but not likely to come true, ever.