Questions about running SQL queries in parallel for performance

(1.1) By Simon Willison (simonw) on 2022-04-28 03:45:54 edited from 1.0 [source]

I'm building a Python web application which executes many SQL queries on a single page. Some of these queries might be quite long - a count(*) on a million row table, plus several group by / count queries against columns on that table.

Since these queries are not dependent on each other's results, I've been exploring running them in parallel.

I've got something working, but I'm not seeing the performance improvement I was hoping for. My page is taking approximately the same time to render with queries in parallel as it does in serial. I've confirmed that the parallel queries are executing in a way that overlaps each other.

I'm using a pool of threads (three by default) in Python for this parallel execution. Each thread has a dedicated connection to the same SQLite database file.

Before I start learning to run C profilers against my Python code, I thought I’d check here to see if there were any larger important concepts I’m missing.

My fundamental questions are the following:

Assuming a multi-core machine, should I expect a page load performance gain from running SQL queries in parallel against the same SQLite database file rather than in series, from within the same process?
Are there any specific gotchas I should look out for with regards to using multiple connections for this? Is there a reason I should try to run these parallel queries against a single connection instead?
Are there any additional known limitations I should be aware of with regards to attempting this using the sqlite3 module in the Python standard library?

Really I want to check that what I'm trying to achieve is feasible before spending even more time on this.

You can read my full notes about my progress on this so far in this blog entry and this ongoing GitHub issues thread.

(2) By anonymous on 2022-04-28 02:09:03 in reply to 1.0 [link] [source]

Python has a GIL (global interpreter lock) which means that while the threads are running concurrently, they are not running in parallel. You can check any cpu monitoring too to verify that this is indeed the cases, you should see most activity happening at one core at a time regardless of how many threads you use to do the work.

(3.1) By Simon Willison (simonw) on 2022-04-28 02:48:37 edited from 3.0 in reply to 2 [link] [source]

The sqlite3 Python module releases the GIL here: https://github.com/python/cpython/blob/f348154c8f8a9c254503306c59d6779d4d09b3a9/Modules/_sqlite/cursor.c#L749-L759

My current assumption is that this means that the SQLite code can run concurrently across multiple CPUs, even when the Python code that wraps it cannot. But maybe I'm misunderstanding how the GIL works?

(4) By SeverKetor on 2022-04-28 03:14:50 in reply to 3.1 [link] [source]

My current assumption is that this means that the SQLite code can run concurrently across multiple CPUs, even when the Python code that wraps it cannot.

That's correct. With the right setup you can max out all CPU threads like this, instead of being limited to one thread. You can test yourself by devising some very slow queries and running them in different Python threads. They'll complete obviously faster than if they were serialized.

(5.1) By Simon Willison (simonw) on 2022-04-28 03:44:50 edited from 5.0 in reply to 4 [link] [source]

That's reassuring! It sounds like my quest to run parallel SQL queries in my Python web application while taking advantage of multiple CPU cores isn't a completely wild goose chase then.

Do you have any further hints in what "the right setup" might look like?

(6) By Keith Medcalf (kmedcalf) on 2022-04-28 03:49:24 in reply to 3.1 [link] [source]

This is true, but the concurrency is limited to the execution which occurs with the GIL released (that is, in the native C sqlite3 library itself). Each row (for example) can be retrieved in parallel but "constructing the python return objects for each row" will be serialized (by the GIL).

That is to say that if your have two python threads each with their own connection, and each one is performing a select that returns 1,000,000 rows (lets say that is 25% of the candidates for each select) then the difference in execution time between executing two python threads in parallel vs a single serial thead will not be much different (if even detectable at all). In fact it is possible that the multiple-threaded version takes longer to run both queries to completion because of the increased contention over a shared resource (the GIL).

(7) By Simon Willison (simonw) on 2022-04-28 04:11:48 in reply to 6 [link] [source]

That sounds like it might explain why I'm not seeing any improvements in my code that executes in parallel.

I'm going to guess that the GIL-free SQLite code executes so fast that the GIL-encumbered Python code that does things like assembling the Row objects takes up most of the time.

I should do some experiments with SQL queries that I know take a long time in SQLite terms - maybe some expensive aggregations - and see if I can spot an improvement with those.

As for my Python code, maybe I should look into running one Python process per CPU core and see if I can balance SQL queries across those instead.

(8) By RandomCoder on 2022-04-28 04:45:20 in reply to 7 [link] [source]

As for my Python code, maybe I should look into running one Python process per CPU core and see if I can balance SQL queries across those instead.

You might consider looking into using Python's multiprocessing module instead of threading. It does exactly this with multiple Python processes, along with some helpers to exchange data between the processes in various ways.

(9) By Simon Slavin (slavin) on 2022-04-28 05:01:19 in reply to 1.1 [link] [source]

I am thankful that those familiar with Python internals have handled the Python side of this, and will tackle a different aspect.

SQLite is generally input/output bound. Or you might call it storage bound. The processing done inside the SQLite API is efficient. The way data is packed into a row and written to disk is efficient. That leaves the other thing a DBMS does during 'many SQL queries': reading from storage.

If you do have three python threads processing queries at the same time, they're all reading from storage at the same time. There's a chance they're fighting for access to the same file on the same drive, or for whatever hardware is between the drive and memory. So you have three threads running but most of the time two of them are just waiting for data.

You can test for this a few ways. One is to try whatever you're doing now on a drive which is far faster or slower than the one you're using. Perhaps plug in an old slow external drive, duplicate your file onto that, and tell your test program to address that one instead.

If using a drive which is half the speed makes your program take twice as long, your program is storage bound: all the time it takes is wrapped up in how fast it can access its data. If using a drive which is half the speed makes your program take only a little longer, your program is processing-bound: it's spending most of its time doing processing, and the speed of the drive doesn't matter much.

You apparently have a simple way to specify whether your program can run the queries in parallel. It should be possible to set up a test grid of parallel vs. serial, slow vs. fast storage, and identifiy which change governs your overall speed.

(10) By JayKreibich (jkreibich) on 2022-04-28 05:31:50 in reply to 9 [link] [source]

Yes. If you don't have enough RAM to basically keep the whole database in kernel file-system cache at all times, this whole exercise doesn't really matter. If you're trying to get concurrent performance out of something as simple as a count(*), you're I/O bound. Having parallel queries trying to touching different parts of the database file will actually slow everything down, even on an SSD, making overall performance worse.

Regardless of C vs Python, the Python GIL, and all the other issues, the whole exercise depends on the database data live in memory. That can be a page cache, or the file system cache, but it can't be storage devices.

(11) By Donal Fellows (dkfellows) on 2022-04-28 13:34:46 in reply to 1.1 [link] [source]

Some of these queries might be quite long - a count(*) on a million row table, plus several group by / count queries against columns on that table.

As has been pointed out, you're more likely to be I/O-bound than CPU-bound for this sort of thing (and you already report that the naïve approach isn't helping much). You probably ought to inspect whether you have useful indices set up, as they can make a profound performance difference on tables of that scale. The only time you want to see full table scans is really when you intend to actually do something with every row.

The only times I've hit real performance problems so far have been:

When doing a very write-heavy workflow. (In my case, I turned of sync calls except during finalization, as the application didn't care about a half-written database, and collected the writes in a single thread that was fed by queues from the rest of the code. And I stopped trying to concatenate BLOBs; don't do that.)
When I'd put a custom function in... that happened to end up being implemented by the (deliberately slow) bcrypt() crypto function. Oh well!

(12) By Simon Willison (simonw) on 2022-04-28 18:48:48 in reply to 11 [link] [source]

I'm building software for other people to use, so while I can give them tips on adding indexes I'm interested in getting the best possible performance out of poorly optimized tables as well.

(13.1) By Simon Willison (simonw) on 2022-04-28 19:11:31 edited from 13.0 in reply to 10 [link] [source]

The software I'm writing is often used against databases that are less than 100MB in size - so presumably the entire database file can easily end up in the kernel file-system cache.

The databases also tend to be read-only - does that mean they're more likely to end up in the file-system cache, since they're not accepting any writes?

If there IS a performance improvement to parallel queries against a database that has been entirely pulled into the file-system cache then I'm very interested in taking advantage of it. Many production servers these days have multiple GB of RAM (AWS will rent you a server with more than a TB) so I like the idea that I can tell my users that if they want a big performance boost they can throw more RAM at the problem.

I'm not yet sure how to maximize the performance impact of additional RAM on SQLite though - that's a whole other set of performance questions I need to ask and answer!

(14) By Simon Willison (simonw) on 2022-04-28 19:10:59 in reply to 9 [link] [source]

I've wondered if multiple processes accessing the same file could have a performance impact.

Since my software is often used against read-only database files that never change, is it likely that I'd see any benefits from creating three copies of the same database file and opening connections against the copies? Could that enable more parallel queries?

(15) By Simon Willison (simonw) on 2022-04-28 19:13:22 in reply to 9 [link] [source]

Maybe I should investigate the performance benefits of copying the entire database into a :memory: SQLite database on server startup, then serving queries from that.

(16) By Simon Willison (simonw) on 2022-04-28 22:04:51 in reply to 15 [link] [source]

I just wrote a plugin that does that: https://github.com/simonw/datasette-copy-to-memory - demo here, where https://latest-with-plugins.datasette.io/fixtures_memory is an in-memory copy of https://latest-with-plugins.datasette.io/fixtures

(17) By Keith Medcalf (kmedcalf) on 2022-04-28 22:20:59 in reply to 8 [link] [source]

This would work if the subprocess "built" the entire result rowset in that subprocess and then 'folded' the entire resultset back into the main process as a single-shot (that is the subprocess would have to process the entire query, collect up all the results, and return them all at once as a single object).

(18.1) By Simon Willison (simonw) on 2022-04-30 21:29:45 edited from 18.0 in reply to 7 [link] [source]

In an exciting twist: I proved to myself that this WAS a problem with the GIL, by testing it using the extremely promising https://github.com/colesbury/nogil fork of Python that removes the GIL.

Results of my tests are here - short version is that using that fork of Python caused my parallel query version to soundly beat my serial query version: https://github.com/simonw/datasette/issues/1727#issuecomment-1112889800

(19) By Simon Slavin (slavin) on 2022-05-01 13:52:56 in reply to 16 [link] [source]

I am curious to know whether your experiments reveal anything about what does or does not increase speed in your situation. For example, does multiprocessing speed things up ? Does caching the database in memory speed things up ?

(20.1) By Simon Willison (simonw) on 2022-05-01 22:44:49 edited from 20.0 in reply to 19 [link] [source]

So far the biggest result I've got is that using the nogil Python branch from https://github.com/colesbury/nogil (via the Docker image) gave me a notable performance increase for my parallel queries running in separate threads.

This is going to be a long-running research process for me. If you're interested, I suggest subscribing to this issue thread: https://github.com/simonw/datasette/issues/1727 - though I'm going to take some time off this research to work on different things for a while, so it may not be very active for a bit.