Tips for getting SQLite under Python to perform better with more RAM?

(1.2) By Simon Willison (simonw) on 2021-07-16 18:41:27 edited from 1.1 [source]

I'm interested in getting better performance out of SQLite on machines with more RAM, using the Python sqlite3 module from the standard library.

(Tips for more CPUs would be useful too - I'm interested in understanding what the benefits of paying for larger EC2 instances are, in addition to maximizing performance against currently available memory.)

I've seen PRAGMA cache_size in the documentation - is this the main lever I should be looking at or are there other options that might help here?

My current plan is to set up a simple load test that exercises my Python + SQLite code across a number of different queries, then modify the cache_size and see what impact that has on requests/second and request durations.

I'd thoroughly appreciate any advice on ways to scale SQLite up like this.

Most of my workloads are read-only (I open the SQLite file with the ?mode=ro or ?immutable=1 flags) so I'm mainly interested in improving performance of SELECTs against read-only databases.

(2) By anonymous on 2021-07-16 18:56:35 in reply to 1.2 [link] [source]

You should try out the mmap pragma, it might be worth it to try out the difference of its effect compared to cache_size, also given that you are using Python, which limits the usage of multiple cores from a single process (due to the GIL) then I would recommend trying to scale the number of processes accessing the database file.

(3.2) By Simon Willison (simonw) on 2021-07-17 17:35:41 edited from 3.1 in reply to 2 [link] [source]

Thanks, I'll take a look at mmap too: https://www.sqlite.org/pragma.html#pragma_mmap_size and https://www.sqlite.org/mmap.html

Interestingly the Python sqlite3 library (which is written in C) releases the GIL when executing queries, which makes me hopeful that SQLite in Python can take advantage of multiple cores even within a single Python process. I haven't figured out a good way to confirm that this is happening yet though!

(4) By anonymous on 2021-07-17 21:22:09 in reply to 3.2 [link] [source]

Nice to learn about SQLite3 and the GIL, in that case you should consider compiling with https://sqlite.org/compile.html#default_memstatus set to off, so that SQLite does not serialize all its memory allocations.

(5) By Keith Medcalf (kmedcalf) on 2021-07-17 22:20:19 in reply to 1.2 [link] [source]

Define "getting better performance", in particular what "performance" is lacking, and why do you believe that you can "getting better" and for what purpose?

(6.1) By Keith Medcalf (kmedcalf) on 2021-07-18 00:41:29 edited from 6.0 in reply to 1.2 [link] [source]

(Tips for more CPUs would be useful too - I'm interested in understanding what the benefits of paying for larger EC2 instances are, in addition to maximizing performance against currently available memory.)

The standard CPython does not "do" multiple simultaneous threads of execution in CPython bytecode, so although CPython bytecode threads of execution may be executed "interleaved", they are never executed concurrently. So adding CPUs (or fake CPUs aka Cores) will not do anything with the Pure Python code. No matter how many "threads" of python bytecode you think you are executing, only ONE will execute at a time and all told cannot consume more than 1 CPU.

Once the Python Bytecode has "transitioned" into "machine space", the code executing on the machine (that is, SQLite3) may use multiple CPUs simultaneously if it is configured to do so (the default). However, except for exceptional circumstances, a single entry executes entirely only on one thread (subsorts may use additional CPU threads, but that is only when constructing an index or processing a GROUP BY or ORDER BY sort).

If you have, for example, a dozen python threads, those will execute AT THE PYTHON LEVEL interleaved on a single CPU core. However, when one of those threads calls sqlite3, the sqlite3 processing will execute concurrently with other Python code threads that are "ready to run" dispatched in interleaved fashion.

I've seen PRAGMA cache_size in the documentation - is this the main lever I should be looking at or are there other options that might help here?

Pragma cache_size increases the memory storage pool for pages read on the current connection. A page which is stored in the connection cache which has not been invalidated will be accessed from the cache without generating an interrupt to the OS to perform I/O. This will be faster than generating a kernel I/O request, even if that kernel I/O request can be serviced without performing a physical device I/O (as in serviced by the OS from an OS level cache).

My current plan is to set up a simple load test that exercises my Python + SQLite code across a number of different queries, then modify the cache_size and see what impact that has on requests/second and request durations.

This is a good plan.

I'd thoroughly appreciate any advice on ways to scale SQLite up like this.

There are other strictures in the sqlite3 library.

For example, if you have MEMSTATUS tracking enabled, then only one "sqlite3 thing" at a time can allocate/deallocate memory. Turning this off can speed up sqlite3 even if you are only executing one sqlite3 operation at a time because there will no longer be a need to check to ensure the single admittance requirement.

Each "connection" can also only be executing one thing in sqlite3 at a time. By default there are "safety features" which prevent you from violating this requirement by ensuring single entry on a connection. You can turn this checking off if you are certain that you are complying with single entrance per connection requirements (see https://sqlite.org/threadsafe.html)

Most of my workloads are read-only (I open the SQLite file with the ?mode=ro or ?immutable=1 flags) so I'm mainly interested in improving performance of SELECTs against read-only databases.

There is a difference between mode=ro and immutable=1.

The former (mode=ro) means that I will be only reading the database but that someone else may be concurrently updating, so perform all the checks necessary to ensure that the database is in a consistent status.

The latter (immutable=1) means that NO ONE will be updating the database so NEVER perform any of the checks to ensure that the database is in a consistent state because it will never change during the period the database is open with this flag.

This means that "more=ro" means that someone else can update the database, and that if they do so, then the current page cache will be invalidated and all page requests will generate an I/O request to the OS, although the OS may still be able to satisfy the request from the OS level cache without actually perroming a physical I/O operation, the CPU must still enter Supervisor mode to perform the I/O operation.

However, with "immutable=1" no checks are done to see if the database is being updated by someone else and if someone else does update the database then shit will likely break loose and hit the oscillatory device, smearing everything in sight with excrement ...

See https://sqlite.org/c3ref/open.html

(7.1) By Simon Willison (simonw) on 2021-07-18 03:20:10 edited from 7.0 in reply to 5 [link] [source]

If I run arbitrary SQL SELECT queries against a ~600MB database file the queries take as short a time as possible (in ms) to execute. My hope is that more RAM = more data queried in memory rather than being read from disk = faster executing queries.

(8.1) By Simon Willison (simonw) on 2021-07-18 03:18:26 edited from 8.0 in reply to 6.1 [link] [source]

This is really useful information, thanks!

I'm confident I'm using immutable and ro correctly in this case - I use ro if there might be other processes updating the SQL database file (often cron scripts). I use immutable when I can absolutely guarantee nothing will update the database - often that's because I'm hosting it on Google Cloud Run which doesn't even provide a persistent writable file-system.

(9) By Keith Medcalf (kmedcalf) on 2021-07-18 20:08:15 in reply to 7.1 [link] [source]

The data is only loaded from the actual device to the cache when it is read.

The "first read" must always be to where ever the data is stored (physical device, OS cache if it has been read before by the OS and cached in the OS cache, application cache if the application has previously read the data and stored it in the cache).

Effects of increases in memory will have no effect on the time required to fulfill the first request, but may reduce the time required for subsequent accesses. (I say may because the differential in time between doing an I/O to a physical device vs using the same data already read varies greatly depending on the storage device. The time required to do a physical device read may be astronomical if the underlying device is a secretary and the data you want is located on a different continent (especially if the airlines are not operating at full capacity yet). On the other hand, if the device is a PCI x4 connected SSD, then there may be very little detectable difference).

If the connection is closed, the application page cache is discarded. Each new connection is created with an empty page cache.

If the underlying database file is updated by a different connection, the application page cache is discarded. (A mode=ro connection will detect the change and invalidate the cache -- an immutable=1 connection will not detect the change and will not invalidate the cache)

So yes. If you are holding the connection open to perform multiple operations and the database is not updated (or is immutable), then having a sufficiently sized page cache will increase the access speed of pages after they have been read once.

However, it takes time to search the page cache for the desired page, and eventually the time taken to "search the cache" will exceed the time it would take so just "do the I/O" (by this point it is generally just a transition to Supervisor mode to service the I/O request from the OS cache).

Where exactly this sweet spot lies varies considerably depending on the particular configuration of the host computer (and also the workload).

Generally speaking optimum performance will be obtained somewhere around a 95% cache hit rate, and "writing" or "updating" will require sufficient cache size to maintain the 95% hit rate and hold the working change-set (so bigger cache is required for read/write operations, smaller cache works for read-only).

I have pre-configured my version of SQLite3 to always use 1 GB of application page cache per connection on 64-bit machines (pragma cache_size=-1048576 by setting SQLITE_DEFAULT_CACHE_SIZE to -1048576 when compiling the library).

(10) By Keith Medcalf (kmedcalf) on 2021-07-18 20:21:19 in reply to 8.1 [link] [source]

Good. There is also the issue of this so-called new-fangled re-invention of the pre-existing since the 1950's technology called "serverless" or "lambdas" (which is not really serverless at all, that is just a cutesy-pie name for the entertainment of the kiddies).

In this scenario, the so-called "server process" is not running all the time, but only on demand. Therefore, it is always fired with a cold-cache and there will be zero improvement possible by increasing application level cache size.

The long and the short of it is that if the "process" you are trying to speed up is fired up from the cold, there is nothing you can do except to "stay warm" rather than shutdown between uses. The performance of the cold-start will not be able to be improved. Like all I/O, the fastest way to do it is to not do it at all.

(11) By Simon Willison (simonw) on 2021-07-19 02:32:00 in reply to 10 [link] [source]

Cloud Run stays "warm" in between requests, so in practice I see a few seconds delay the first time I visit one of my applications but then everything loads quickly (without the cold start) for subsequent requests.

This is a great fit for my stuff that works on top of SQLite, since I'm building interfaces for people to explore a database - so they're likely to continue running different queries against the same tables for a period of several minutes, ideally while those stay in RAM.

On "serverless" - completely agree, it's a terrible name. The definition I've chosen to embrace for it is "scale to zero", which means I don't have to spend any money on hosting my applications unless they are actively serving traffic. Since I have over 100 side-projects running at any given time this characteristic is extremely useful!

(12) By ThanksRyan on 2021-07-19 15:22:33 in reply to 11 [link] [source]

...which means I don't have to spend any money on hosting my applications unless they are actively serving traffic.

Do you have a blog post on how you've managed this?

...

unrelated to this thread (mostly)

https://news.ycombinator.com/item?id=27871574

You may want to check out Visidata and see how the data is loaded. I doubt it's being imported into a SQLite database, but it's python code that can load csv files fast or at least in the background. https://www.visidata.org

(13) By andse-t610 on 2021-07-19 19:18:57 in reply to 9 [link] [source]

If the underlying database file is updated by a different connection, the application page cache is discarded.

After db update, arr all caches in all other connections are discarded? Or only some dirty pages?

(14) By andse-t610 on 2021-07-19 20:43:14 in reply to 1.2 [link] [source]

I have the similar situation in my python web api app: many read, few write.

Here are my observations.

Page cache is private by default, so it lives only while connection is alive. So if your application opens a new connection for each new request, page cache won't help much, because each connection will start "cold" with an empty cache and read data from disk.

I've tested 3 approaches to speed up in such case:

use connection pool and private cache. Slightly worse then next variant in my tests.
use shared cache for all connections. At app start you create "master connection" to ensure the shared cache alive all app run time. So next db connections will not start "cold", but "warm" using previous cache data.

But keep in mind that shared cache is not recommended to use and in general it is intended for other cases. But it really works for me, sorry for the "harmful advice" :-)

In the extreme case (page cache can contain entire db and all db data loaded into cache) it becomes in-memory db with disk-sync.

So, the third approach:

use in-memory db if you have enough RAM, few writes and can reduce the requirements for persistence when app fails.

At app start create "master connection" to file::memory:?cache=shared, load data from disk using backup api. Next your app uses this in-memory db when read. On write you have problems - you have to write into in-memory db and disk db. I haven't done this part yet, but you can use triggers + attached disk-db (so you have single transaction, but only if your disk-db not in WAL-mode - else you have separate transaction for each db), or just write python code and hope that both db will be successfully updated (if not - you can just reload in-memory db from disk - and let the whole world wait...).

I have tested my app in read-only scenario with these modifications under heavy load - shared cache is faster, in memory-db is several times faster.

Since 3.36.0 you can create shared in-memory db in another way - file:/memdbname?vfs=memdb. Source code commit says this mode doesn't use shared cache. But i have not tested it vs :memory: yet. @drh advices use memdb, but the question is not explained in the docs yet.

By the way, there are big problems with updating the sqlite version of the python sqlite3 module. I can advice you pysqlite3 module - it allows you to compile a standalone python module using sqlite3 amalgamation of any required version. The interface is similar to the standard python sqlite3 module. It is hardly googled because shadowed by pysqlite repo, which is deprecated after including it's code in the python standard library.

P.S. Excuse me for my english

(15) By Roger Binns (rogerbinns) on 2021-07-19 22:58:19 in reply to 1.2 [link] [source]

As others mentioned you can use PRAGMA mmap but that will only cover the first 2GB of the database (a 32 bit limitation even on 64 bit platforms imposed by SQLite itself).

The Python GIL is released while SQLite code is running, but SQLite itself then has a per database handle mutex so you won't get concurrency on the SQLite side unless you open the database multiple times. Consequently you'll get best performance if you open the database separately in each thread. (Note: the sqlite3 module does not support cross thread usage of the same db handle.)

The number of different query strings you have will also matter. Under the hood the text query strings get turned into C level sqlite3_stmt which is what does the actual query. The cached_statements parameter to connect controls a cache of text strings to sqlite3_stmt, defaulting to 100. If you have more than 100 then things will go slower (the text to sqlite3_stmt conversion is quite slow).

Beyond this, performance will depend on the usual profiling, your schema, indices etc.

I did implement a tracer in APSW that runs against your unmodified program and can show all queries summarised by popularity, execution time etc - https://rogerbinns.github.io/apsw/execution.html#apsw-trace

(16.1) By Simon Willison (simonw) on 2021-07-19 23:54:48 edited from 16.0 in reply to 12 [link] [source]

Do you have a blog post on how you've managed this?

Mainly it's just "use Cloud Run" - to use it you need to package up your (stateless) web application in a Docker container which Cloud Run will then start running (and charging for) when requests come in and switch back off when the traffic stops.

I made some notes on how to use it here: https://til.simonwillison.net/cloudrun/ship-dockerfile-to-cloud-run

I usually deploy to it with my datasette publish cloudrun tool though, which is described here: https://docs.datasette.io/en/stable/publish.html#publishing-to-google-cloud-run

(17) By Simon Willison (simonw) on 2021-07-19 23:56:56 in reply to 15 [link] [source]

Thanks very much, this is really useful.