Breaking up a database of unrelated data into multiple databases
(1) By biodrool on 2022-08-29 02:20:50
I have around 300 GB of data in a SQLite database that is literally unrelated: when people query it, they just filter to select the rows they need; they never perform aggregations or window functions of any kind. The database is read-only, updated in batch jobs once per year.
This got me thinking: what if I broke it up into a hundred or so tiny databases, so that you only connect to the ones you care about, staying well below 1 GB? To make this work, there would need to be one shared database that stores a teeny tiny hundred-row table, telling clients which database to connect to for access to certain data.
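Concretely, the shared lookup database I have in mind might look something like this (just a rough sketch in Rust/rusqlite, which is what we use; all names are made up):

```rust
use rusqlite::{Connection, Result};

// Just a sketch of the shared lookup database; every name here is invented.
fn create_shard_map(path: &str) -> Result<()> {
    let conn = Connection::open(path)?;
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS shard_map(
             shard_id INTEGER PRIMARY KEY,
             min_key  INTEGER NOT NULL,   -- first key stored in this shard
             max_key  INTEGER NOT NULL,   -- last key stored in this shard
             db_file  TEXT NOT NULL       -- which database file to open
         );",
    )?;
    Ok(())
}
```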
My questions:
I know there are limits to how many databases you can attach, but in this case the client application would have entirely separate connections to maybe a hundred different databases. So would the per-connection overhead present a bottleneck?
Could multi-threading be used to run concurrent queries faster, since each thread would access only a tiny database and would not have to filter through as many rows as in the one giant table?
Is the design of multiple read-only databases, instead of one giant read-only database, commonly done for these sorts of use cases?
(2) By Chris Locke (chrisjlocke1) on 2022-08-29 07:51:19 in reply to 1
A database is designed for millions of rows. If you've got the appropriate index, searching a 3-row table for a value should take about the same time as searching a million-row table.
The only thing you're going to give yourself is more of a headache! ;)
I can understand that 300 GB of data is probably getting unmanageable - ages to back up, move around, etc. - and that would be the only benefit of slicing up the database. I think slicing it into 300 chunks is overkill though - you've gone from one extreme to the other.
(4) By ddevienne on 2022-08-29 08:06:30 in reply to 2
Let's say that 300 GB DB has 300-byte rows, so basically 1B rows.
Let's further assume you shard that DB into 300 x 1 GB DBs.
Let's consider that B-Tree accesses are proportional to log2(#rows): log2(1e9) is in the 30's, while log2(1e9/300) is more in the 20's.
So in these grossly simplifying assumptions, you can expect a small performance boost when using sharding, on DB sizes alone.
Of course, you pay for that with code-complexity. It's a tradeoff, as always.
But that's ignoring the concurrency aspects, if you need concurrent read and write. A read-only DB for its users often still needs to be written / updated by some process, to keep it up-to-date. If that happens concurrently, sharding can sometimes reduce contention. FWIW :)
(3) By ddevienne on 2022-08-29 07:56:48 in reply to 1
Perhaps you'll be interested in this virtual table? https://www.sqlite.org/src/file?name=ext/misc/unionvtab.c&ci=trunk
Yes, hundreds of different DB connections is fine. There is a small overhead to each, especially to cache the schemas, but most users shouldn't worry about this, unless in specific resource-constrained contexts.
Yes, as long as each connection is accessed from a single thread, and especially if those connections access different DB files (or different shards of your big logical DB), then multi-threading will improve both concurrency and performance. But you have to be extra careful of course, with anything multi-threaded...
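Something like this, roughly (an untested sketch in Rust/rusqlite since that's easy to show; file, table and column names are invented):

```rust
use rusqlite::{Connection, OpenFlags, Result};
use std::thread;

fn main() -> Result<()> {
    // Invented shard file names; in practice the list would come from
    // wherever the shard mapping lives.
    let shard_files = ["shard_001.db", "shard_002.db", "shard_003.db"];

    let handles: Vec<_> = shard_files
        .into_iter()
        .map(|path| {
            thread::spawn(move || -> Result<i64> {
                // One read-only connection per thread, never shared across threads.
                let conn = Connection::open_with_flags(
                    path,
                    OpenFlags::SQLITE_OPEN_READ_ONLY,
                )?;
                conn.query_row("SELECT count(*) FROM data", [], |row| row.get(0))
            })
        })
        .collect();

    for h in handles {
        println!("rows in shard: {}", h.join().unwrap()?);
    }
    Ok(())
}
```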
Note there are features which, if activated, limit concurrency, like tracking memory use: SQLite must lock a global mutex to update the value often, which then becomes a bottleneck.
I'm not sure if that design is common, but we've had a commercial server product using that pattern for years. 2 or 3 main DBs per-dataset, with up to thousands of mostly-blob-populated bulk DBs, depending on the dataset. The main DBs reach low GB sizes per-database (most less than 1GB), while the sum of all DBs for a given dataset can go upwards of 100's of GBs (and there can be thousands of such datasets). And that's just the data that goes in SQLite, the really large data stays in files, which sometimes reach into the TBs.
(6.2) By biodrool on 2022-08-29 13:35:01 edited from 6.1 in reply to 3
Yes, as long as each connection is accessed from a single thread
Appreciate that clarification. Out of laziness I was about to group some DBs as attachments, but now I will keep each DB accessed from a separate connection, to limit the possibility of multiple threads accessing the same connection.
So I will have <100 connections, each reading from the big DB with a little DB attached that they write into. The actual number of connections I will tune based on feedback also given in this forum thread.
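Roughly, each worker will look something like this (just a sketch; the paths, table and column names are placeholders, not the real schema):

```rust
use rusqlite::{params, Connection, Result};

// Sketch of one worker: its own connection reads the big DB and writes into
// a small per-worker DB attached as "out". All names here are placeholders.
fn process_shard(shard_id: u32) -> Result<()> {
    let mut conn = Connection::open("big.db")?;
    conn.execute(
        "ATTACH DATABASE ?1 AS out",
        params![format!("shard_{shard_id:03}.db")],
    )?;
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS out.data(id INTEGER PRIMARY KEY, payload BLOB)",
    )?;
    // One transaction per worker, so commits on different files don't
    // serialize against each other.
    let tx = conn.transaction()?;
    tx.execute(
        "INSERT INTO out.data SELECT id, payload FROM main.data WHERE shard_id = ?1",
        params![shard_id],
    )?;
    tx.commit()?;
    Ok(())
}
```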
(5.1) By biodrool on 2022-08-29 14:01:28 edited from 5.0 in reply to 1
you are all super helpful, thank you!
And you all read my mind:
The performance improvement to reading isn't quite as important, you're right. The maintenance/backups ARE!
The concurrent writing I completely forgot about, and it is very important. Currently we read this database in order to write it to a new SQLite database for reporting. Doing so into just one table is of course bottlenecked by each serial commit, so sharding would also have huge advantages for concurrent writing.
If coding complexity is the only downside to this design (I don't use any concurrency-limiting features), then it sounds like I'm headed in the right direction! I am using Rust/Rusqlite, which has elegant multi-threading support. There will of course be a tiny global table storing the mapping of shard_id -> db file, which you must first look up to know from which connection to run your query.
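For illustration, the lookup-then-open path might look something like this (just a sketch; the shard_map table and its columns are the hypothetical names from my first post):

```rust
use rusqlite::{params, Connection, OpenFlags, Result};

// Sketch of the shard_id -> db file lookup described above.
// "shard_map.db" and its columns are hypothetical names.
fn open_shard_for(key: i64) -> Result<Connection> {
    let map = Connection::open_with_flags(
        "shard_map.db",
        OpenFlags::SQLITE_OPEN_READ_ONLY,
    )?;
    let db_file: String = map.query_row(
        "SELECT db_file FROM shard_map WHERE ?1 BETWEEN min_key AND max_key",
        params![key],
        |row| row.get(0),
    )?;
    // Open the shard itself; the caller keeps this connection on its own thread.
    Connection::open_with_flags(db_file, OpenFlags::SQLITE_OPEN_READ_ONLY)
}
```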
Actually, what probably makes the most sense is to use a server database like Postgres, with partitioning for a performance boost. Then, during development, I can concurrently read/write and check things, and once ready, write everything out to SQLite for the customer release. But until then I'll experiment with this sharding idea. Thank you!