SQLite User Forum

Split 1 large db into several smaller dbs

(1.2) By adrian on 2022-10-16 03:21:07 edited from 1.1 [link] [source]

Hello,

I have a very large db, 2+ GB, and it needs to be updated periodically. However, due to bandwidth/deployment issues across thousands of devices, this has proved to be problematic. I was wondering if it would be possible to partition the db into several smaller db files, yet keep the entire dataset searchable. This would also mean that a corruption in one smaller file would not destroy the entire download.

Is this doable?

Thx.

(2) By David Raymond (dvdraymond) on 2022-09-28 19:58:18 in reply to 1.1 [source]

I don't believe there's any way for 1 "database" to span multiple files.

Depending on your schema and interdependencies (like foreign keys, triggers, etc.) between the tables, you could put a few tables each into different database files, and programs could ATTACH all those separate databases so they're all accessible from the original connection.
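
For example, a minimal sketch (the file and table names are made up):

    ATTACH DATABASE 'orders.db'    AS orders_db;
    ATTACH DATABASE 'customers.db' AS customers_db;

    -- Tables in the attached files are addressed by schema name,
    -- and a single query can join across the files:
    SELECT o.id, c.name
      FROM orders_db.orders o
      JOIN customers_db.customers c ON c.id = o.customer_id;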

If you're putting a fixed snapshot of the database file somewhere for other devices to download, then maybe use the "span disks"/"split to volumes" option on your favorite file compression utility to split up the 1 zip into a bunch of parts that are each no more than X bytes. Then if it splits into .001 through .007 for example, and you got an error downloading .005, you'd just have to retry the download for that one part and not the whole thing.

(5) By adrian on 2022-09-29 14:16:11 in reply to 2 [link] [source]

> Depending on your schema and interdependencies (like foreign keys, triggers, etc.) between the tables, you could put a few tables each into different database files, and programs could ATTACH all those separate databases so they're all accessible from the original connection.

So, would connecting to all of these files, with a table spanning them, act like a single table as far as the API is concerned? Or are you saying that you would put one table in one file and another table in another file?

(6) By Stephan Beal (stephan) on 2022-09-29 14:19:35 in reply to 5 [link] [source]

> Or are you saying that you would put one table in one file and another table in another file?

Exactly. You can have different tables in different ATTACHed files, but you cannot shard a single table across files in sqlite3 (except, perhaps, with some 3rd-party tooling).

(7) By ddevienne on 2022-09-29 14:33:59 in reply to 6 [link] [source]

If you don't care about referential integrity, that is...

With (enforced) foreign keys, parent and child tables must be in the same DB.
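
A quick illustration (hypothetical names): a foreign key's target is always resolved inside the child table's own database, never in an attached one.

    PRAGMA foreign_keys = ON;
    ATTACH DATABASE 'parents.db' AS p;

    -- "parent" below is resolved in main (this table's own database);
    -- it can never refer to p.parent in the attached file.
    CREATE TABLE child(
      id        INTEGER PRIMARY KEY,
      parent_id INTEGER REFERENCES parent(id)
    );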

Also, there's a limit on ATTACH, which defaults to 10 AFAIR, and has a hard limit of 64.

(8) By David Raymond (dvdraymond) on 2022-09-29 14:46:22 in reply to 7 [link] [source]

10 is the default, but the max is 125.

Set with the SQLITE_MAX_ATTACHED compile-time option.
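
You can inspect a build from SQL (a sketch; a MAX_ATTACHED entry should show up in the list when the option was set at build time):

    -- Lists the options this library was compiled with, e.g.
    -- MAX_ATTACHED=30 if SQLITE_MAX_ATTACHED was overridden.
    PRAGMA compile_options;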

(3) By Jason Gibson (Jason_Gibson) on 2022-09-28 21:47:09 in reply to 1.1 [link] [source]

Maybe one of these would help:

https://www.sqlite.org/rbu.html

https://www.sqlite.org/src/file?name=ext/misc/unionvtab.c&ci=trunk

(4) By adrian on 2022-09-29 14:11:06 in reply to 3 [link] [source]

Thanks for the links. Seems we are already using RBU, which I didn't realize, so this helps.

However, I'm curious. I don't know the code, but could there be a performance boost if a db file could be partitioned into smaller ones? Or, if no boost, maybe no or minimal degradation? This might also help with not having to perform VACUUM as much (after INSERT, DELETE, and UPDATE operations), which (from what I read in the RBU docs) is CPU intensive: at least 5x more than a non-RBU VACUUM.

I'm new here so please forgive my ignorance.

Also, is it possible to generate a SHA for a table as INSERT, DELETE and UPDATE operations are done to it, so that the rows that were already there don't have to be recalculated?

(9) By Simon Slavin (slavin) on 2022-09-30 13:05:09 in reply to 4 [link] [source]

You asked the right questions.

Adding one row to a multi-gigabyte table with a couple of indexes is not significantly slower than adding one row to a small database. Almost all the time SQLite takes goes on operations that concern the entire file or files: locking the database, updating the header, adding something to the journal. These operations cost the same whether the files are tiny or huge. Rebalancing the B-trees, which can take longer for bigger trees, is only a small part of how a SQLite database is updated.

You should not be performing VACUUM routinely unless you routinely delete large proportions (10%?) of your data. It's a waste of time, dramatically shortens the life of your storage medium, and introduces an opportunity for corruption of data anywhere in the database. SQLite automatically reuses freed space. Did someone tell you you should be VACUUMing? It's normally used only when you need your database to be smallest: sending the whole thing across a communications link, or putting it on removable media.
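
If you want to see whether a VACUUM would actually reclaim anything, a quick sketch:

    PRAGMA freelist_count;  -- pages sitting free inside the file
    PRAGMA page_count;      -- total pages; if the free count is a tiny
                            -- fraction of this, VACUUM buys you little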

I don't think any special RBU work would help you here. Vanilla programming and PRAGMA settings are sufficient. You might try with and without WAL mode, but that's about it.
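
For instance, switching to WAL is a one-line, persistent setting (a minimal sketch):

    PRAGMA journal_mode = WAL;  -- persistent: stored in the database file
    PRAGMA journal_mode;        -- reports 'wal' once it has taken effect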

The advantages you would gain from sharding your database involve things like backups. If you have a 2 GB database in one file, adding one row means you need to copy 2 GB of data to back it up. If you have a deluded virus checker which thinks it needs to scan data files, then adding one row to a 2 GB database means it has to scan 2 GB of data. There may be other operations like this. But they are not aspects of how SQLite itself works; they are file-management issues.

To put the size of your database in perspective: at work I used a database with 43 MB of data in one table and never had any problems with it. The whole of a single-language Wikipedia's text and markup (images? I don't know), by coincidence, fits in a 43 GB SQLite database, and it is regularly handled that way in places where people can't depend on good/fast internet connections. Again, no problems.

I don't understand your last paragraph. Rows which are already in a SQLite database don't get updated when other rows change. And I'm not sure how an SHA hash would help anyway.

(11) By adrian on 2022-10-16 03:31:24 in reply to 9 [link] [source]

Thx for the info Simon,

I don't have to worry about locking the db. I was wondering about VACUUMing, and you just verified my understanding.

Backing up the db isn't an issue either, as we maintain 2 copies of the db for redundancy with only one ever written to.

Yeah, don't worry about that last paragraph. I had my question answered in another post.

Thx again.

(13) By anonymous on 2022-10-16 17:14:28 in reply to 11 [link] [source]

You may want to look at the unionvtab extension (in the ext/misc source tree) to get an even broader view of sharding options.
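
A minimal sketch of what that looks like in use (hypothetical file and table names; unionvtab must be compiled in from ext/misc/unionvtab.c, the shards must share a schema, and the rowid ranges must be ascending and non-overlapping):

    -- One read-only virtual table backed by the same "events" table
    -- in two shard files:
    CREATE VIRTUAL TABLE temp.all_events USING unionvtab(
      "SELECT 'shard1.db', 'events', 1,       1000000 UNION ALL
       SELECT 'shard2.db', 'events', 1000001, 2000000"
    );

    SELECT count(*) FROM all_events;  -- queries span both shards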

However, I suggest 'as simple as possible, as complex as necessary'.

(14) By ddevienne on 2022-10-17 07:23:30 in reply to 13 [link] [source]

(10) By jaskij on 2022-09-30 18:00:24 in reply to 1.1 [link] [source]

I'm by no means an SQLite expert, but perhaps the session extension would allow you to encapsulate the changes and send only those. Your use case seems to me like a pretty good fit.

(12) By adrian on 2022-10-16 03:32:58 in reply to 10 [link] [source]

Thx jaskij,

Yeah, we're already doing that, but thanks for the info.