deploying 1_000_000 sqlitedb on AWS?
(1) By anonymous on 2021-04-07 00:01:32 [link] [source]
Suppose, hypothetically, you take the view of "one db per user" and need to deploy 1_000_000 *.db files on AWS. What would the correct architecture be for such a problem?
The current best solution I have so far is:
- store each SQLite db as a separate file in S3, s3/user_###_###.db
- when a user makes a request, the cluster checks whether that db has already been pulled locally; if so, route to that machine; if not, pick a machine and have it download s3/user_###_###.db
- the machine that has the relevant db locally handles the requests
- after 5 minutes (or some kind of LRU policy), the local db is copied back to S3
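The steps above can be sketched as a small routing table with a TTL-based eviction pass. This is a toy model only: the class and key layout are hypothetical, and the two commented call sites are where real code would do the actual S3 download/upload (e.g. via boto3).

```python
import time

class DbRouter:
    """Toy model of the strategy above: each db lives in S3 and is
    pinned to one machine while hot. All names here are hypothetical."""

    def __init__(self, machines, ttl_seconds=300):
        self.machines = machines   # list of machine ids
        self.local = {}            # db key -> (machine, last_used timestamp)
        self.ttl = ttl_seconds     # 300s = the "after 5 minutes" step

    def route(self, user_id, now=None):
        now = time.time() if now is None else now
        key = f"s3/user_{user_id}.db"
        if key in self.local:      # db already pulled locally: sticky route
            machine, _ = self.local[key]
        else:                      # not local: pick a machine deterministically
            machine = self.machines[hash(key) % len(self.machines)]
            # real code: that machine downloads the db file from S3 here
        self.local[key] = (machine, now)
        return machine

    def expire(self, now=None):
        """Evict dbs idle longer than the TTL (copying them back to S3)."""
        now = time.time() if now is None else now
        stale = [k for k, (_, t) in self.local.items() if now - t > self.ttl]
        for k in stale:
            # real code: upload the local copy back to S3 before evicting
            del self.local[k]
        return stale
```

Note that even this sketch surfaces one of the flaws: nothing prevents two front-ends from routing the same db to different machines unless the routing state itself is shared and consistent.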
This naive strategy clearly has too many flaws to enumerate (but describes the gist).
Is there a best practice guide for deploying millions of sqlite db's on S3? For those with similar experience, any advice ?
(2) By RandomCoder on 2021-04-07 05:04:18 in reply to 1 [link] [source]
I can't speak for the best practice question, but you might consider some questions first:
Are you ok with losing data if the local db is never copied back to S3? It's rare, but machines can die with no notice. A user might apply a change, see it work, and then see it vanish later.
Have you taken into account the S3 API costs? If each of your 1_000_000 users causes 10 updates per day, that's something like $50 in PUT requests per day, not to mention other API calls. Maybe that's an impossible figure for your use case, maybe not.
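The arithmetic behind that estimate, using S3 Standard PUT pricing of roughly $0.005 per 1,000 requests (the price varies by region and over time, so treat it as approximate):

```python
users = 1_000_000
updates_per_user_per_day = 10
put_price_per_1000 = 0.005  # USD, S3 Standard PUT, approximate

puts_per_day = users * updates_per_user_per_day           # 10,000,000 PUTs
cost_per_day = puts_per_day / 1000 * put_price_per_1000   # USD per day
print(cost_per_day)  # 50.0
```

GETs are cheaper per request, but at this scale they (plus LIST/HEAD calls from the cache-check step) add their own line item.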
Do you need a snapshot of the entire state of all of the users? There really isn't a way to get a snapshot of your system as described, so if you need a snapshot of all of the users at a point in time, you'll need quite a bit of extra logic.
Is there ever a scenario where someone might fetch a large database from S3 while it's being updated? S3 is strongly consistent (now), but with a multi-part download it's still possible, in some edge cases, to grab different parts of a file as it's being updated.
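One defense is to pin the object version for the whole download: note the ETag up front, fetch each ranged part conditioned on it, and restart if any part comes back with a different ETag (with real S3, GetObject's IfMatch parameter does this server-side). Below is a pure-logic sketch of that retry loop; `head` and `fetch_part` are hypothetical stand-ins for the real HEAD and ranged GET calls:

```python
def download_consistently(fetch_part, head, n_parts, max_retries=3):
    """Fetch n_parts ranged pieces, restarting the whole download if the
    object's ETag changes mid-flight. `head()` returns the current ETag;
    `fetch_part(i, etag)` returns (part_bytes, etag_seen). Both are
    stand-ins for real S3 calls."""
    for _ in range(max_retries):
        etag = head()                 # pin the version we intend to read
        parts = []
        for i in range(n_parts):
            data, seen = fetch_part(i, etag)
            if seen != etag:          # object changed under us: torn read
                break
            parts.append(data)
        else:
            return b"".join(parts)    # every part saw the same version
    raise RuntimeError("object kept changing during download")
```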
Also, the comment about enumeration reminds me: Your naming strategy for new files will be important. You can't assume that a file that doesn't exist on S3 won't exist by the time you try to upload it.
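One common way around that check-then-create race is to never pick a contestable name at all: generate keys that are unique by construction (e.g. a UUID component), so there is nothing for a concurrent writer to collide with. A minimal sketch, assuming the key layout is yours to choose (the `user-dbs/` prefix here is hypothetical):

```python
import uuid

def new_db_key(user_id: int) -> str:
    """Build an S3 key that cannot collide with a concurrently created
    one. The prefix layout is illustrative; the uuid4 hex is what
    provides the collision resistance."""
    return f"user-dbs/{user_id}/{uuid.uuid4().hex}.db"
```

The trade-off is that you then need an index (a small metadata store) mapping each user to their current key, rather than deriving the key from the user id alone.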
I do in fact store tens of thousands of SQLite databases in S3, though they're treated as immutable in my case.
(3) By Simon Willison (simonw) on 2021-04-11 15:05:34 in reply to 1 [source]
Have you investigated EFS at all for this?