Thousands of SQLite Databases for each user

(1) By anonymous on 2021-03-24 17:47:36 [link] [source]

I am planning on giving each user their own SQLite DB, rather than storing everything on one large database (e.g. MySQL/Postgres).

I see Internet threads online advising against this because it makes it harder to horizontally scale.

Has anyone had experience with something like this? I don't see why it would be hard to scale as it is just singular files that can be easily replicated.

I am currently storing everything on individual flat files, for thousands of users on my site and looking to use something more reliable - postgres just seems like overkill.

(2) By jose isaias cabrera (jicman) on 2021-03-24 20:16:11 in reply to 1 [link] [source]

Has anyone had experience with something like this? I don't see why it would be hard to scale as it is just singular files that can be easily replicated.

So, I am thinking that you would have some SharedDB for all users to keep track of sharable data, etc. If it's only reads, it should be ok, if you are talking about reading and writing, that would be tough. Back in 2006 I started a project tracking program with 11 PMs, each with their own DB, plus one shared DB that would update after every PM save. After multiple broken SharedDB, and lots of code changes checking for DB locks, it worked. But, that was only 11 users. And I should say that, sometimes, they would try to write at the same time and the write would time out, so it will try again. PRAGMA busy_timeout will be your friend. Again, that was only with 11 PMs. 1000+ users? I don't think it will work. Imagine if 100 of them write to the DB at the same time. Anyway, I may have all of this wrong, but that is the input I can give. IHTH.

josé

(4) By anonymous on 2021-03-25 07:34:47 in reply to 2 [link] [source]

Thanks for the reply!

To give context, my site is somewhat of a social network (web app with social features).

Each database would be unique to the user and they don't interact with each other.

Essentially, each user has their own directory (e.g. /var/www/html/data/{unique_id})

That unique folder would contain everything needed for that specific user (one SQLite database + images - profile picture, etc), as opposed to a Shared Database.

There is one shared DB, but that is more for minor things and probably gets written to like < 100 times a day. It doesn't contain the main stuff.

When a user logs into my site, any writes would be made to their unique SQLite database. However, their content may be read by thousands of others from that same database. So the write/read ratio would be something like 1/1000.

Tldr; Each of the databases are mostly only written to by the user (so not that frequent), but can read from thousands of times).

In terms of horizontally scaling, I was thinking I could just have multiple servers (under a load balancer), and whenever a database is written to it gets replicated to the others.

Would that work?

(3) By Ryan Smith (cuz) on 2021-03-24 21:39:35 in reply to 1 [source]

Multi-files are perfectly fine and perfectly doable, and it for sure is a bunch better if you already have such a system to have that files rather be SQLite DBs than flat files. I assume you obviously have a central DB/repository for central data? If so, having distributed data in a distributed way is just fine, especially if it works for your use case - but if you realize at some point it is cumbersome, then change it, otherwise keep it.

The internet is full of people who read "Well it's wise to do X in stead of Y" and then promptly on every forum screams at the top of their lungs: "NEVER DO Y, EVER! ARE YOU STOOPID???? YOU WILL ALL SURELY DIE IF YOU DO!".

I take a special pleasure in ignoring them.

Other favorites of mine are people abhorring "premature optimization". I agree that premature optimization is a bad habit, but I seem to disagree very widely with some people on what constitutes "premature". Some people will hide behind this little rule to just NEVER do any optimization, ever. if the optimization opportunity is real, you most definitely SHOULD do it. It's one way to tell if a project is maturing: when the optimizations start getting specific.

You know your use-case better than anyone on the interwebs - including us.

My advice? Try and write down the advantages and disadvantages for any way to do things, and be honest. If you can't think of any, read up online, as widely as possible. Once you have a good honest list, decide what you can and cannot live with or work around long-term, then choose. (And be prepared to have made a mistake and have to change it in future).

(5) By anonymous on 2021-03-25 07:43:18 in reply to 3 [link] [source]

Appreciate the response!

I do have a central DB but that is more for smaller things and probably gets written to (< 100 times a day) - it is mostly used for reads.

Each user would have their own unique SQLite database, stored under their unique directory.

In terms of horizontally scaling (in the future), I was thinking of just having a few identical servers under a load balancer. Since the DBs are mostly for reads, whenever someone writes to an SQLite file it would just get replicated across the other instances - which should be pretty quick I believe since it's just one file for that user.

If it ever turns out to not be that efficient, I would probably just use DynamoDB

Would that work?

(6) By Ryan Smith (cuz) on 2021-03-25 13:45:39 in reply to 5 [link] [source]

See, now that's where it gets sticky.

The question of efficiency is not a problem, it will be very efficient (probably moreso than other DBs) given a few conditions.

If you load-balance, it means either you have multiple processes or multiple machines, so if data is on one machine but not the other (if a file) then it can't really work, so are you duplicating the data? Would one server process ever need to access more than one of the files?

Maybe it is that you access the data via a common drive/network attached storage? If so, SQLite isn't suitable for that sort of environment (well, not for a flaw of SQLite, but for a flaw in how network file-systems lie about locking - there's tons of literature on this, both on the forum and elsewhere, so I won't rehash it here). See: sqlite.org/whentouse.html

You really need to have one server with one direct storage under control of the same OS that is hosting the process that is using SQLite. You may expand to other drives using logic volumes though - same OS, so no problems there.

Some of the above may apply, you have to decide which to know if the solution is perfect or not.

(7) By Dan Shearer (danshearer) on 2021-03-25 14:03:50 in reply to 1 [link] [source]

On 2021-03-24 17:47:36 anonymous said:

SQLite is not designed for large multi-user distributed applications. But you have described a use case that might match SQLite's sweet spot.

The word "scaling" is often used lightly, like the word "terrabyte". The magical runeword "horizontal scaling" does not by itself make giant workloads function well. That's why you need to look very hard at what is really meant here, and if you in fact are ever likely to have a giant workload. Not many do.

I see Internet threads online advising against this because it makes it harder to horizontally scale.

The scaling for any one mostly read-only SQLite database is firstly vertical, and horizontal doesn't tend to arise for a long time if ever. Consider the number of concurrent reads you can get from an SQLite node before you even need to feed it more CPU and RAM, and then up until the maximum amount of RAM and CPU it is possible to give that node these days. It's such an enormous number of hits that if you still need to look at horizontal scaling then you probably have lots of ways of paying for it. Horizontal scaling certainly is possible, but by comparison I have never heard of a way of having a single Postgres node coming within one zero of the hitrate of SQLite, because Postgres does different things. From the limited information you provide, the use case is a good match for SQLite.

As a subset of this, it is conceivable that your problem with SQLite becomes that the relatively few writers briefly but noticeably block some of the vast number of readers even with WAL2. That is an area I'm looking at where SQLite is more simplistic than is probably necessary, but even with SQLite as-is there are things you can do to address this.

The requirements and design you have sketched out in the original post plus followups suggest that, even if load is not evenly distributed across each SQLite database, you can still serve billions of SQL queries in total per day (I get that guess by multiplying "thousands of databases" by a conservatively low number of requests per database.) There are assumptions in here such as no merging of data between nodes and no pathological queries etc.

I don't see why it would be hard to scale

From the information you gave, including that it works with plain files today, probably not. As food for thought, the limiting case of plain files and nearly 100% readers is email, where Maildir and similar systems will scale to very large numbers because there are so few moving parts (or, a short codepath.) I'm not saying that you are better off with plain files, but installing a networked SQL database may decrease your scaling by introducing so many more moving parts. SQLite has a short codepath, which is part of why it can serve so many requests per second.

as it is just singular files that can be easily replicated.

I suspect that may be an incorrect scaling dimension to be worried about from the information you have provided. "Copying single files" suggests that a given SQLite node is having too many reads to cope, which as I have said is a very high number. It also suggests you will need a loadbalancer/director to switch across nodes, which you may not need in the "thousands of SQLite databases" model although you haven't given much to work with. So I suggest that until you have nodes getting to that very high usage, you shouldn't need to be copying any files.

Dan Shearer