SQLite User Forum

Flat files vs SQLite

(1) By iio7 (iio7iio7) on 2024-01-06 01:02:59

Hi,

I am building a web-based application in which each user has a few options stored.

Currently, in order to avoid complexity, I am using JSON files for storing user options.

When the user logs in, his file is read; if he changes a setting, the file is written. Other than that, the file is not used.

There is no searching involved in any of this. Each user's file is fetched when the user logs in, based upon a part of his login ID (generated by the application); as such, fopen fetches the file directly without searching.

In order to avoid too many user files residing in a single directory, a structure like: a/b/f/foo.json is utilized.
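
To give an idea, the lookup amounts to something like this (a simplified Python sketch, not the actual code; the base directory and function names are made up):

    import json
    import os

    BASE_DIR = "/srv/app/user-options"  # made-up location

    def options_path(login_id):
        # The nested directories come straight from the first characters of
        # the application-generated login ID, e.g. a/b/f/abf1d2....json
        return os.path.join(BASE_DIR, login_id[0], login_id[1], login_id[2],
                            login_id + ".json")

    def read_options(login_id):
        # No searching: the path is computed, then the file is opened directly.
        with open(options_path(login_id), "r", encoding="utf-8") as f:
            return json.load(f)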

Now, I am wondering if there would be any benefit in using SQLite for this rather than JSON files?

I do not need transactions. I am running on ZFS so I have checksumming on user files.

The complexity I am dealing with is related to the JSON parser itself. If I introduce SQLite, I get not only the complexity of the database software itself, but also that of the SQLite driver for my application programming language.

Occasionally I do need to use grep to look for things in the JSON files, but they reside on SSDs so it's super fast even with millions of files. Of course it is nowhere near the speed of searching an index with SQLite (I have tested both), but it is fast enough.

What benefits of utilizing SQLite am I possibly overlooking?

Kind regards

(2) By Larry Brasfield (larrybr) on 2024-01-06 02:05:37 in reply to 1

Have you considered what happens to the data when a power failure or other unavoidable interruption occurs during a modification of one of the JSON files?

(3) By iio7 (iio7iio7) on 2024-01-06 03:10:20 in reply to 2

Yes. Since I am running on ZFS, it has copy-on-write, intention logging, and transactions with rollback.

ZFS operates like a transactional database management system in that old data is not overwritten in place when it is updated, as it would be on a traditional filesystem. Instead, the new data is written elsewhere on the disk, then the filesystem metadata structures are updated to point to the new data, and only then is the old data's block freed for reuse by the filesystem. In this way, a sudden power loss leaves the old copy of the data in place if the new updates are not 100% committed to persistent storage. You won't have half a block replaced, or anything like that, causing data corruption.

(4) By luuk on 2024-01-06 10:26:45 in reply to 1

"In order to avoid too many user files residing in a single directory"

Solved when using SQLite

"complexity is related to the JSON parser itself"

Solved when using SQLite (and please do NOT store JSON object in the database 😉)

"I not only get the complexity of the database software itself....."

I think, in the end, using SQLite is easier to maintain than those files...

Another benefit is that you can compare the settings of individual users more easily (using an SQL statement, rather than comparing thousands of files).
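
For example, with a hypothetical options table (one row per user and setting), such a comparison is a single query:

    import sqlite3

    con = sqlite3.connect("options.db")   # made-up database and schema
    # Which users have dark mode enabled?  One statement instead of
    # grepping thousands of JSON files.
    rows = con.execute(
        "SELECT user_id FROM options WHERE name = 'theme' AND value = 'dark'"
    ).fetchall()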

"but they reside on SSDs so it's super fast even with millions of files"

An SQLite database with millions of records is not a problem....

(5) By punkish on 2024-01-06 10:45:24 in reply to 1

I can offer my experience (since it is similar to your situation) and let you decide for yourself. I use SQLite for all my data needs to power a REST API to query the db that, in turn, powers other websites.

However, for caching, I use plain text JSON files stored in a hierarchical directory structure 0/01/019/019d73e1960e22dbcd8bebd155a7156b.json where the JSON file name is a hash constructed from the query URL whose params have been sorted for predictability.

Many caching systems use SQLite, so why don't I do that as well? A lot of my queries involve many JOINs and ORDER BYs, which makes them slower than I would like. Using the URL to construct the cacheKey, checking whether that JSON already exists, and sending it back if it does is waaaay faster, as the rest of the app is completely bypassed. Like your use case, I am mostly reading the files but also writing them (if a cached result doesn't exist, SQLite queries the db, the JSON is written to the cache, and sent back).
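
In outline the cache layer is nothing more than this (a simplified Python sketch; run_query stands in for the SQLite-backed handler, and the paths are made up):

    import hashlib
    import json
    from pathlib import Path
    from urllib.parse import parse_qsl, urlencode, urlparse

    CACHE_DIR = Path("/var/cache/api")   # made-up cache root

    def cache_key(url):
        # Sort the query params so equivalent URLs hash to the same key.
        parts = urlparse(url)
        canonical = parts.path + "?" + urlencode(sorted(parse_qsl(parts.query)))
        return hashlib.md5(canonical.encode("utf-8")).hexdigest()

    def cached_response(url, run_query):
        key = cache_key(url)
        path = CACHE_DIR / key[0] / key[:2] / key[:3] / (key + ".json")
        if path.exists():                       # hit: bypass the db entirely
            return json.loads(path.read_text())
        result = run_query(url)                 # miss: query SQLite ...
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(result))     # ... cache it, and return it
        return result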

So yes, there are advantages to using SQLite (you can compare or query the stored values -- the settings, in your case, or the cached query results in my case). But I don't need to ever compare two separate queries, and I am not bothered by all the JSON files inside my cache directory. Besides, if I ever want to just look at the results of a query, I can look at the cached JSON file using a text editor which I find more convenient than SQLite's CLI.

You can decide which method you want -- both work with their own positives and negatives.

(6) By Simon Slavin (slavin) on 2024-01-06 12:25:44 in reply to 1

I don't think anyone has mentioned space overhead. Lots of little files means wasted space at the end of each file; space used for directories, directory entries, and file metadata (every file has protection settings, last-access dates, etc.); and endless allocation and disposal of file handles, which means processing and memory allocations and disposals.

One big SQLite database means one file handle, one database handle, one file entry, one piece of file metadata, and no shuffling of memory allocations.

Also, it'll take up far less space on your backup medium, and can be backed up in one single process which either succeeds or fails. No need to detect and analyze backups which failed partway through the process.

(7) By punkish on 2024-01-06 16:22:19 in reply to 6

The contrarian argument is that even the tiniest change to the SQLite db means the entire one big SQLite db file has to be backed up. Change a few bytes and you potentially back up a 16 GB db file.

With lots of little files, if one file changes, only that file is backed up.

There is always the "other side" to the argument :)

Oh, and who cares about lots of files! (unless you have space constraints, of course). As long as you don't have an excessive number of files in one directory (and that is ensured with the aforementioned hierarchical directory structure), the filesystem is wonderfully fast and efficient at managing them.

(8) By iio7 (iio7iio7) on 2024-01-06 20:13:02 in reply to 5

Thank you for sharing. This is exactly what I am doing: the JSON file name is a hash constructed from the query URL.

(9) By anonymous on 2024-01-07 13:37:55 in reply to 7

Couldn't you use sqldiff or some similar tool to discover the incremental changes to a database (holding files and file data) and then back up just those changed bytes? It seems to me that this could potentially result in a smaller amount of data to back up at any time versus backing up entire files when they are modified? In other words, granularity of data matters.

A counter argument, of course, is that you could construct a file-based backup solution that operates incrementally, i.e. units of bytes instead of as an entire file. (The z file system, which provides data deduplication comes to mind, here.)

I imagine there are pros and cons to either approach... which means that there's probably no single "best" answer.

(10) By punkish on 2024-01-07 17:02:46 in reply to 9

"Couldn't you use sqldiff or some similar tool to discover the incremental changes to a database (holding files and file data) and then back up just those changed bytes? It seems to me that this could potentially result in a smaller amount of data to back up at any time versus backing up entire files when they are modified? In other words, granularity of data matters."

I have no idea how sqldiff works, but since an SQLite db is a single file, how would backing up a "few changed bytes" be possible? You either back up the whole file or you don't. The backup program looks at the file and considers it different if it has changed since the last backup -- it doesn't care whether the change was tiny or huge.

The JSON files we are talking about in this thread are tiny. Backing up individual files is so much quicker than backing up a multi-gigabyte file (at least in my case).

(11) By Spindrift (spindrift) on 2024-01-07 17:12:25 in reply to 10

"I have no idea how sqldiff works"

It generates the SQL needed to convert database a into database b (with some caveats that are likely not relevant here).

So it is pertinent to the question, and could be used as an incremental backup flow.
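
A rough sketch of what that could look like (Python; it assumes the sqldiff binary is on the PATH, and the paths are made up):

    import sqlite3
    import subprocess
    import time

    LIVE_DB = "app.db"                    # made-up paths
    BASELINE_DB = "backup/baseline.db"

    def incremental_backup():
        # sqldiff prints the SQL needed to turn the baseline into the live db;
        # saving that script backs up only what changed, not the whole file.
        diff_sql = subprocess.run(
            ["sqldiff", BASELINE_DB, LIVE_DB],
            capture_output=True, text=True, check=True,
        ).stdout
        script = "backup/changes-%d.sql" % int(time.time())
        with open(script, "w") as f:
            f.write(diff_sql)
        return script

    def refresh_baseline():
        # Periodically rebase with SQLite's online backup API so diffs stay small.
        src = sqlite3.connect(LIVE_DB)
        dst = sqlite3.connect(BASELINE_DB)
        src.backup(dst)
        src.close()
        dst.close()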

My take on the whole question though is that the OP has a working system. Changing it for the sake of it would seem foolish. SQLite could certainly cover this use case, but unless the OP is actually experiencing some problem or other, I would personally leave things well alone.

(12) By ralf (ralfbertling) on 2024-01-08 17:18:45 in reply to 11

There are theoretical and/or practical (dis)advantages.

As the OP has done some measurements, the differences in time can be multiplied by the number of logins to get the possible gains / losses for each technique.

  • If the effects are insignificant and you neither have nor expect problems or significant changes, you can stop right here. However, the "expect" part is discussed below in slightly more detail.

A possible disadvantage

SQLite is single-writer, so changes might need to be queued. Of course the users could be put into buckets based on an id-hashcode, but then you also lose some of the benefits.

If the database is locked, the update SQL could be written to a file that is processed afterwards.
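
A minimal sketch of that spill-to-file idea (Python; the table, column and file names are only illustrative):

    import sqlite3

    PENDING = "pending-updates.sql"   # replayed later, in order

    def save_setting(db_path, user_id, name, value):
        try:
            con = sqlite3.connect(db_path, timeout=2.0)  # wait briefly for the write lock
            with con:
                con.execute(
                    "UPDATE options SET value = ? WHERE user_id = ? AND name = ?",
                    (value, user_id, name),
                )
            con.close()
        except sqlite3.OperationalError:
            # Database still locked: append the statement to a text file instead.
            # (Naive quoting, purely for illustration.)
            with open(PENDING, "a") as f:
                f.write("UPDATE options SET value = '%s' "
                        "WHERE user_id = '%s' AND name = '%s';\n"
                        % (value, user_id, name))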

Incremental backups

Writing the INSERT/UPDATE/DELETE statements into text files could also serve as an incremental backup, instead of keeping two copies of the SQLite DB around and doing a complete diff. (If you use multiple files, you need to maintain the order of changes or organize the files in a way that prevents dirty writes.)

Once the changes exceed a certain threshold, a complete backup can be taken.

A file-system can be thought of as a database.

You perform similar operations and need some API to persist data. This, again, can serve as a metric to evaluate different techniques.

The power of SQL

If the data model changes (and one could argue that "If" should be replaced with "When" in that part of the sentence), views and triggers can serve as an abstraction layer that isn't available with plain JSON. Also, as you mentioned, scripting some report across thousands of files is not just slower in execution, it is also more work to write and more prone to (human) error.
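
For example, with a made-up schema, a view plus an INSTEAD OF trigger keeps the application's reads and writes stable while the underlying table changes:

    import sqlite3

    con = sqlite3.connect("options.db")   # made-up schema, for illustration
    con.executescript("""
        -- The application keeps using user_settings even if the underlying
        -- table is later split or renamed; only the view has to change.
        CREATE VIEW IF NOT EXISTS user_settings AS
            SELECT user_id, name, value FROM options;

        -- An INSTEAD OF trigger lets updates go through the same abstraction.
        CREATE TRIGGER IF NOT EXISTS user_settings_update
        INSTEAD OF UPDATE ON user_settings
        BEGIN
            UPDATE options SET value = NEW.value
            WHERE user_id = NEW.user_id AND name = NEW.name;
        END;
    """)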

Do you like options?

If your requirements change, even a modern filesystem like ZFS might reach its limits in scalability. As SQLite is single-writer, that might also be a limit. But if your data lives in a database, you have the option to move to a server-based database like PostgreSQL or MySQL. (That will require some work, especially if you embrace data type affinity (which IMHO is both a blessing and a curse), but much less than moving from a complex JSON folder structure to a db server.)

(13) By Warren Young (wyoung) on 2024-01-08 20:23:13 in reply to 12

"ZFS might reach its limits in scalability"

Unlikely.

(14) By ralf (ralfbertling) on 2024-01-09 12:30:20 in reply to 13

This is increasingly off topic, but I agree my wording didn't make my point clear.

The point I was trying to make is that a machine running ZFS might show worse performance when changing millions of small files per second. That is one of the reasons database servers exist.

It is a different axis of scalability than what the article describes. Of course the database server will pass the load on to the file system. In a database, you have the choice to store a lot less metadata along with each user setting and to group that data into a sensible number of files.

(Last time I checked, ZFS also wasn't a network file system that could build large clusters of multiple servers. So things might get a lot more complicated once you exceed the amount of storage that can be handled by a single machine. Databases are built to scale from a small SQLite file to a cluster of servers working together.)

I don't mean to say anything against ZFS. The point I was and am trying to make is that an additional abstraction layer on top of the file system might be helpful for certain definitions of scalability.

As I understood the OP's question, they were asking about the potential advantages of using SQLite in particular, so I included a general point on why one might want to use a database in general.

(15) By anonymous on 2024-01-10 12:45:56 in reply to 14

Ah. If you want an abstraction on top of a filesystem to query file names (and other things), would something like osquery's file virtual table be of any use? (https://www.osquery.io/schema/5.10.2/#file)

This would give you easy access to file names using SQLite machinery, but I'm not aware of any virtual table that would give you access to file contents. But it's been a while since I looked at osquery in detail. I'm probably lagging (far) behind.

(16) By anonymous on 2024-01-10 14:41:45 in reply to 15

Yeah... I am lagging behind. After poking around a bit, I see the 'yara' tables in osquery can give you access to (patterns in) a file's content.

(https://github.com/osquery/osquery/blob/master/docs/wiki/deployment/yara.md)