ZIPVFS questions

(1) By Pepijn Van Eeckhoudt (pepijnve) on 2021-09-10 16:02:11 [link] [source]

I'm working on a project where I'm thinking of using SQLite to store large read-only, indexed representations of my raw data. I'm going to shard this across multiple SQLite files to keep the single file size under control, but the total projected required disk capacity I'm going to need is outgrowing the storage budget I have available.

I've been looking into various ways to compress the database files. Since they will be produced by a batch job and be read-only from that point forward zipvfs seems like a good candidate. It would be nice to be able to try this out to see what the impact on query performance is with various compression algorithms. Are there any comparative benchmark numbers available for this? Or is it possible to arrange some kind of evaluation of this extension before I go ahead and purchase a license?

The license for zipvfs states that the extension needs to be statically linked into the application. My code is all Java though that loads SQLite as a dynamic library and uses it via libffi. Does that preclude using zipvfs?

(2) By ddevienne on 2021-09-10 16:39:55 in reply to 1 [link] [source]

Do you really need zipvfs? You talk about your storage budget, but what's you RAM budget? Compared to your storage budget?

I've just in the past few days ZIP'd up several SQLite DBs, to make them as small as possible
(also VACUUM'd and optimized the page_size for minimum size) so I can check them in to source control,
for testing, and dynamically decompress them using https://www.sqlite.org/zipfile.html at runtime,
and load the blobs as in-memory DBs using https://www.sqlite.org/c3ref/deserialize.html,
which sounds a little bit like your use case.

Worked like a charm for me, but I'm in C++, not Java, and it's all linked statically.

Only issue I can foresee, if that this might waste more RAM than zipvfs, but not having used the latter, I can't say for sure.

(3) By Pepijn Van Eeckhoudt (pepijnve) on 2021-09-10 16:46:44 in reply to 2 [link] [source]

On the low end I'm dealing with 40-50GB, but our high end projection is multiple TB. If that much disk space is deemed problematic I doubt that much memory will pass :D.

(4) By ddevienne on 2021-09-10 17:16:21 in reply to 3 [link] [source]

Right. That's 3 orders of magnitude larger than my case :)

I packed 4 DBs totalling 5.4MB into a <900KB zip file.

In your case, it depends how big each DB inside the ZIP would be.
And how many of those DBs you need to keep in memory at any one time.

If OTOH what is big in inside single-cell-values, you could use https://www.sqlite.org/src/file/ext/misc/compress.c instead, similar
to how https://sqlite.org/sqlar/doc/trunk/README.md emulates / replicates a ZIP.

But it could well be that zipvfs is better for you indeed.
And going commercial, you then have the full support of DRH and Dan! :)

(5) By mlin (dnamlin) on 2021-09-10 18:00:01 in reply to 1 [link] [source]

I'd welcome any feedback on sqlite_zstd_vfs if that might be suitable. It has a number of disadvantages compared to ZIPVFS discussed there, some advantages (Apache-licensed; uses background threads), and I'm slowly improving it as time permits.

I've used it from JVM apps by instructing SQLite to load the extension packaged in its own dynamic library.

(6) By Richard Hipp (drh) on 2021-09-10 18:24:39 in reply to 1 [link] [source]

If the databases truly are read-only, have you considered CEROD instead of ZIPVFS?

(7) By Pepijn Van Eeckhoudt (pepijnve) on 2021-09-10 18:32:41 in reply to 6 [source]

Whatever works best is fine by me; it doesn't have to be zipvfs. The documentation on CEROD is rather limited so it was a bit harder for me get a feel for how it works, what the tradeoffs might be, etc.

One concern I had with CEROD was that it runs as a post processing step. That will probably provide better performance than on-the-fly compression since you know about all the data that's coming up front, but it will increase my peak disk capacity requirements further.

An attractive feature of zipvfs (if I understood it correctly) is the plugability of compression algorithms. Being able to experiment with various algorithms (snappy, lz4, deflate, ...) to see what gives us the best results would be nice.

I would happy to give both a try if that's possible so I can pick the most appropriate solution. I can go into more details on what I'm building if that helps, but that would have to be in a more private conversation. I don't think my employer would appreciate me sharing too many details just yet.

(8) By Pepijn Van Eeckhoudt (pepijnve) on 2021-09-11 07:01:13 in reply to 6 [link] [source]

Richard, what’s the commercial/sales communication channel for these products? https://sqlite.org/prosupport.html seems to point to this forum, but that’s probably not the right place to have this kind of discussion since I’m limited in how much information I can share in public.