Multiple compression methods
(1) By Yair lenga (Yairlenga) on 2022-06-15 23:02:06 [link] [source]
There are multiple compression engines available, which produce superior compression for various use cases. Examples include Google's Brotli and Facebook's Zstandard. Newer zip implementations allow compressing with those methods, recording the compression engine's name in the archive's directory.
There are a few SQLite extensions that support those compression methods, but each one builds its own command-line API, etc. I think it would be better to extend sqlar to take an extra argument, the compression engine, and use it to compress/decompress the data. This would allow new engines to be added just by supplying the compress/decompress methods.
As an extension, it would be nice to allow passing a compression level. Most compression engines provide multiple levels (fast/medium/slow) to control the tradeoff between speed and size.
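To make the idea concrete, here is a minimal sketch of what such a plug-in interface might look like (all names are hypothetical; nothing here is an existing sqlar API):

    /* Hypothetical plug-in interface -- a sketch only, not an existing
    ** sqlar API.  sqlar would look up an engine by the name stored
    ** alongside each archive entry and delegate the byte work to it. */
    typedef struct CompressionEngine CompressionEngine;
    struct CompressionEngine {
      const char *zName;    /* name recorded in the archive, e.g. "zlib" */
      /* Compress nIn bytes from aIn into aOut (capacity *pnOut).
      ** Return 0 on success and set *pnOut to the compressed size. */
      int (*xCompress)(const unsigned char *aIn, int nIn,
                       unsigned char *aOut, int *pnOut, int level);
      /* Inverse operation; same return convention. */
      int (*xUncompress)(const unsigned char *aIn, int nIn,
                         unsigned char *aOut, int *pnOut);
    };

A new engine would then register one such struct, and sqlar itself would never need to know the details of any particular library.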
Looking for feedback and potential help with implementation. I have C experience, but no experience contributing code to SQLite.
See [zstd-sqlite] https://github.com/phiresky/sqlite-zstd and [brotli-sqlite] https://github.com/itroot/sqlite-brotli-compress
Yair
(2) By Stephan Beal (stephan) on 2022-06-15 23:32:50 in reply to 1 [source]
There are multiple compression engines available, which produce superior compression for various use cases. ... Looking for feedback
The most significant issue with every one of those libraries, from the perspective of getting support added to 3rd-party software like sqlar, is that they are not ubiquitous. zlib is preinstalled (or easily installed) everywhere. It is the de facto standard for compression libraries. Availability often trumps technical superiority.
I think it would be better to extend sqlar to take an extra argument, the compression engine, and use it to compress/decompress the data.
That would inherently make those archives unportable to any machine which doesn't have the corresponding engine installed (recall that zlib is the only one which is readily available everywhere). Archives "really need" to be portable to arbitrary users' environments. For the sake of portability, the "least common denominator" (in this case zlib) is generally the preferable choice even if it's not technically the most superior choice.
If absolute best compression is your goal, a single-purpose container format like tar will probably beat sqlar every time.
(3) By ddevienne on 2022-06-16 07:45:03 in reply to 2 [link] [source]
But do note that some compression engines like LZ4 have been around a while, are well known, and come with different tradeoffs compared to ZLib (LZ4 favors speed at the expense of less compression). They are also available as an amalgamation for easy embedding directly in the source code as a 3rd party, instead of relying on system dependencies. Even PostgreSQL now supports LZ4 for TOAST (since v14, I believe), which does speak in LZ4's favor.

And LZ4 embeds XXHash, an excellent and very fast hash algorithm, superior to CRC32 (and even faster, I believe).
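For illustration, a tiny standalone example of the LZ4 block API plus an XXH32 checksum, assuming lz4.h/lz4.c and xxhash.h/xxhash.c have simply been dropped into the build:

    #include <stdio.h>
    #include "lz4.h"     /* LZ4_COMPRESSBOUND, LZ4_compress_default */
    #include "xxhash.h"  /* XXH32 */

    int main(void){
      const char src[] = "hello hello hello hello hello hello hello";
      const int srcLen = (int)sizeof(src);
      char dst[LZ4_COMPRESSBOUND(sizeof(src))];  /* worst-case output size */
      int cLen = LZ4_compress_default(src, dst, srcLen, (int)sizeof(dst));
      if( cLen<=0 ) return 1;                    /* compression failed */
      unsigned sum = (unsigned)XXH32(src, (size_t)srcLen, 0 /* seed */);
      printf("%d -> %d bytes, xxh32=%08x\n", srcLen, cLen, sum);
      return 0;
    }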
So while I agree sqlar's value would be diminished if not all users can interop their archives across systems and builds of sqlar, I do think ZLib's slow compression and even slower decompression (compared to LZ4) would make it worthwhile to have a different built-in speed-vs-compression tradeoff with LZ4.
FWIW :). I know full well that's unlikely to happen of course. Just saying though.
We all know SQLite Core is not open to contributions.
The situation for official extensions (like sqlar) is less clear though. I suspect the community would be willing to do some of the things proposed here, but without clearer signs that such extension contributions would be welcome, nothing happens, except in little-known repos on GitHub, as forks, which thus can't get traction.
(4) By Yair lenga (Yairlenga) on 2022-06-16 23:07:32 in reply to 3 [link] [source]
Thanks for the feedback. I believe it is possible to extend sqlar in such a way that everyone can still use zlib (and potentially some other embeddable engines such as LZ4, if there is a need). The key thing about the extension is that users with the appropriate plugins (or with special cases that will benefit from a specific compression) will be able to leverage the sqlar infrastructure (command line, integration with SQLite). I will work on a more detailed proposal and share it.
For SQLite instances embedded in scripting engines (Python, Perl, Lua), this will allow sqlar to be used with compression supported by the native scripts. Lots of flexibility.
Is this the right forum, or should I post to an sqlar forum?
(5) By Stephan Beal (stephan) on 2022-06-17 01:31:32 in reply to 4 [link] [source]
Is this the right forum, or should I post to an sqlar forum?
This is the right place for the sqlite3 C library and its closely-related tools (like sqlar).
(7) By Yair lenga (Yairlenga) on 2022-06-17 07:47:27 in reply to 2 [link] [source]
See my reply to ddevienne for comments about portability.
Wanted to highlight why sqlar provides unique advantages vs other archivers:
- Vs. tar: tar has to read the whole archive to extract a single file (or a few files).
- Vs. zip: zip has to read the whole central directory to extract a single file (or a few files).

For use cases with a large number of relatively small items, sqlar outperforms both tools as an archiver (see the sketch below), not even counting all the other flexibility built into it.
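To illustrate: because sqlar.name is the archive table's PRIMARY KEY, pulling out one member is a single indexed lookup. A minimal sketch in C, assuming SQLite's sqlar extension (which provides the sqlar_uncompress() SQL function) has been loaded:

    #include <stdio.h>
    #include <sqlite3.h>

    /* Print the uncompressed size of one archive member.  This touches
    ** only the row for zName, never the rest of the archive. */
    static int member_size(sqlite3 *db, const char *zName){
      sqlite3_stmt *pStmt;
      int rc = sqlite3_prepare_v2(db,
          "SELECT length(sqlar_uncompress(data, sz))"
          " FROM sqlar WHERE name=?1", -1, &pStmt, 0);
      if( rc!=SQLITE_OK ) return rc;
      sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC);
      if( sqlite3_step(pStmt)==SQLITE_ROW ){
        printf("%s: %d bytes\n", zName, sqlite3_column_int(pStmt, 0));
      }
      return sqlite3_finalize(pStmt);
    }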
Regarding portability: many applications register application-specific hooks (SQLite functions) into SQLite (mine is using Perl), leveraging the use case of SQLite as application storage. Each of those use cases makes the SQLite data technically "non-portable", in the sense that the data cannot be processed without the hosting application. The solution is to use the right tool for the right task: for long-term, portable archives, use zip; otherwise, be flexible. IMHO.
Yair.
(6) By Simon Slavin (slavin) on 2022-06-17 07:24:50 in reply to 1 [link] [source]
We've discussed things like this before, so it's worth setting out a few things.
Are you discussing a way of compressing (1) individual fields of a SQLite database (e.g. each BLOB value compressed separately, with no reference to any other BLOB value), (2) each row separately, or (3) the entire database ? The first is easy, but doesn't give as much compression advantage. The second is possible, given understanding of how SQLite works. The third presents problems, since SQLite does not read its database in a serial manner, so is hideously inefficient for most compression algorithms. Unfortunately, unless your individual values are repetitive (e.g. a BLOB represents a large amount of data with lots of individual values in it), only (3) will lead to much saving of space.
Lastly, although SQLite is Open Source, it does not usually take contributions from people outside the development team. The code in SQLite is written by the development team. New code is accompanied by new tests, and runs of old tests to make sure it doesn't break anything old. Comments and tests are written to certain standards.
(8) By Yair lenga (Yairlenga) on 2022-06-17 07:54:36 in reply to 6 [link] [source]
I refer to case 1. In my case: 500K blobs, around 1 MB each. Measured savings for my case: 7:1 compression with zip, more than 10:1 with Brotli. Makes a big difference. Yair
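Taking those figures at face value, a rough back-of-the-envelope: 500,000 blobs x ~1 MB is about 500 GB uncompressed; 7:1 compression brings that to roughly 71 GB, while 10:1 brings it to roughly 50 GB, i.e. on the order of 20 GB of additional savings from the better engine.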
(9) By Donal Fellows (dkfellows) on 2022-06-17 08:30:24 in reply to 8 [link] [source]
If you have big compressible BLOBs, then you definitely can benefit from compressing them. This is actually the easiest case, as it doesn't need deep cooperation from the DB engine. The first thing I'd think about trying is exporting the compression and decompression algorithms needed as SQLite functions (BLOB→BLOB) since then they could be applied by just updating the right SQL calls. That's pretty easy to get working; I've done similar things with password hashing.
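A minimal sketch of that approach in C, using zlib; the zcompress/zuncompress names and the 4-byte size prefix are conventions invented for this example, not any standard format:

    #include <sqlite3.h>
    #include <zlib.h>

    /* zcompress(X): deflate a blob, prefixed with the original size so
    ** that zuncompress() knows how big a buffer to allocate. */
    static void zcompressFunc(sqlite3_context *ctx, int argc, sqlite3_value **argv){
      const unsigned char *aIn = sqlite3_value_blob(argv[0]);
      uLong nIn = (uLong)sqlite3_value_bytes(argv[0]);
      uLongf nOut = compressBound(nIn);
      unsigned char *aOut = sqlite3_malloc64(4+nOut);
      (void)argc;
      if( aOut==0 ){ sqlite3_result_error_nomem(ctx); return; }
      aOut[0] = (nIn>>24)&0xff; aOut[1] = (nIn>>16)&0xff;  /* big-endian size */
      aOut[2] = (nIn>>8)&0xff;  aOut[3] = nIn&0xff;
      if( compress(aOut+4, &nOut, aIn, nIn)!=Z_OK ){
        sqlite3_free(aOut);
        sqlite3_result_error(ctx, "compress failed", -1);
        return;
      }
      sqlite3_result_blob(ctx, aOut, (int)(4+nOut), sqlite3_free);
    }

    /* zuncompress(X): the inverse of zcompress(). */
    static void zuncompressFunc(sqlite3_context *ctx, int argc, sqlite3_value **argv){
      const unsigned char *aIn = sqlite3_value_blob(argv[0]);
      int nIn = sqlite3_value_bytes(argv[0]);
      uLongf nOut;
      unsigned char *aOut;
      (void)argc;
      if( nIn<4 ){ sqlite3_result_error(ctx, "not a zcompress blob", -1); return; }
      nOut = ((uLong)aIn[0]<<24)|((uLong)aIn[1]<<16)|((uLong)aIn[2]<<8)|aIn[3];
      aOut = sqlite3_malloc64(nOut ? nOut : 1);
      if( aOut==0 ){ sqlite3_result_error_nomem(ctx); return; }
      if( uncompress(aOut, &nOut, aIn+4, (uLong)(nIn-4))!=Z_OK ){
        sqlite3_free(aOut);
        sqlite3_result_error(ctx, "uncompress failed", -1);
        return;
      }
      sqlite3_result_blob(ctx, aOut, (int)nOut, sqlite3_free);
    }

    int registerCompression(sqlite3 *db){
      int rc = sqlite3_create_function(db, "zcompress", 1,
                   SQLITE_UTF8|SQLITE_DETERMINISTIC, 0, zcompressFunc, 0, 0);
      if( rc==SQLITE_OK ){
        rc = sqlite3_create_function(db, "zuncompress", 1,
                   SQLITE_UTF8|SQLITE_DETERMINISTIC, 0, zuncompressFunc, 0, 0);
      }
      return rc;
    }

Once registered, applying it really is just a matter of updating the SQL, e.g. UPDATE mytable SET data = zcompress(data) (table name hypothetical).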
But then I'd have a long careful look at the performance of things, specifically whether putting the compression and decompression there was causing overly long transaction times. If you're only really accessing things single-threaded, that doesn't matter too much, but in multi-threaded or multi-process scenarios this is an important consideration as you want write transactions short and read transactions not too long. How much you need to care depends on your application's needs so I can't be more specific.
If performance requirements mean you can't put the compression and decompression inside the SQL, you just have to handle that outside in your application code.
(10) By anonymous on 2022-06-17 16:30:28 in reply to 9 [link] [source]
To emphasize: my comments are only related to SQLAR, the built-in functionality in SQLite to maintain an archive. I'm not suggesting extending SQLite to support compression engines as core functionality (e.g., PostgreSQL TOAST), as this would result in a much more complex implementation.