Extend the ZIP vtable
(1) By ddevienne on 2021-06-22 11:41:45 [source]
I'm dealing with very large ZIP files (> 300K entries, multi-gigabyte),
and given how slow zlib is (50 - 80 MB/s, depending on level and data),
and the fact that I'm running on a large desktop machine with many cores, I'd like to
parallelize the compression side, and leave the raw ZIP I/O to the vtable.
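The "compress outside, in parallel" part could be sketched as below. This is a hypothetical illustration, not part of zipfile.c: each entry is deflated and its CRC-32 computed off to the side, yielding the (sz, rawdata, crc) triple per entry that would then be handed to the vtable. Python's zlib releases the GIL while compressing, so even a thread pool scales across cores here.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def deflate_entry(data: bytes, level: int = 6):
    """Return (sz, rawdata, crc) for one ZIP entry.

    ZIP stores raw deflate streams (no zlib header/trailer),
    hence wbits = -15.
    """
    co = zlib.compressobj(level, zlib.DEFLATED, -15)
    rawdata = co.compress(data) + co.flush()
    return len(data), rawdata, zlib.crc32(data)

def deflate_all(entries, level: int = 6, workers: int = 8):
    """Compress many entries in parallel; order is preserved."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: deflate_entry(d, level), entries))
```

The same split works with a process pool for CPU-bound levels; the point is only that compression and CRC computation need nothing from the ZIP container itself.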
On the decompression (read) side, the current vtable already has access to the
compressed data via the
rawdata column, but not to the CRC32 for that entry, to
be used by the outside parallel code to check the uncompressed data.
The compressed size is implicit from the blob returned for rawdata,
although I guess that makes
length(rawdata) more expensive than it should be, no?
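The check I'd like to run outside SQLite, sketched here with hypothetical inputs (the expected_crc value is exactly what the missing CRC32 column would provide):

```python
import zlib

def verify_entry(rawdata: bytes, expected_crc: int, expected_sz: int) -> bytes:
    """Inflate a raw-deflate blob (wbits = -15, as stored in ZIP entries)
    and validate it against the entry's CRC-32 and uncompressed size.

    Raises ValueError on mismatch, returns the uncompressed data otherwise.
    """
    data = zlib.decompressobj(-15).decompress(rawdata)
    if len(data) != expected_sz or zlib.crc32(data) != expected_crc:
        raise ValueError("corrupt entry")
    return data
```

Many such calls can run concurrently over the rawdata blobs, which is the whole motivation for exposing the CRC to the caller.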
On the compression (write) side, both the uncompressed size
sz and the compressed
rawdata must be
NULL for now, while in my case I'd like them to not be (and instead have
data be NULL, since the compression happens outside).
And there's again no CRC32 column exposed, to write it directly (computed outside, in parallel, like the
compression itself).
If I were to change zipfile.c to expose
crc and compressed-size columns,
and allow writing
(sz, rawdata, crc), would such a contribution have a chance to be incorporated?
Also, since those columns would avoid calling length(data) (which
sz replaces) and
length(rawdata) (no equivalent column for now),
should we override
length() for that vtable? Is there even a way to know
length() is called
on one of our special virtual columns, to answer the question directly, w/o actually reading and allocating
those blob values, when all one wants is their lengths?
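For what it's worth, the ZIP central directory already records the CRC-32 and both sizes per entry, so those lengths are knowable without touching the blobs at all. Python's zipfile module (used here just to illustrate the format, independently of the SQLite vtable) shows the metadata being available up front:

```python
import io
import zipfile

# Build a small in-memory archive with one deflated entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("entry.txt", b"some payload " * 64)

# Reading the central directory alone yields the CRC-32 plus the
# compressed and uncompressed sizes -- no entry data is inflated.
with zipfile.ZipFile(buf) as z:
    info = z.getinfo("entry.txt")
    print(info.file_size, info.compress_size, hex(info.CRC))
```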