Extend the ZIP vtable
(1) By ddevienne on 2021-06-22 11:41:45 [source]
I'm dealing with very large ZIP files (> 300K entries, multi-gigabyte),
and given how slow zlib is (50 - 80 MB/s, depending on level and data),
and the fact that I'm running on a large desktop machine with many cores, I'd like to
parallelize the compression side, and leave the raw ZIP I/O to the vtable.
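The "compress outside, in parallel" part could be sketched as below. This is a hypothetical illustration, not part of zipfile.c: each entry is deflated and its CRC-32 computed off to the side, yielding the (sz, rawdata, crc) triple per entry that would then be handed to the vtable. Python's zlib releases the GIL while compressing, so even a thread pool scales across cores here.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def deflate_entry(data: bytes, level: int = 6):
    """Return (sz, rawdata, crc) for one ZIP entry.

    ZIP stores raw deflate streams (no zlib header/trailer),
    hence wbits = -15.
    """
    co = zlib.compressobj(level, zlib.DEFLATED, -15)
    rawdata = co.compress(data) + co.flush()
    return len(data), rawdata, zlib.crc32(data)

def deflate_all(entries, level: int = 6, workers: int = 8):
    """Compress many entries in parallel; order is preserved."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda d: deflate_entry(d, level), entries))
```

The same split works with a process pool for CPU-bound levels; the point is only that compression and CRC computation need nothing from the ZIP container itself.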
On the decompression (read) side, the current vtable already has access to the
compressed data via the
rawdata column, but not to the CRC32 for that entry, to
be used by the outside parallel code to check the uncompressed data.
The compressed size is implicit from the blob returned for rawdata,
although I guess that makes
length(rawdata) more expensive than it should be, no?
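The check I'd like to run outside SQLite, sketched here with hypothetical inputs (the expected_crc value is exactly what the missing CRC32 column would provide):

```python
import zlib

def verify_entry(rawdata: bytes, expected_crc: int, expected_sz: int) -> bytes:
    """Inflate a raw-deflate blob (wbits = -15, as stored in ZIP entries)
    and validate it against the entry's CRC-32 and uncompressed size.

    Raises ValueError on mismatch, returns the uncompressed data otherwise.
    """
    data = zlib.decompressobj(-15).decompress(rawdata)
    if len(data) != expected_sz or zlib.crc32(data) != expected_crc:
        raise ValueError("corrupt entry")
    return data
```

Many such calls can run concurrently over the rawdata blobs, which is the whole motivation for exposing the CRC to the caller.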
On the compression (write) side, both the uncompressed size
sz and the compressed
rawdata must be
NULL for now, while in my case I'd like them to not be (and instead have
data be NULL, since the compression happens outside).
And there's again no CRC32 column exposed, to write it directly (computed outside, in parallel, like the
compression itself).
If I were to change zipfile.c to expose
crc and compressed-size columns,
and allow writing
(sz, rawdata, crc), would such a contribution have a chance to be incorporated?
Also, since those columns would avoid calling length(data) (which
sz replaces) and
length(rawdata) (no equivalent column for now),
should we override
length() for that vtable? Is there even a way to know
length() is called
on one of our special virtual columns, to answer the question directly, w/o actually reading and allocating
those blob values, when all one wants is their lengths?
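For what it's worth, the ZIP central directory already records the CRC-32 and both sizes per entry, so those lengths are knowable without touching the blobs at all. Python's zipfile module (used here just to illustrate the format, independently of the SQLite vtable) shows the metadata being available up front:

```python
import io
import zipfile

# Build a small in-memory archive with one deflated entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("entry.txt", b"some payload " * 64)

# Reading the central directory alone yields the CRC-32 plus the
# compressed and uncompressed sizes -- no entry data is inflated.
with zipfile.ZipFile(buf) as z:
    info = z.getinfo("entry.txt")
    print(info.file_size, info.compress_size, hex(info.CRC))
```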