Extend the ZIP vtable

Hi,

I'm dealing with very large ZIP files (> 300K entries, multiple gigabytes), and given how slow ZLib is (50 - 80 MB/s, depending on level and data), and the fact that I'm running on a large desktop machine with many cores, I'd like to parallelize the compression work and leave the raw ZIP I/O to the [vtable](https://www.sqlite.org/zipfile.html).
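
For context, this is roughly how I read the raw entries today and hand them off to worker threads (the archive name `big.zip` is just a placeholder):

```sql
-- Pull the entry metadata and the still-compressed bodies out of the archive;
-- the actual inflate() calls then run on worker threads, outside SQLite.
SELECT name, mode, mtime, sz, method, rawdata
  FROM zipfile('big.zip');
```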

On the decompression (read) side, the current vtable already exposes the raw compressed data via the `rawdata` column, but not the CRC32 for that entry, which the outside parallel code needs in order to check the data it decompresses itself.
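
On the read side I'm picturing something like this (the `crc` column is hypothetical, it doesn't exist in today's vtable):

```sql
-- Hypothetical 'crc' column: the CRC32 comes straight from the entry's
-- central-directory record, so the external code can verify what it inflates.
SELECT name, sz, rawdata, crc
  FROM zipfile('big.zip');
```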

The compressed size is implicit in the `rawdata` blob itself, although I guess that makes `length(rawdata)` more expensive than it should be, no?

On the compression (write) side, both the uncompressed size `sz` and `rawdata` must currently be `NULL`, whereas in my case I'd like to supply them (and leave `data` NULL instead). And again there's no CRC32 column exposed to write directly (computed outside, in parallel, like the `rawdata`).
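
Concretely, I'd like a write path along these lines (hypothetical: today only `data` may be supplied, the `crc` column doesn't exist, and the bound parameters are placeholders):

```sql
-- Hypothetical write path: store an entry that was deflated (and CRC32'd)
-- outside SQLite, on a worker thread.  method 8 = deflate.
CREATE VIRTUAL TABLE temp.zip USING zipfile('big.zip');
INSERT INTO temp.zip(name, mtime, sz, rawdata, crc, method)
VALUES ('path/in/archive', 1700000000, :uncompressed_size, :deflated_blob, :crc32, 8);
```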

If I were to change [zipfile.c](https://sqlite.org/src/file/ext/misc/zipfile.c) to expose `crc` and *compressed-size* columns (e.g. `szz`), and to allow writing `(sz, rawdata, crc)`, would such a contribution have a chance of being incorporated?

Regarding `length(data)` (which `sz` replaces) and `length(rawdata)` (no equivalent column for now), should we override `length()` for that vtable? Is there even a way to know that `length()` is being called on one of these special virtual columns, so we could answer directly, without actually reading and allocating those blob values, when all one wants is their lengths?
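
To make the cost difference concrete (again with the hypothetical `szz` column):

```sql
-- Today: measuring the compressed size means materializing the whole
-- compressed blob for every row, only to take its length.
SELECT name, length(rawdata) FROM zipfile('big.zip');

-- What I'd like: the compressed size read directly from the entry's
-- central-directory record, with no blob allocation.
SELECT name, szz FROM zipfile('big.zip');
```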

Thanks, --DD