SQLite Forum

Faster, more compact sqlar archives

(1) By Yair lenga (Yairlenga) on 2022-04-18 05:22:09

I am planning to use an SQLite archive to store a large number of documents in compressed format.

The documents share the same structure. I measured compression with regular deflate versus deflate with a predefined dictionary (an arbitrary document from the repo) and achieved a 2.5x saving. Unfortunately, the archive command and the underlying compress and uncompress functions do not offer the ability to use a predefined dictionary.

I believe that adding an option to the .archive command to specify the docid of the dictionary (which would be passed to deflateSetDictionary) can deliver a significant performance boost for the use case where there are many “similar” documents of small to medium size (depending on the actual file content, up to 2-3 times the size of the compression window).
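To illustrate the effect outside of SQLite, here is a minimal sketch using Python's zlib module, whose zdict parameter presets the deflate dictionary in the same way deflateSetDictionary does. The sample documents and the helper names compress_with_dict and decompress_with_dict are made up for illustration; they are not part of the .archive command or of any SQLite extension.

```python
import json
import zlib


def compress_with_dict(data: bytes, dictionary: bytes) -> bytes:
    """Deflate `data` using `dictionary` as a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=dictionary)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    """Inflate `blob`; the same dictionary must be supplied again."""
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()


# Made-up documents that share one structure and differ in a single field.
docs = [
    json.dumps({
        "customer_id": i,
        "status": "active",
        "currency": "USD",
        "shipping_address": {"country": "US", "city": "New York"},
    }).encode()
    for i in range(3)
]

dictionary = docs[0]                      # an arbitrary document as dictionary
plain = zlib.compress(docs[1], 9)         # regular deflate
with_dict = compress_with_dict(docs[1], dictionary)

assert decompress_with_dict(with_dict, dictionary) == docs[1]
print(len(docs[1]), len(plain), len(with_dict))
```

The gain depends entirely on how much each document overlaps with the dictionary, so the ratio printed here says nothing about real data.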

Of course, it is important to keep track of the dictionary; otherwise it is not possible to uncompress. I can think of a few alternatives:

  1. (The solution I do not like) Require the extract command to provide the dictionary. Simple to implement, but not user-friendly; good only for a POC.

  2. Clone the source document under a temporary name (e.g. dict-NNNNN, where NNNNN is an autogenerated unique id, potentially the hash value of the dictionary or some other persistent id), and store it in the archive table as an immutable document. The table holding the documents would be extended to store the dictionary id.

  3. (The solution I like) Create an extra table to hold the dictionaries, similar to the above. Cleaner from a data model perspective, but requires slightly more space.
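A rough sketch of the third alternative, again in Python rather than in the shell's C code. Everything named here is hypothetical: the tables sqlar_dict and docs, the dict_id column, and the three helper functions are illustrations only and do not match the real sqlar schema or any existing SQLite feature. The dictionary id is the SHA-256 of the dictionary content, as suggested in alternative 2.

```python
import hashlib
import sqlite3
import zlib

# Hypothetical schema: one table for dictionaries, one sqlar-like table
# whose rows record which dictionary was used to compress them.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sqlar_dict(
  id   TEXT PRIMARY KEY,   -- SHA-256 hex digest of the dictionary
  data BLOB NOT NULL       -- the dictionary itself, stored once
);
CREATE TABLE IF NOT EXISTS docs(
  name    TEXT PRIMARY KEY,
  sz      INT,             -- uncompressed size
  dict_id TEXT REFERENCES sqlar_dict(id),
  data    BLOB             -- deflate stream produced with the dictionary
);
"""


def add_dictionary(db: sqlite3.Connection, dictionary: bytes) -> str:
    """Store a dictionary once and return its content-derived id."""
    dict_id = hashlib.sha256(dictionary).hexdigest()
    db.execute(
        "INSERT OR IGNORE INTO sqlar_dict(id, data) VALUES (?, ?)",
        (dict_id, dictionary),
    )
    return dict_id


def add_document(db: sqlite3.Connection, name: str, content: bytes,
                 dict_id: str) -> None:
    """Compress `content` with the stored dictionary and save it."""
    (dictionary,) = db.execute(
        "SELECT data FROM sqlar_dict WHERE id = ?", (dict_id,)
    ).fetchone()
    comp = zlib.compressobj(level=9, zdict=dictionary)
    blob = comp.compress(content) + comp.flush()
    db.execute(
        "INSERT OR REPLACE INTO docs(name, sz, dict_id, data) "
        "VALUES (?, ?, ?, ?)",
        (name, len(content), dict_id, blob),
    )


def extract_document(db: sqlite3.Connection, name: str) -> bytes:
    """Look up the document together with its dictionary and inflate it."""
    blob, dictionary = db.execute(
        "SELECT d.data, k.data FROM docs AS d "
        "JOIN sqlar_dict AS k ON k.id = d.dict_id WHERE d.name = ?",
        (name,),
    ).fetchone()
    decomp = zlib.decompressobj(zdict=dictionary)
    return decomp.decompress(blob) + decomp.flush()


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
dict_id = add_dictionary(db, b'{"status": "active", "currency": "USD"}')
add_document(db, "a.json", b'{"status": "active", "currency": "EUR"}', dict_id)
assert extract_document(db, "a.json") == b'{"status": "active", "currency": "EUR"}'
```

Because the dictionary id travels with each row, extraction needs no extra argument from the user, which is the drawback of the first alternative.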

I do not have experience updating SQLite code, but after reviewing the source code for .archive it seems straightforward to implement. I would hope the current maintainers of this function can help with integration and feedback. Of course, I am also looking for feedback from other users.

Thanks, Yair

(2) By Vadim Goncharov (nuclight) on 2022-05-08 12:17:05 in reply to 1

I vote not only for this, but also for the same in the compress() and uncompress() extension functions.

BTW, I have a question about the eponymous FTS4 options (compress= and uncompress=): if I implement them as application-defined functions, to be called from FTS code, will I be able to SELECT a dictionary document from the same database, i.e. is it OK to re-enter SQLite code recursively like this?

While here, I also ask for them (compress()/uncompress()), together with fossil_delta, to be added to the set of available compile-time options, so that they may appear as configurable options, e.g. in Linux distros.