sqlar: is it reliable to use length(sqlar.blob)==sqlar.sz to determine whether the data is compressed?

(1) By 6kEs4Majrd on 2021-08-09 04:02:53

https://sqlite.org/sqlar/doc/trunk/README.md

It says

The file is compressed if length(sqlar.blob)<sqlar.sz and is stored as plaintext if length(sqlar.blob)==sqlar.sz.

But it also says

the zlib format contains a two-byte compression-type identification header (0x78 0x9c) and a 4-byte checksum at the end

Could length(sqlar.blob)==sqlar.sz also hold for compressed data, given that the header and checksum could cancel out the size reduction achieved by the compression?
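
To make the concern concrete, here is a minimal Python sketch (not sqlar code) using the standard `zlib` module. It only illustrates that a raw zlib stream can come out larger than its input once the header and checksum are added; the sample inputs are arbitrary choices for illustration.

```python
import zlib

# 256 distinct byte values: there are no repeats for deflate to exploit.
incompressible = bytes(range(256))
raw_zlib = zlib.compress(incompressible)

# The zlib stream ends up larger than the input once framing is added.
print(len(incompressible), len(raw_zlib))

# A highly repetitive input shrinks despite the same framing overhead.
repetitive = b"a" * 256
small_zlib = zlib.compress(repetitive)
print(len(repetitive), len(small_zlib))
```

So a raw zlib stream on its own can exceed the original size, which is what prompts the question about what sqlar actually stores in that case.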

(2) By Larry Brasfield (larrybr) on 2021-08-09 05:22:33 in reply to 1

If you look at the code where the compress/leave-as-is decision is made, you will see that the phenomenon over which you worry is already taken into account. Look at line 359 in particular, where the "compressed" size used to decide whether to use compression already has the overhead header/trailer info incorporated.

(3) By 6kEs4Majrd on 2021-08-09 11:33:30 in reply to 2

I don't follow the code. Could you explain in plain language how it determines whether the data is compressed in the case that I mentioned?

(4) By Stephan Beal (stephan) on 2021-08-09 11:38:29 in reply to 3

I don't follow the code. Could you explain in plain language how it determines whether the data is compressed in the case that I mentioned?

In the very last block it checks whether the compressed data is smaller than the original. If so, it keeps the compressed data, else the compression is discarded and the original is retained.
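
In rough terms, that decision could be sketched like this in Python. This is an illustration of the rule described above, not the actual sqlar C code, and `pack_entry` is a hypothetical name chosen for the example.

```python
import zlib
from typing import Tuple

def pack_entry(content: bytes) -> Tuple[int, bytes]:
    """Return (sz, blob): the original size and the bytes that get stored."""
    # The compressed length already includes the zlib header and checksum.
    compressed = zlib.compress(content)
    if len(compressed) < len(content):
        return len(content), compressed  # strictly smaller: keep it
    return len(content), content         # same size or larger: store as-is

for sample in (b"a" * 1000, bytes(range(256)), b""):
    sz, blob = pack_entry(sample)
    print(sz, len(blob), "compressed" if len(blob) < sz else "stored as-is")
```

Under this rule the stored blob is never longer than `sz`, and it has exactly `sz` bytes only when the original content was kept.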

(5) By 6kEs4Majrd on 2021-08-10 03:23:50 in reply to 4

I am not asking how the code works.

I am asking why it makes sense that there is no compression when length(sqlar.blob)==sqlar.sz, and that there is compression when length(sqlar.blob)<sqlar.sz.

Is it possible that length(sqlar.blob)>sqlar.sz given the header/checksum can increase the length?

(6.1) By Stephan Beal (stephan) on 2021-08-10 04:18:45 edited from 6.0 in reply to 5

I am not asking how the code works.

That's precisely what you asked for before: "plain language" explanation of the code.

I am asking why it makes sense that there is no compression when length(sqlar.blob)==sqlar.sz, and that there is compression when length(sqlar.blob)<sqlar.sz.

My previous response answers that. sqlar will never produce compressed data which is the same size or larger than the original. If compression (including the header) would be the same size or larger than the original, the compressed results are discarded and the original data is used in its place.
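
Because of that rule, a reader can rely on the length comparison alone. Here is a hedged Python sketch of the read side, using an in-memory database. It assumes the table layout as I understand it from the README, where the content column is named `data` (the README sentence writes `sqlar.blob`), so treat the column name as an assumption. `unpack_entry` and the sample file names are hypothetical.

```python
import sqlite3
import zlib

def unpack_entry(sz: int, blob: bytes) -> bytes:
    """Recover the original content from a stored (sz, blob) pair."""
    # Shorter than sz means compressed; equal length means stored as-is.
    return zlib.decompress(blob) if len(blob) < sz else blob

con = sqlite3.connect(":memory:")
# Assumed schema; the column names are not confirmed by this thread.
con.execute(
    "CREATE TABLE sqlar(name TEXT PRIMARY KEY, mode INT, mtime INT,"
    " sz INT, data BLOB)"
)
samples = {"rep.txt": b"a" * 1000, "rand.bin": bytes(range(256))}
for name, content in samples.items():
    z = zlib.compress(content)
    stored = z if len(z) < len(content) else content  # the rule described above
    con.execute(
        "INSERT INTO sqlar VALUES (?, 0, 0, ?, ?)", (name, len(content), stored)
    )

rows = con.execute(
    "SELECT name, sz, length(data) < sz AS is_compressed, data"
    " FROM sqlar ORDER BY name"
).fetchall()
for name, sz, is_compressed, data in rows:
    assert unpack_entry(sz, data) == samples[name]
    print(name, sz, len(data), bool(is_compressed))
```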

(7) By 6kEs4Majrd on 2021-08-10 05:36:57 in reply to 6.1

This makes more sense now.

The README.md is not very clear about this. Can anybody update it?

https://sqlite.org/sqlar/doc/trunk/README.md

(8.1) By Warren Young (wyoung) on 2021-08-10 18:40:22 edited from 8.0 in reply to 7

Deleted

(9) By Larry Brasfield (larrybr) on 2021-08-10 18:27:30 in reply to 7

I can see no reason to update that doc. It plainly says, "The file is compressed if length(sqlar.blob)<sqlar.sz and is stored as plaintext if length(sqlar.blob)==sqlar.sz." The commentary regarding compressed storage format details does not contradict that plain statement and can most sensibly be read as being additional detail rather than an exception. To block your strained over-interpretation would take additional verbiage which would only dilute the content for most readers, or cause them to wonder why it needed saying.