binary .dump format

(1) By hitchmanr on 2021-10-13 05:10:19 [link]

Is there a way to get a file similar to what .dump outputs in a binary format? The hex encoding of binary blobs doubles the output size for my use case.

I'd like to backup a large (~60GB) database with minimal downtime. .backup makes a copy of the database with an acceptable locking period, but each copy has many incidental pointer-level differences depending on the precise page layouts, reducing the effectiveness of deduplicating backup tools that match longer runs of content.

The dump format is much better for deduplication, apart from the unnecessary extra size from formatting and parsing overhead required to import. A binary dump format would be ideal, perhaps reusing the record format already present.

(2) By Andreas Kupries (andreas-kupries) on 2021-10-13 07:22:46 in reply to 1 [link]

Have you tried using a textual .dump followed by a compression stage, be it `gzip`, `bzip2`, `xz`, or `zstd` ?

Would that get you back into a size range you would consider acceptable ?

Note that AFAIK no binary .dump format has been defined for sqlite itself.

(11.3) By hitchmanr on 2021-10-13 16:52:34 edited from 11.2 in reply to 2 [link]

Compression tends to reduce the effectiveness of deduplication, since the compressed versions of similar files have very different byte sequences. I'd prefer to avoid the extra overhead of compression altogether.

There *are* some attempts to improve it (like gzip --rsyncable), so this is still worth a try. Even better, zstd has a --rsyncable flag, so let's try:

```
$ sqlite3 ext.db .dump | zstd --rsyncable -o ext.dump.zstd
36.63%   (4217978198 => 1544940304 bytes, ext.dump.zstd) 
$ time sqlite3 posts.db .dump | zstd --rsyncable -o posts.dump.zstd
51.47%   (98522848018 => 50713380137 bytes, posts.dump.zstd) 
real	31m16.458s
```

This produces a compressed dump that's 20-50% smaller than the input DBs (!) (partially thanks to omitted indexes), but `sqlite3 .dump` maxes out a CPU core and only produces output at ~50MB/s, which is much slower than a vacuum operation. I'll test again after 24 hours to see how big the incremental is, but this is probably the best solution without requiring additional code-- just somewhat slow thanks to all the extra text formatting operations.

Edit: actually, this is even better than I realized. In WAL mode the .dump reader doesn't block writers, so spending 30 minutes generating a dump isn't a real problem!

(3) By Ryan Smith (cuz) on 2021-10-13 09:51:58 in reply to 1 [link]

A binary dump format is not only not available, it's not even theoretically possible since any byte sequence can be present in any blob which makes format bytes impossible, or at least very difficult - something like MIME boundaries works but is heavy on checking for boundary matches in content, which doesn't matter much for typical e-mail, but will for DBs. It also means even small blobs will add large boundary identifiers, which will take away from the saving.

I suppose a best-case enhancement for large blob data would be 3-to-4 encoding (aka Base64) rather than the current hex representation in .dump files, which gives you a ~33% increase on blob size in stead of ~100%[1] increase and is de-duplication safe.

This change, or alternate ability will be needed on both the .dump side and the .import side, perhaps indicated with starting a Base64 binary with '0z' in stead of '0x' used for Hex, or some such. Should not be much effort to implement either, but it's certainly not available at this point.

Andreas' compression suggestion is better for full file deduplication mechanisms, but if you have intra-file data section (byte-run) dedupliaction it won't work, plus if the data is similar through blobs it will reduce size well, but if the data blobs are random or noise-like, compression would do more harm than good (same reason why PNGs of photos are much much larger than JPEGs, yet a PNG of a solid colour background weighs almost nothing while its JPEG is still huge).
Text compresses quite well (because of similarity of a language's lexemes and morphemes), so if the ratio of text-to-blobs in the DB is high, then compression may be quite good regardless of the above. I suppose testing is needed for your data. Either way, I think that suggestion is the most viable avenue for you to explore if size-saving is your goal.

[1] I use approximations ~nn% because Hex strings include the '0x' added to a blob, if the blob was 3 bytes long, the resulting text would be 2 + 3 x 2 = 8 bytes long, which is nearly a 200% increase in size, and as blob length increase, that percentage approaches 100%. Similarly Base64 will have such a marker, but it also requires 3-byte alignment so can end with one or two "=" characters, producing a saw-tooth size-gain graph that smooths out (--> 33%) for larger blobs.
i.e - Lots of small blobs are the worst, whatever encoding you use.

(5) By Andreas Kupries (andreas-kupries) on 2021-10-13 10:15:17 in reply to 3 [link]

> A binary dump format is not only not available, it's not even theoretically possible since any byte sequence can be present in any blob which makes format bytes impossible, or at least very difficult - something like MIME boundaries works but is heavy on checking for boundary matches in content, which doesn't matter much for typical e-mail, but will for DBs. It also means even small blobs will add large boundary identifiers, which will take away from the saving.

I disagree here. A format using a length/value format would need no encoding of the blobs, nor boundary markers. I.e. each value, binary or not, would be preceded by its length in bytes. This could a be simple fixed 64bit unsigned int, or some kind of varint coding to keeping coding of small numbers short(er). Lots of varints codings to choose from.

A natural extension of LV is TLV, i.e. adding a tag before the LV sequence. The tag describes what the value is for. Lots of binary formats using this kind of structure (TIFF, JPEG, EXIF, PNG, etc. ...)

Note that I read the original request as a binary format for the entire dump, and not just the binary blobs he has.

(6) By ddevienne on 2021-10-13 10:18:23 in reply to 3 [link]

> it's not even theoretically possible 

Not really. Binary dumps typically just prefix arbitrary byte sequences with a length.  
Yes, that makes the dump *context-dependent*, and any missing or corrupt byte can make  
the whole dump corrupt, but they are ways to detect such cases too.

You could also encode (base16, base64, ascci85) small blobs in the main dump file,  
and offload larger blobs into a separate paged companion file. One could even use  
content addressable schemes (e.g SHA-3 like Fossil) for blobs, gaining blob dedup too.

For such a large DB, using a snapshot'ing file-system, like ZFS, would allow instant  
*shallow copies*, which would allow to compute a diff (via SQLite's own DB diff tool)  
where time is not as critical, and keeping only the diff for backups perhaps.

(8) By Ryan Smith (cuz) on 2021-10-13 12:25:21 in reply to 6 [link]

Yes I agree, my contemplation imagined something that will be handled by a text processor and even editor, which can still use the mentioned companion files, or if text-editing isn't required, then length sections can work where context dependence is a bigger issue than corrupt bytes I think, but true, in such a case it's certainly not "theoretically impossible".

The TLV suggestion can work also, but then the entire data should be TLV sections, (even Schema, normal data) like it is in the file-types Andreas mentioned, avoiding context dependence and simplifying readers/writers. Much like the usual way of serializing data.

Either way, the moment you can no longer load an sqlite dump into a text editor (or importable via the standard cli sql reader), I think it stops being what it aimed to be at birth. In that regard, Base64 is drop-in replacement/alternate for hex and can still work 100%.

I like the suggestion using SQLite DIFF files on ZFS (or similar), maybe writing a full backup weekly, say, and diffs in between - the "lightness" of which will wholly depend on how often existing blobs change.

I have not actually tried using diffs for backup in sqlite - I wonder if you (or someone else maybe) have done this and can say something about how well it works and mention any caveats?

(9) By David Raymond (dvdraymond) on 2021-10-13 14:59:05 in reply to 3 [link]

I'm gonna risk showing my low knowledge and ask something.

For a binary dump of the data, why not just make a straight copy of all the leaf table b-tree pages? (I'm sticking my fingers in my ears and ignoring WITHOUT ROWID tables for now)

Go through each of the tables, and dump out the leaf pages in btree order (not original physical page order). If a leaf has overflow pages, maybe put them in order right after the leaf they overflow from.

We'll be storing sqlite_master, so a loader tool will have the whole schema. All the interior btree pages and all index pages (again, ignoring WITHOUT ROWID for now) are just there for quick search and wouldn't have had any data that needed storing. And when restoring a dump we're just gonna visit each record once and don't need quick searches.

Re-creating the tree part shouldn't be too bad if you have all the leaves, right? You could either decode them one at a time and do a bunch of inserts to the new file, or in a fancier move copy them all at once to the new file, then re-create the interior pages with something akin to create index. The only copied pages you'd have to make updates to would be for new root page numbers in sqlite_master, and overflow page pointers. Right? You'd probably just need a new header to list the start and end pages of each table in the dump.

Just storing the original leaf pages as-is would also let you re-use a bunch of already written code I'd bet.

Admittedly it wouldn't really work for comparing dumps of equivalent data, as how things are distributed between pages and on each page could be different. But as far as a "binary .dump format", why not?

Anyway, let me know how insane I'm talking. And apologies if it really is insane.

(10) By ddevienne on 2021-10-13 15:17:03 in reply to 9 [link]

I'm just as *unknowledged* as you, but I like your plan, David :)

You're still *wasting* the free space in the middle of pages,  
but keeping everything page-aligned has benefits too.

(13) By hitchmanr on 2021-10-13 16:45:00 in reply to 9 [link]

This is probably the highest performing way to implement it, but also the most complex. Even vacuum internally is just a big `INSERT INTO ... SELECT * FROM`.

I think you can emit Leaf Cells instead of Leaf Pages, dumping the btrees as ordered key-value pairs to be recreated by an import process. To stitch it together, you'd create a btree, insert the key-value entries for the btree, and adjust the root page numbers for each table/index in sqlite_schema.

(4) By anonymous on 2021-10-13 10:13:00 in reply to 1 [link]

> each copy has many incidental pointer-level differences depending on the precise page layouts, reducing the effectiveness of deduplicating backup tools

How reproducible are the resulting files if you `VACUUM` them afterwards?

Maybe we could look for other ways of "canonicalizing" an SQLite3 database?

(14) By anonymous on 2021-10-13 19:31:28 in reply to 4 [link]

You can do it in one step with <code>VACUUM INTO</code>.

(7) By Richard Hipp (drh) on 2021-10-13 12:10:01 in reply to 1 [link]

> Is there a way to get a file similar to what .dump outputs in a binary format?

Yes.
The <https://www.sqlite.org/src/file/ext/misc/dbdump.c> file contains a
stand-alone implementation of ".dump".  You can make a copy of that file
and modify it to generate whatever binary output format you desire.

How do you propose to get your binary-dump format back into an SQLite
database?

(12) By hitchmanr on 2021-10-13 16:20:07 in reply to 7 [link]

Thanks, my next idea was to modify the dumper to emit a binary format that's effectively the schema text and then the rows packed together, and writing another tool to load them and do the appropriate schema modifications and bulk insertion.

(20) By anonymous on 2021-10-16 07:30:09 in reply to 12 [link]

Are you sure you won't get better deduplication with backup API + `VACUUM`?

(15.1) By Simon Slavin (slavin) on 2021-10-14 10:10:11 edited from 15.0 in reply to 1 [link]

Your major complaint seems to be that the text representation of BLOBs doubles their length.  I can understand you wanting a custom-written solution for your particular SQLite project.  But years of experience tells me you'll be better off with this solution.

1) Use the existing .dump facility to produce the text dump file
2) Use pkzip or some other command-line facility to compress the dump file.

Not only will this compress the BLOBs, it'll compress everything else too.  I haven't tried it on databases with big BLOBs but it's very good at compressing normal SQLite database dumps.  And it'll do it using standard tools, widely available on many platforms, that you don't have to write and maintain yourself.

(16) By ddevienne on 2021-10-14 11:47:25 in reply to 15.1

With ZLib compressing at around 60MB/s in my tests, for a 60GB DB file,  
dumped to 120GB of SQL text, that's around 2,000s of compression right there...  
On top of the actual .dump itself, of course (unless concurrent if piped maybe).

The advantage of a custom binary dump, of table-pages only (see David's),  
or table-cells only (see hitchmanr's), is that you limit the IO, and don't  
need to decode the record cells (in the first case). But then you are on your  
own the re-assemble the DB though!

You'd need to investigate with [DB Stats](https://www.sqlite.org/dbstat.html) to find out how much you'd save precisely on that DB,  
to see if it's worth it time-wise, against the heavy dev-investment needed.

For a completely different take, one could also use WAL mode and [Litestream](https://litestream.io/blog/why-i-built-litestream/),  
backup'ing the WAL pieces Litestream generates. A very different approach...

PS: Richard mentioned they had activity in that area too, but unfortunately we haven't heard anything since.

(17) By Richard Damon (RichardDamon) on 2021-10-14 12:07:19 in reply to 16 [link]

One thought, the original request seemed to be based on wanting minimal differences in file contents for 'small' changes in the database, so an incremental backup could see just small parts of the file changed.

Any conversion of the database into 'dump' records would cause ALL of the file after any record that changes length or is added/removed to move and thus be a new difference.

My guess is that unless you spend a LOT of effort to make a dump that works to minimize this sort of difference, the raw database may well be a near minimal difference set, as it naturally wants to minimize how much of the file it changes to record differences as disk I/O is the most expensive part of the database access.

(18.1) By Simon Slavin (slavin) on 2021-10-14 14:33:26 edited from 18.0 in reply to 17 [link]

In that case, instead of using <code>.dump</code>, the OP should be using

<https://www.sqlite.org/sqldiff.html>

and then compressing the result.  But the analysis phase of <code>sqldiff</code> might take a long time.

(19) By Richard Damon (RichardDamon) on 2021-10-14 14:46:57 in reply to 18.1 [link]

The full answer might be to pull a copy like they are currently seeming to do, and the rather than backup that copy, do the diff and back that up. The other side of the backup then might need to apply the diff, to its copy.