SQLite Forum

7 forum posts by user hitchmanr

2021-10-13
16:52 Edit reply: binary .dump format (artifact: 43dc3c1f82 user: hitchmanr)

Compression tends to reduce the effectiveness of deduplication, since the compressed versions of similar files have very different byte sequences. I'd prefer to avoid the extra overhead of compression altogether.

There are some attempts to improve it (like gzip --rsyncable), so this is still worth a try. Even better, zstd has a --rsyncable flag, so let's try:

$ sqlite3 ext.db .dump | zstd --rsyncable -o ext.dump.zstd
36.63%   (4217978198 => 1544940304 bytes, ext.dump.zstd) 
$ time sqlite3 posts.db .dump | zstd --rsyncable -o posts.dump.zstd
51.47%   (98522848018 => 50713380137 bytes, posts.dump.zstd) 
real	31m16.458s

This produces a compressed dump that's 20-50% smaller than the input DBs (!), partly thanks to the indexes being omitted from the dump. However, sqlite3 .dump maxes out a CPU core and only produces output at ~50 MB/s, which is much slower than a vacuum operation. I'll test again after 24 hours to see how big the incremental is, but this is probably the best solution without requiring additional code -- just somewhat slow thanks to all the extra text formatting operations.

Edit: actually, this is even better than I realized. In WAL mode the .dump reader doesn't block writers, so spending 30 minutes generating a dump isn't a real problem!
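For reference, a minimal sketch of that workflow, assuming the database isn't already in WAL mode (file names here are just placeholders):

$ sqlite3 ext.db 'PRAGMA journal_mode=WAL;'                              # one-time switch; the mode is persisted in the database file
$ sqlite3 ext.db .dump | zstd --rsyncable -o ext.$(date +%F).dump.zstd   # writers on other connections keep going while this runs

One caveat: while the dump's read transaction is open, checkpoints can't advance past its snapshot, so the -wal file will keep growing until the dump finishes.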

16:45 Reply: binary .dump format (artifact: 4d0e0cb4c7 user: hitchmanr)

This is probably the highest-performing way to implement it, but also the most complex. Even VACUUM internally is just a big INSERT INTO ... SELECT * FROM.

I think you could emit leaf cells instead of leaf pages, dumping the btrees as ordered key-value pairs to be recreated by an import process. To stitch it back together, you'd create a btree for each table/index, insert its key-value entries, and then adjust the root page numbers in sqlite_schema to point at the new btrees.
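To make that last bookkeeping step concrete, here's a rough sketch of how an import tool might repoint a table at a freshly built btree; the database name, table name t1, and root page number 42 are all hypothetical:

$ sqlite3 rebuilt.db <<'SQL'
PRAGMA writable_schema=ON;
-- repoint the schema entry for t1 at the btree root the importer just built (42 is made up)
UPDATE sqlite_schema SET rootpage=42 WHERE type='table' AND name='t1';
PRAGMA writable_schema=OFF;
SQL

Building the btree contents in the first place would still have to go through SQLite itself (or write pages directly in the file format), so this is only the final stitch, not the bulk of the work.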

16:27 Edit reply: binary .dump format (artifact: 5a737662d2 user: hitchmanr)

Compression tends to reduce the effectiveness of deduplication, since the compressed versions of similar files have very different byte sequences. I'd prefer to avoid the extra overhead of compression altogether.

There are some attempts to improve it (like gzip --rsyncable), so this is still worth a try. Even better, zstd has a --rsyncable flag, so let's try:

$ sqlite3 ext.db .dump | zstd --rsyncable -o ext.dump.zstd
/*stdin*\            : 36.63%   (4217978198 => 1544940304 bytes, ext.dump.zstd) 

This produces a compressed dump that's around 1/4 the size of one of my input DBs (!), but sqlite3 .dump maxes out a CPU core and only produces output at ~50MB/s, which is much slower than a vacuum operation. I'll test again after 24 hours to see how big the incremental is, but this is probably the best solution without requiring additional code-- just somewhat slow thanks to all the extra text formatting operations.

Edit: actually, this is even better than I realized. In WAL mode the .dump reader doesn't block writers, so spending 30 minutes generating a dump isn't a real problem!

16:22 Edit reply: binary .dump format (artifact: 6e90374d3e user: hitchmanr)

Compression tends to reduce the effectiveness of deduplication, since the compressed versions of similar files have very different formats. I'd prefer to avoid the extra overhead of compression altogether.

There are some attempts to improve it (like gzip --rsyncable), so this is still worth a try. Even better, zstd has a --rsyncable flag, so let's try:

$ sqlite3 ext.db .dump | zstd --rsyncable -o ext.dump.zstd
/*stdin*\            : 36.63%   (4217978198 => 1544940304 bytes, ext.dump.zstd) 

This produces a compressed dump that's around 1/4 the size of one of my input DBs (!), but sqlite3 .dump maxes out a CPU core and only produces output at ~50MB/s, which is much slower than a vacuum operation. I'll test again after 24 hours to see how big the incremental is, but this is probably the best solution without requiring additional code-- just somewhat slow thanks to all the extra text formatting operations.

Edit: actually, this is even better than I realized. In WAL mode the .dump reader doesn't block writers, so spending 30 minutes generating a dump isn't a real problem!

16:20 Reply: binary .dump format (artifact: 70514f1b20 user: hitchmanr)

Thanks, my next idea was to modify the dumper to emit a binary format that's effectively the schema text followed by the rows packed together, and to write another tool that loads them, applies the appropriate schema modifications, and does the bulk insertion.
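As a very rough sketch of what the load side might look like, assuming the loader decodes the packed rows back into inserts (the file name and pragmas here are illustrative, not a worked-out design):

$ sqlite3 restored.db <<'SQL'
PRAGMA journal_mode=OFF;     -- one-shot load, so skip the rollback journal entirely
PRAGMA synchronous=OFF;      -- acceptable here: a failed restore can simply be rerun
BEGIN;
-- the schema text from the dump would be replayed here,
-- followed by the decoded rows as bulk INSERTs in this one big transaction
COMMIT;
SQL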

16:15 Reply: binary .dump format (artifact: 5651e6d221 user: hitchmanr)

Compression tends to reduce the effectiveness of deduplication, since the compressed versions of similar files have very different formats. I'd prefer to avoid the extra overhead of compression altogether.

There are some attempts to improve it (like gzip --rsyncable), so this is still worth a try. Even better, zstd has a --rsyncable flag, so let's try:

$ sqlite3 ext.db .dump | zstd --rsyncable -o ext.dump.zstd
/*stdin*\            : 36.63%   (4217978198 => 1544940304 bytes, ext.dump.zstd) 

This produces a compressed dump that's around 1/4 the size of one of my input DBs (!), but sqlite3 .dump maxes out a CPU core and only produces output at ~50MB/s, which is much slower than a vacuum operation. I'll test again after 24 hours to see how big the incremental is, but this is probably the best solution without requiring additional code-- just somewhat slow thanks to all the extra text formatting operations.

05:10 Post: binary .dump format (artifact: b15e0ae5c8 user: hitchmanr)

Is there a way to get a file similar to what .dump outputs, but in a binary format? The hex encoding of binary blobs doubles the output size for my use case.
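(To illustrate the blowup: the database and table names here are throwaway examples.)

$ sqlite3 blobdemo.db "CREATE TABLE t(b BLOB); INSERT INTO t VALUES(randomblob(8));"
$ sqlite3 blobdemo.db .dump | grep "INSERT INTO"

The resulting INSERT line spells the 8-byte value as a 16-character X'...' hex literal, plus the SQL text around it, which is where the roughly 2x overhead comes from.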

I'd like to back up a large (~60GB) database with minimal downtime. .backup makes a copy of the database with an acceptable locking period, but each copy has many incidental pointer-level differences depending on the precise page layouts, which reduces the effectiveness of deduplicating backup tools that match longer runs of content.

The dump format is much better for deduplication, apart from the unnecessary extra size from text formatting and the parsing overhead required to import it. A binary dump format would be ideal, perhaps reusing the record format SQLite already uses internally.