The craziest thing I ever used SQLite for: partial file deduplication

(1) By anonymous on 2022-07-25 12:00:39

The Apple File System has some support for data deduplication.
You can create a copy-on-write clone of a file, using no extra disk space
(until you start modifying either the original or the clone).

I wanted to use this for my games volume.  Multiple games using the same
game engine often have many support files in common.  Additionally,
multiple versions of the same game often have many game-specific files
in common.

Keeping track of this is not difficult.  You simply need to maintain
a database with a file table having size and content hash columns.
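
For whole-file matches, finding a reuse candidate is then a single indexed
lookup.  A minimal sketch, assuming `size` and `hash` columns as in the
simpler schema shown near the end of this post (the parameter names here are
placeholders):

```
    select fileid
        from file
        where hash = :hash
          and size = :size;
```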

But what about partial matches?  For example, if you append files to a ZIP
archive, the old and new versions of the archive will have a common prefix.
When installing the new version, you could make a clone of the old one,
then copy only the non-matching parts from the source.

Keeping track of **this** gets more elaborate.  You need to store hashes
of each allocation block of each file.  Here's the final design:

```
    create table file (
        fileid integer
            primary key,
        size integer
            not null,
        moddate integer
            not null
    );

    create table block (
        fileid integer
            not null
            references file (fileid)
                on update restrict
                on delete cascade,
        blockix integer
            not null,
        hash integer
            not null,

        primary key (fileid, blockix)
    ) without rowid;

    create index block_hash_ix
        on block (hash, blockix);

    create temporary table newblock (
        blockix integer primary key,
        hash integer not null
    );
```

There is no file path, because macOS lets you look up files by id.
The hash values are cryptographic hashes truncated to 64 bits and
reinterpreted as integers.  I picked the BLAKE3 hash function because
its standard implementation has a convenient interface for truncated
and extended results.

The newblock table is used when adding a new file.  First, fill it up with
hashes computed from the source.  Then, find which existing file has the
most blocks in common with the new file **and** the index of each such
block:

```
    with similar (blockix, fileid) as
        (select blockix, fileid
            from newblock
                natural join block)
    select fileid, blockix from similar
        where fileid=
            (select fileid from similar
                group by fileid
                order by count(*) desc
                limit 1)
        order by blockix;
```
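
Once the new file has actually been written (clone the best match, then
overwrite only the differing blocks), its block hashes move from the
temporary table into the permanent ones.  A sketch of that bookkeeping,
assuming the driver program binds `:fileid`, `:size` and `:moddate`
(parameter names are mine, not from the original setup):

```
    -- record the new file itself
    insert into file (fileid, size, moddate)
        values (:fileid, :size, :moddate);

    -- adopt the freshly computed block hashes
    insert into block (fileid, blockix, hash)
        select :fileid, blockix, hash
            from newblock;

    -- clear the staging table for the next file
    delete from newblock;
```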

The similarity query above demonstrates the importance of analyzing the database.
Without statistics, things get painfully slow as the database grows:

```
    QUERY PLAN
    |--SEARCH block USING PRIMARY KEY (fileid=?)
    |--SCALAR SUBQUERY 2
    |  |--SCAN block
    |  |--SEARCH newblock USING INTEGER PRIMARY KEY (rowid=?)
    |  `--USE TEMP B-TREE FOR ORDER BY
    `--SEARCH newblock USING INTEGER PRIMARY KEY (rowid=?)
```
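
The statistics come from running ANALYZE once the tables have some
representative content; `PRAGMA optimize` is a convenient way to keep them
reasonably fresh:

```
    analyze;

    -- or, periodically / before closing the connection:
    pragma optimize;
```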

With statistics, things remain bearable:

```
    QUERY PLAN
    |--SEARCH block USING PRIMARY KEY (fileid=?)
    |--SCALAR SUBQUERY 2
    |  |--SCAN newblock
    |  |--SEARCH block USING COVERING INDEX block_hash_ix (hash=? AND blockix=?)
    |  |--USE TEMP B-TREE FOR GROUP BY
    |  `--USE TEMP B-TREE FOR ORDER BY
    `--SEARCH newblock USING INTEGER PRIMARY KEY (rowid=?)
```

So, does all this complexity actually save any disk space?  First, the
database tracking my games volume uses 729587 allocation blocks itself.
Here's a summary extracted from it:

| match   | file count | total blocks | saved blocks |
|---------|-----------:|-------------:|-------------:|
| full    |     259924 |      9485545 |      9485545 |
| partial |      35340 |     10551328 |      1615241 |
| none    |     214368 |     53298995 |            0 |

The simpler alternative approach would use a schema like this:

```
    create table file (
        fileid integer
            primary key,
        size integer
            not null,
        moddate integer
            not null,
        hash integer
            not null
    );

    create index file_hash
        on file (hash);
```

That database would use 5314 allocation blocks.

In summary, the complex approach saves 1615241 extra blocks,
using 729587 - 5314 = 724273 more blocks for its database.

1615241 > 724273.  Win!

(2) By Richard Hipp (drh) on 2023-03-27 10:12:01 in reply to 1 [link]

Further discussion has sprung up at
<https://news.ycombinator.com/item?id=35317419>

(3) By anonymous on 2023-03-27 14:31:21 in reply to 2 [link]

Since somebody wondered how it all connects, and I'm too lazy to create an account over there: a Perl program uses the database to perform the equivalent of `cp -R -p`.