Blob vs. Text for primary keys
(1) By Michael Hofmann (mthbgs) on 2021-03-04 15:02:36 [link] [source]
Hello,
I am working on an application that needs to store a large number of binary blobs, each one indexed by its SHA256 hash. The basic idea is to store blobs like this:
CREATE TABLE blob ( hash BLOB PRIMARY KEY, content BLOB NOT NULL );
Here, each entry in the "hash" column would entail 32 bytes containing the raw SHA256 hash. Alternatively, I've been thinking about storing a hex representation of the hash:
CREATE TABLE blob ( hash TEXT PRIMARY KEY, content BLOB NOT NULL );
I have experimented with both variants, and so far it seems the only difference is, as expected, a somewhat larger database size for the second version. What I would like to know: Is there is any functional difference, in particular with regard to performance, between BLOB and TEXT as a primary key?
(2) By Richard Hipp (drh) on 2021-03-04 16:00:22 in reply to 1 [source]
Both approaches should work fine. Storing the hash as a BLOB might be very slightly faster, since (as you observe) there is less content, hence less file I/O.
The Fossil version control system does something very much like this. But it stores the hash as text rather than as a blob. Performance is not an issue, and text is easier for developers to deal with when debugging.
(3) By Michael Hofmann (mthbgs) on 2021-03-04 18:45:33 in reply to 2 [link] [source]
Thank you for the clarification, that was very helpful.
And thank you for the link to the technical overview. My own use case is quite different from a VCS, but its a fascinating read nevertheless.