How much would you trust a page-level checksum ?
I've been using a version of SQLite with the checksum VFS recently
and have been musing: how much would I trust a checksum ?
SQLite goes to some lengths to avoid writing data which hasn't changed. It does this on a row-by-row basis, and that involves extra processing. And the processing doesn't catch every possibility: the checks are performed only under certain circumstances. But the checksum VFS would allow SQLite to detect, at the page level, whether it's necessary to rewrite a page.
There are advantages to this: most databases have more than one row per page, so I think less processing would be required. Both tables and indexes are stored in pages, and this technique would allow SQLite to notice when a table page needed an update but the index page did not. Or /vice versa/. But I don't know enough about how SQLite works internally to evaluate this. Or to know whether it would be necessary or convenient to read the checksum of a page being replaced. I'd appreciate comments.
And the most important thing: given the size of the checksum used, what's the risk that a change in a page's contents would leave the checksum unchanged ?
Cksumvfs does not (by default) use a cryptographic checksum like sha3. That means that a malicious actor could probably devise some changes that would alter the content of a page without changing the checksum. It would take some amount of cleverness to do this, but it is doable.
Given the constraint above, the chances of a random change (such as a cosmic-ray induced bit-flip on the mass storage device) going undetected by the checksum are very small. It's a 64-bit checksum, after all.
A single bit error is certain to change the checksum, no matter its length.
It is double bit errors that bring the checksum length into play. Because checksums are usually calculated by summing over a fixed word width, a second bit error will cancel the first one's effect on the checksum if it lands in the same bit position within the sum word and flips with the opposite polarity.
One might think that cosmic rays (or alpha particle hits) would flip a dynamic memory storage cell the same direction all the time. But cell readout methods do not consistently treat a charged or discharged cell as a 1 or 0 respectively. So opposite polarity errors are as likely as same polarity errors.
So, what are the odds of an undetected error? To first order, they are:

(probability of one bit error over the block checked)
× (probability of another bit error over the block checked)
× 1 / (Nsum_bits × 2)

(The factor of 1/Nsum_bits is the chance that the second flip lands in the same bit position within the sum word; the factor of 1/2 is because only half of those double bit errors "fix" the checksum.)
Of course, those bit error probabilities depend on how long the block is allowed to linger in a set of teensy capacitors.
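The cancellation described above is easy to demonstrate. Here is a minimal sketch using a plain word-sum checksum (an illustrative stand-in, not the algorithm cksumvfs actually uses): two opposite-polarity flips in the same bit position of two different words leave the sum unchanged.

```python
def word_sum_checksum(data: bytes, word_bytes: int = 8) -> int:
    """A plain word-sum checksum: add the block up as little-endian words."""
    mask = (1 << (8 * word_bytes)) - 1
    total = 0
    for i in range(0, len(data), word_bytes):
        total = (total + int.from_bytes(data[i:i + word_bytes], "little")) & mask
    return total

block = bytearray(64)
block[0] = 0x01                      # bit 0 of word 0 starts out set
before = word_sum_checksum(bytes(block))

block[0] = 0x00                      # first error: bit 0 of word 0 flips 1 -> 0
block[8] = 0x01                      # second error: bit 0 of word 1 flips 0 -> 1
after = word_sum_checksum(bytes(block))

print(before == after)               # True: the two flips cancel in the sum
```

The data has changed, but the checksum has not: exactly the double-bit weakness described above.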
Bit errors in persistent storage media are a different problem. There, for modern devices (recent decades), error detection and correction methods are used which do not have the above weakness. As I recall, the errors that slip past the many-bit ECC methods are virtually always multi-bit errors, for which a checksum delivers close to the ideal 1 / 2^Nsum_bits miss probability.
Okay, folks. Thank you for your answers. The most significant points in your answers are that the checksum algorithm, like all good checksum algorithms, is sensitive to position: swapping around two different octets, or the values of two different rows, changes the checksum. Also, any attempt to fool the checksum algorithm would be ridiculously difficult without reading the raw database file.
So let us suppose I trust that a 64-bit checksum (either the one already used, or a better one) is good enough for my purpose. Now I ask the remaining question.
SQLite currently has code which stops it from writing records which haven't changed to the file. Suppose it checked for changes using a page-level checksum instead. Would that be faster, and/or lead to a smaller codebase, than the current checks which are carried out at the row/field level ? I'm prepared for answers like "Sometimes yes, sometimes no." or "Impossible to tell without writing the code, and we don't have the time. We look forward to seeing your results."
One sacrifice would be that you'd have to take the time to calculate a checksum for every page written, and that database files would be a little longer. Another is that SQLite would occasionally have to write two pages instead of one, so SQLite would be slower. One advantage would be that since every page would have a checksum, it would be trivial to check those checksums every time a page was read, and thereby spot page-level corruption of a database file. Another is that, under normal circumstances, a page contains data for more than one row, so fewer checks would need to be done, so SQLite would be faster.
I don't know how complicated and time-consuming it is for SQLite to check for pointless rewrites. Or what proportion of pointless rewrites the code currently catches. Thoughts welcome.
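To make the idea concrete, here is a hypothetical sketch of the scheme being proposed (this is not SQLite's actual code; the helper name `maybe_write_page` is invented, and CRC-32 is used only as a convenient stand-in checksum): keep a checksum per page, and skip the write when the checksum of the new contents matches.

```python
import zlib

def maybe_write_page(stored: dict, page_no: int, new_data: bytes) -> bool:
    """Hypothetical helper: return True if the page was (notionally) written."""
    new_sum = zlib.crc32(new_data)
    if stored.get(page_no) == new_sum:
        return False             # contents (almost certainly) unchanged; skip
    stored[page_no] = new_sum    # record the new checksum and "write" the page
    return True

sums = {}
assert maybe_write_page(sums, 1, b"hello") is True    # first write happens
assert maybe_write_page(sums, 1, b"hello") is False   # unchanged, skipped
assert maybe_write_page(sums, 1, b"world") is True    # changed, written
```

Note the "almost certainly": as discussed elsewhere in this thread, a matching checksum does not prove the contents are identical, so a real implementation would have to decide whether that residual risk is acceptable or fall back to a full comparison.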
Is computing one checksum, and comparing it to another faster than just comparing two sequences? That depends on how fast the checksum algorithm is, and the relative speeds of computation and sequential memory access. I'm sure people can find systems where either one wins.
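Which one wins really is system-dependent, but it is easy to measure for yourself. A rough micro-benchmark sketch (CRC-32 is used here purely as an example checksum; absolute numbers will vary by machine and by algorithm):

```python
import os
import time
import zlib

# One 4096-byte "page" and an identical copy of it.
page_a = os.urandom(4096)
page_b = bytes(page_a)

t0 = time.perf_counter()
for _ in range(100_000):
    same = (page_a == page_b)        # direct sequential comparison
t1 = time.perf_counter()
for _ in range(100_000):
    c = zlib.crc32(page_a)           # compute a checksum instead
t2 = time.perf_counter()

print(f"compare: {t1 - t0:.3f}s  crc32: {t2 - t1:.3f}s")
```

On most systems the plain comparison wins for data already in memory, since it can stop at the first difference and needs no arithmetic per byte; the checksum only pays off when one side of the comparison isn't available (e.g. the old page is on disk but its checksum is in memory).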
Even if you get around that issue, I believe a 64-bit checksum is too small.
Birthday paradox says (roughly) that given random numbers in 0 ... N, you should expect duplicates to show up when you have roughly sqrt(N) numbers.
For the smallest SQLite page size (512 bytes), that "roughly" shows up at a couple of terabytes. That is too small for comfort, and one reason UUIDs aren't only 64 bits.
I think you'd want to use a good 128 bit checksum, instead.
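The 50% threshold mentioned above can be made concrete with the standard birthday-bound approximation: among random k-bit values, a collision becomes 50% likely at roughly sqrt(2 · 2^k · ln 2) ≈ 1.18 · sqrt(2^k) values. A quick sketch:

```python
import math

def birthday_n_for_half(bits: int) -> float:
    """Approximate count of random `bits`-bit values for a ~50% collision chance."""
    n_values = 2.0 ** bits
    return math.sqrt(2.0 * n_values * math.log(2.0))

print(f"{birthday_n_for_half(64):.3g}")    # roughly 5e9 values
print(f"{birthday_n_for_half(128):.3g}")   # roughly 2e19 values
```

For 64 bits the threshold is around five billion values; for 128 bits it is on the order of 10^19, which is why the jump to 128 bits buys so much comfort.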
As to whether it is better to check for changes as you update a row or field, or to just wait until commit time and check the entire page my guess is:
Doing checks at the page level is simpler to code, if your only concern is to avoid writing unchanged pages. If you also want to avoid updating indexes that refer only to unchanged fields (I don't know if sqlite does that optimization), I'd guess that waiting to see if the page is dirty before updating indexes makes things trickier.
I suspect that page-level checks (with or without a checksum) are a performance boost only if it is very common for a page to get transactions where a sequence of update/delete/insert results in no net change to the page.
I do see a value in checksums, just to catch accidents or bugs. I just don't believe they are likely to be much of a speedup for learning whether or not a bunch of small changes resulted in any net change.
(6.1) By Keith Medcalf (kmedcalf) on 2021-07-12 21:55:34 edited from 6.0 in reply to 5 [link] [source]
The purpose of a "checksum" or "cyclic redundancy code" is not to ensure that the data upon which it is computed is "the same" as some other set of data which also yields the same "checksum" or "cyclic redundancy code", but rather to give an "indication" that the data upon which the code was computed is unlikely to have changed BY ACCIDENT from when the code was computed.
The properties of a "secure hash" are exactly the same, the only real difference being the length and complexity of the calculation of the code.
If you generate a code B on data A (with B having fewer bits than A), and also a code C on data D (with C having fewer bits than D), where the bit lengths of B and C are the same and the bit lengths of A and D are the same, then B == C does not imply A == D. (The converse does hold: A == D guarantees B == C.)

Any conclusion that because B == C therefore A == D is merely happenstance, not a certainty.
If B == code(A), then you might, with extreme caution, posit that A is unchanged from when the original code(A) was calculated. The real value is that if B <> code(A), then you know with absolute certainty that the value of A has changed.
In other words, no "secure hash code" nor "cyclic redundancy check" nor "checksum" can prove that the data on which it has been computed has not changed but rather only that it has changed.
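That asymmetry is easy to see with a deliberately tiny checksum. A sketch using a one-byte sum (chosen only so that a collision is trivial to exhibit):

```python
def sum8(data: bytes) -> int:
    """A one-byte additive checksum: sum of all bytes, modulo 256."""
    return sum(data) % 256

a = b"\x01\x02"
d = b"\x02\x01"

# Equal codes do NOT prove equal data: these two blocks collide.
assert sum8(a) == sum8(d) and a != d

# But unequal codes DO prove the data changed.
assert sum8(b"\x01\x03") != sum8(a)
```

A 64-bit checksum makes the first case astronomically rarer, but the logic is identical: the code can only ever prove difference, never sameness.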
(7.1) By Larry Brasfield (larrybr) on 2021-07-12 23:12:18 edited from 7.0 in reply to 5 [link] [source]
I believe a 64-bit checksum is too small.
Birthday paradox says (roughly) that given random numbers in 0 ... N, you should expect duplicates to show up when you have roughly sqrt(N) numbers.
With my trusty calculator1, I worked out how often a random collection of bytes would have the same 64-bit checksum as another random collection. Within 10%, if today's population on Earth were to toss the figurative dice to get another random collection and checksum every second, then among the whole population we could expect to see a match about every 74 years.
I'm not seeing how the Birthday Paradox applies to such infinitesimal odds. But I am not going to be sneering at those who settle for 64-bit check words. Somewhere in the assessment of diminishing return on investment, one should account for meteor strikes or man-made big events.
- 2^64 / (24*60*60 * 365) / 7.9e9
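The footnote arithmetic is easy to reproduce: 2^64 values, consumed at one per second per person by 7.9 billion people, last about 74 years.

```python
seconds_per_year = 24 * 60 * 60 * 365   # ignoring leap years
population = 7.9e9                       # rough world population

years = 2**64 / seconds_per_year / population
print(round(years))  # 74
```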
(9.1) By Bill Wade (billwade) on 2021-07-13 13:11:57 edited from 9.0 in reply to 7.1 [link] [source]
You calculated the total number of checksums, and divided that by the time it would take to generate that number of checksums.
It is certainly possible that you could generate N=2^64 checksums and not get a single duplicate, just as it is possible that given a random collection of month-day pairs, you could have 365 with no duplicates.
However, https://en.wikipedia.org/wiki/Birthday_problem says that the probability of a duplicate exceeds 50% once you have 23 random dates.
Above I used 23 as (roughly) sqrt(365). Admittedly that was very rough.
Edit (really bad math in the previous version):
That web page says you have a 50% chance of at least one collision among 64-bit random numbers once you have generated 5.1e9 of those numbers, which is before every person in the world has rolled the dice even once.
(10.1) By Larry Brasfield (larrybr) on 2021-07-13 13:55:25 edited from 10.0 in reply to 9.0 [link] [source]
Bill, I have long had a policy against getting into calendar arguments or arithmetic arguments. I was tempted to add statistics and probability arguments to that list and ignore your (mistaken) contention.
But that square-root assertion just sticks in my craw, and using it or the Birthday Paradox to show me wrong is just too much. So, here is some SQL to calculate the correct1 odds of no shared birthdays among 23 people (ignoring the complexity of some people being born on February 29):
select exp(sum(ln_odds_against)) as group_odds_against from (
  with recursive SharedBirthdayShouters(bfn)
    as (values(1) union all
        select bfn+1 from SharedBirthdayShouters where bfn < 23)
  select ln(1.0 - (bfn-1)/365.0) as ln_odds_against
    from SharedBirthdayShouters
);
You should notice that there are no square roots anywhere in that calculation. And the significance of "23^2 has similar magnitude as 365" completely escapes me.
There is a much more fundamental problem with your dismal view of checksum failure odds (and your contradiction of my assertion about them). The Birthday Paradox simply does not apply. If we imagine all X billion people somehow able to discover that they got the same random checksum as anybody else did as they collected some number of them, then you could use the same arithmetic that explains that pseudo-paradox to predict the odds. But that is not the arithmetic that predicts how often any one of X billion people will see a repeat of a checksum given once to each of them at the start of those 74 years, each looking only within their own series. The birthday collision is about matching among a set of limited-range random numbers, not about matching a single one.
If you want to prove yourself right about the chances of the same checksum occurring among a fixed number of randomly created ones, just plug power(2,64) in place of 23 in that query. But you should not claim that answer contradicts my much simpler (and faster) probability calculation.
- If 23 students are asked to speak their birthdays in turn, and all are effectively compelled to shout "That's mine too!" if they hear their birthday spoken (by another student), what is the probability that none of them will so shout?
The probability of that shout not happening as the 1st one speaks is (1 - 0/365).
The probability of that shout not happening as the 2nd one speaks is (1 - 1/365).
The probability of that shout not happening as the Nth one speaks is (1 - (N-1)/365).
The probability that the whole group will finish speaking without so shouting is the product of the probabilities at each step.
Since there is no built-in aggregate product function, I simulate it by taking the natural log at each step, summing those over the ensemble, then taking the inverse natural log of the sum.
For the 23 students, the probability of no shouts is about 0.49, as anybody can see using a recent version of the SQLite shell.
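The same product can be checked directly outside SQL. A quick sketch of the per-speaker probabilities multiplied together for 23 speakers:

```python
# Probability that 23 students all announce distinct birthdays:
# the product of per-step "nobody shouts" probabilities.
p = 1.0
for n in range(1, 24):           # the n-th speaker
    p *= 1.0 - (n - 1) / 365.0

print(f"{p:.3f}")  # 0.493
```

That is the well-known birthday-problem figure: just under half, which is why 23 people give a better-than-even chance of a shared birthday.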
For Simon's original question:
"what's the risk that a change in a page's contents would leave the checksum unchanged ?"
I agree that the birthday paradox does not apply when you are considering only two pages (before and after). Accidental collisions would be exceedingly rare for any reasonable definition of a good 64-bit checksum. As others have pointed out, checksums can reliably tell you that two things are different, and if "exceedingly rare" isn't good enough for you, do the full sequence comparison to determine whether they are indeed the same.
However, the subject has drifted a bit. For malicious attackers, I'd worry that it wasn't enough. I'd worry about any 64-bit "digital signature." The birthday paradox isn't directly applicable, but it, along with knowledge of the checksum internals, is a tool for the attacker, and the smaller N gets, the more useful such tools become.
The birthday paradox really kicks in when you start comparing the checksum of every page in the database (or across databases), perhaps to reduce duplication. There the numbers start to get large enough to "reasonably" expect 64-bit checksums to match when the contents don't, even in the absence of bugs and malicious actors.
I probably extrapolated Simon's original post a bit too far. My apologies.
(8.1) By Keith Medcalf (kmedcalf) on 2021-07-12 23:41:51 edited from 8.0 in reply to 5 [link] [source]
Actually, each unique value of the 8-byte checksum represents 2^(8*4088 - 64) = 2^32640 different 4088-byte-long block values.

So if the 8-byte "hash code" matches, then you have one of the 2^32640 4088-byte blocks that will generate that code.

However, if the 8-byte recorded hash does not match the 4088-byte block from which it was generated, you can be absolutely certain that the block of 4088 bytes is not any of the 2^32640 sequences that would generate that value.
NB this is assuming an "equal distribution" -- the better the distribution, the better (or more secure, to use the accepted parlance) the generated hash. That is why SHA2-512 is better than a parity byte -- simply because it is longer.
Reining in the current mathematical discussion, I'm still interested in my original question. Would it take less code, or be faster, for SQLite to test for pointless re-writes at the page level rather than the field level as it does now ?
(13) By Richard Damon (RichardDamon) on 2021-07-14 13:42:34 in reply to 12 [link] [source]
I am fairly sure that it is quicker to compare the two blocks than to compute the checksum, at least if the checksum is stored as a piece of the block on disk, since the old data needs to be read in anyway.
(14) By Scott Robison (casaderobison) on 2021-07-14 21:08:09 in reply to 12 [link] [source]
I don't know enough about exactly how SQLite is doing the field comparison. Presumably the existing data is in the cache, so it can do a byte for byte comparison. If it is a "typical string" (smallish) or an integer or double, then comparing the fields is probably much faster. If the field is a large text or blob, then perhaps comparing the previously computed hash of a block in the cache to a freshly computed hash of a to be written block would be faster.
This introduces the added complexity of what to do with a really large text or blob that spans two or more pages.
Ultimately, the size of the changes is going to be the driver of whether page based hashing or field based comparisons are more efficient, and the tipping point will depend on the complexity of the hashing algorithm used.