Generate SHA for a table
(1.3) By adrian on 2022-10-15 16:55:27 edited from 1.2 [link] [source]
Hi all,
I've got a massive (multi-GB) database and I need to verify its content. Previously a binary SHA of the whole file was taken, but applying a delta could leave the final db with a different binary layout. This seems to be due to, among other things, the delta being large and not completing in one go, so the application had to resume applying it on its next run, and the db's unused data (rows/pages marked unused) ended up differing depending on where the delta apply stopped and restarted.
A content SHA algorithm was then written for a table, but it is significantly slower (by about 10x): it first fetches the table with a SELECT statement, then goes row by row, queries the contents column by column, and feeds all of that data into a running SHA. That process has a lot of overhead.
I've noticed that there was a SHA1 extension added around 2017 that provides the functions sha1() and sha1_query(). This might have potential for generating the verification that I need, but it doesn't appear to be available by default and I can find very little documentation on the web.
Would using those functions be the best approach to generating a SHA1? How would I enable them, and how would I implement a start/stop/resume mechanism with them? If not, what would be the best way to generate a SHA1 for a table with the ability to stop and resume at a moment's notice?
Thx
(2) By Richard Hipp (drh) on 2022-10-15 18:10:49 in reply to 1.3 [link] [source]
In the CLI you have the extension function named "sha3_query()". Its argument is an SQL SELECT statement, and the result is a SHA3-256 hash of the rows that SELECT returns. So:
SELECT sha3_query('SELECT * FROM table_to_hash');
(3) By Aask (AAsk1902) on 2022-10-15 18:29:45 in reply to 2 [link] [source]
Do you need to Hex the result?
SELECT hex(sha3_query('SELECT * FROM employees'));
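If I recall the shathree extension correctly, both functions also take an optional second argument selecting the digest size (224, 256, 384 or 512 bits; 256 is the default), and there is a plain sha3() for hashing a single value, e.g.:
SELECT hex(sha3('hello world'));                         -- SHA3-256 of one value
SELECT hex(sha3_query('SELECT * FROM employees', 512));  -- 512-bit digest of the result set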
(4.1) By adrian on 2022-10-16 01:47:24 edited from 4.0 in reply to 3 [link] [source]
Nice! Thx!
However, if this can take a while, then I need to know how I can do this in a way that would allow for stopping and restarting.
(5) By anonymous on 2022-10-16 02:24:51 in reply to 4.1 [link] [source]
Adding to Dr. Hipp's suggestion: you may want an ORDER BY clause on the SELECT to make sure that you always hash in the same sequence. You can test that with the reverse_unordered_selects pragma.
Then you can do ranges with LIMIT, OFFSET, etc.
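A rough sketch of that idea (hypothetical column names; later replies discuss better ways to chunk than OFFSET):
-- hash the table in fixed-size, deterministically ordered chunks
SELECT hex(sha3_query('SELECT * FROM table_to_hash ORDER BY id LIMIT 100000 OFFSET 0'));
SELECT hex(sha3_query('SELECT * FROM table_to_hash ORDER BY id LIMIT 100000 OFFSET 100000'));
-- ...record each chunk's digest (or a running hash of the digests) so a later run can resume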
(6) By adrian on 2022-10-16 03:18:42 in reply to 4.1 [link] [source]
Oh wow! This was able to produce a SHA3 for all the tables in a 3 GB DB in under 3 min. Hopefully this is fairly linear and I won't have to worry about stopping and restarting.
THANKS for all the help!
(7) By ddevienne on 2022-10-17 07:21:03 in reply to 2 [link] [source]
There's also the (probably older, since it is SHA1-only) dbhash utility.
(8) By Sunny Saini (SunnySaini_com) on 2022-10-17 12:27:52 in reply to 7 [link] [source]
I'm unable to use dbhash.exe, as I am not a programmer and thus can't compile it.
I request that the developers ship a dbhash.exe binary along with the CLI.
(9) By adrian on 2022-10-17 12:35:56 in reply to 8 [link] [source]
Good to know. Thx.
(10) By adrian on 2022-10-17 12:42:52 in reply to 5 [source]
Yes, sorting and chunking is probably the only way to do that. I just fear that chunking will result in redoing work already done: you would sort and limit on the first N rows and hash that, then you would have to sort and get the 2nd N rows, wash, rinse and repeat. Wouldn't that sorting result in redoing a lot of work? Also, to get to the kth N rows, you would have to go through the 0..(k-1)th N rows, again repeating a lot of work. Can SQLite somehow cache this information so as not to have to repeat the work already done?
(11) By adrian on 2022-10-17 12:46:35 in reply to 3 [link] [source]
So, I was wondering, is there any reason why a sha2 was never implemented? I've read that sha3 has issues with collisions.
(12) By Richard Hipp (drh) on 2022-10-17 13:00:57 in reply to 11 [link] [source]
I've read that sha3 has issues with collisions.
Where did you read that?
(13) By Gunter Hick (gunter_hick) on 2022-10-17 13:01:52 in reply to 10 [link] [source]
You need to do the sorting/chunking by an index so that you can remember the last processed row and quickly find the starting point:

SELECT ... FROM table WHERE rowid > :ROWID ORDER BY rowid LIMIT 20;

Begin with 0, then rebind with the last rowid retrieved to get the next slice. You can use any unique index, both in the ORDER BY and in the "restart constraints". Re-doing work only occurs if you insist on ordering by something not already in an index.
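Combined with sha3_query(), that same idea gives a resumable hash: hash one slice, remember the last key it covered, and record the per-slice digests. A sketch, with a hypothetical table name and slice size:

-- hash the slice that starts just after the last processed rowid
-- (replace 0 with the saved bookmark on later runs)
SELECT hex(sha3_query(
  'SELECT * FROM table_to_hash WHERE rowid > 0 ORDER BY rowid LIMIT 100000'));
-- bookmark for the next run: the last rowid covered by that slice
SELECT max(rowid) FROM (
  SELECT rowid FROM table_to_hash WHERE rowid > 0 ORDER BY rowid LIMIT 100000);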
(14) By Keith Medcalf (kmedcalf) on 2022-10-17 15:33:07 in reply to 12 [link] [source]
Every and any algorithm which takes something that is N bits of entropy long and reduces it to M bits of entropy, where M < N, will necessarily have collisions.
This is simple physics and mathematics and cannot be changed without altering the physical and mathematical laws under which we have observed that the Universe operates.
SHA3 reduces something which is N bits (N > 512) to a 512-bit summarization. Therefore, there will be collisions. It is a mathematical certainty that if the original documents were 513 bits long, and you were generating a 512-bit hash, you would encounter 13407807929942597099574024998205846127479365820592393377723561443721764030073546976801874298166903427690031858186486050853753882811946569946433649006084096 collisions in that data.
(15) By anonymous on 2022-10-19 04:46:13 in reply to 10 [link] [source]
To add to what Gunter suggests, use an OFFSET clause. This will allow sqlite to have a starting point.
The only other thing I would change about what Gunter advised is that I suggest you think about which index is stable for you. ROWID could be different if, for example, you deleted and inserted rows (among other things). Thus I would suggest that a stable index (that is, a key that identifies the same row's data even after upsert, update, etc.) is preferable, i.e. a defined unique primary key.
Given that your select's WHERE clause uses such an index (and the select has the ORDER BY and LIMIT/OFFSET I suggested previously), I would be very much surprised if it were slower than greased lightning^2. But you should test your performance hypothesis in all cases.
(16) By Gunter Hick (gunter_hick) on 2022-10-19 06:29:06 in reply to 15 [link] [source]
I must strongly advise against using an OFFSET clause. The OFFSET clause is implemented as a counter that gets decremented every time a result row is produced; the rows not returned still have to be produced. This is exactly re-doing the work.

CREATE TEMP TABLE x (r INTEGER PRIMARY KEY, i INTEGER, t TEXT);
CREATE UNIQUE INDEX x_i ON x(i);

(1) SELECT * FROM x ORDER BY t LIMIT 1000,10;
This is the slowest possible way. First, it has to read all of the rows, sort them, discard 1000 rows and return 10 rows.

(2) SELECT * FROM x ORDER BY i LIMIT 1000,10;
This is better because it avoids the sort, but it still has to discard 1000 rows while reading the index and then retrieve the result rows from the table.

(3) SELECT * FROM x ORDER BY r LIMIT 1000,10;
Still better, because the rowid is the fastest for retrieval.

(4) SELECT * FROM x WHERE t > :t ORDER BY t LIMIT 10;
This is the slowest possible non-skipping way, because it has to read all the rows matching the constraint, order them, and then return the first 10. Note that (1) runs slower the higher the offset is (because more result rows are skipped), whereas (4) runs faster the "further down" the table you are (because more rows are discarded before sorting).

(5) SELECT * FROM x WHERE i > :i ORDER BY i LIMIT 10;
Very much faster, because it seeks to the starting point in the index and produces rows from there.

(6) SELECT * FROM x WHERE r > :r ORDER BY r LIMIT 10;
Still faster, because it omits the index lookups needed by (5).

If each chunk is processed in a separate transaction, intervening INSERT and DELETE statements may cause some rows to not be processed (depending on where in the total sequence the operation falls relative to chunk processing). And UPDATEs that change the location of a record may cause that record to be processed anywhere from never (i.e. it skips "backwards") to infinitely many times (it always skips "forwards").
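You can see the difference in the query plans (exact wording varies by SQLite version):

EXPLAIN QUERY PLAN SELECT * FROM x ORDER BY r LIMIT 1000,10;
-- SCAN x                                        (every skipped row is still produced)
EXPLAIN QUERY PLAN SELECT * FROM x WHERE r > :r ORDER BY r LIMIT 10;
-- SEARCH x USING INTEGER PRIMARY KEY (rowid>?)  (seeks straight to the starting point)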
(17) By ddevienne on 2022-10-19 07:36:45 in reply to 16 [link] [source]
On that topic, see also Reasonably efficient pagination without OFFSET
(18) By anonymous on 2022-10-19 14:05:51 in reply to 16 [link] [source]
From Gunter's excellent post on OFFSET, it's clear that any approach taken should be benchmarked.
The tradeoff on ROWID would have to be considered.
A mechanism such as marking rows as completed and resetting the mark with disciplined insert/update etc. could address the row selection, but it changes the original constraints; so do generated columns, etc.
Implementing the offset with a WHERE clause could make sense as well, benchmarking the app of course.
The last approach improves index stability vs. ROWID, but is slower than ROWID.
The programmer will have to choose among the various tradeoffs.
(19) By adrian on 2022-10-19 15:24:41 in reply to 17 [link] [source]
That post gives this code:
SELECT foo, bar, baz, quux FROM table
WHERE oid NOT IN ( SELECT oid FROM table
ORDER BY title ASC LIMIT 50 )
ORDER BY title ASC LIMIT 10
Am I reading this correctly? It is giving rows 51-60? And this is faster than OFFSET because...?
(20) By adrian on 2022-10-19 15:36:18 in reply to 16 [link] [source]
Thx Gunter,
I'm a little unclear on the WHERE clause. What does i > :i mean? I don't understand what the colon is for.
(21) By anonymous on 2022-10-19 15:38:04 in reply to 20 [link] [source]
It is just a bind placeholder. Substitute the right index for the pagination
(22.1) By adrian on 2022-10-19 15:47:43 edited from 22.0 in reply to 13 [link] [source]
Hi Gunter,
Is this useful between sqlite connection sessions? From what I'm reading in our code, we're doing:
SELECT * FROM table ORDER BY column LIMIT -1 OFFSET last_row_processed;
From there, we continue our SHA-generating algo, going row by row; if interrupted by a shutdown command, we store last_row_processed and the current SHA value. We don't use rowid because we are using another column for a primary key.
Are you saying that we should change this to:
SELECT * from table WHERE column > column_value_of_last_row_processed ORDER BY column LIMIT -1;
Thx
(23) By adrian on 2022-10-19 16:16:22 in reply to 14 [link] [source]
I also heard that it's slower on Intel systems, which makes the speed that I'm getting from SQLite's SHA implementation all the more impressive, and suspect. I'm going to have to go through the code to see what's happening, unless someone else has experience with this and could give me the Cliff's Notes.
(24) By David Raymond (dvdraymond) on 2022-10-19 16:42:04 in reply to 22.1 [link] [source]
OFFSET means you have to produce every single record up to the one you want to start with. Think of a nice big old-fashioned phone book with pages full of people's names and numbers, listed in alphabetical order by name. Saying "offset 30000 limit 1000" would be the equivalent of starting at the front of the phone book and counting off 30,000 names, then starting from there. Saying "where [primary key field(s)] > [last value(s) you processed] limit 1000" is like saying "give me the next records after 'Smith, John'". You can use the index to jump right to your starting point without needing to go through every single page counting names.
(25.2) By adrian on 2022-10-19 18:01:22 edited from 25.1 in reply to 22.1 [link] [source]
So I attempted what I described above and got some rather underwhelming results:
$ time cat use_where | sqlite3 /w/tmp/hdlm.sqlite
real 0m1.676s
user 0m1.436s
sys 0m0.076s
$ time cat use_offset | sqlite3 /w/tmp/hdlm.sqlite
real 0m1.768s
user 0m1.452s
sys 0m0.154s
WHERE had slightly better performance. (Sorry, updated as I had an incorrect SELECT statement).
(26) By Chris Locke (chrisjlocke1) on 2022-10-19 18:04:58 in reply to 20 [link] [source]
Had to look it up myself...
It's for parameters. So :i is a parameter.
(27) By KIT.james (kjames3411) on 2022-10-20 02:15:04 in reply to 14 [link] [source]
Everyone knows that. He is asking whether there are issues with collisions, i.e. whether attacks have been made that prevent use of the algorithm.
(28) By Gunter Hick (gunter_hick) on 2022-10-20 06:25:01 in reply to 19 [link] [source]
I don't expect this to be faster. It is just a complicated way to avoid explicitly using OFFSET, and that complication comes at a cost.
(29) By Gunter Hick (gunter_hick) on 2022-10-20 07:06:04 in reply to 20 [link] [source]
The :i is an SQL variable (a bind parameter), so that you can prepare the statement once and rebind the last i value retrieved to get the "next page" of results.

Let's look at the effort required to page through a table of N rows at P rows per page. For simplicity, assume that P*M == N.

Using OFFSET and LIMIT: the first page is O(P), the next is O(P+P), the next is O(P+P+P), etc. up to O(N); this is a series O(P*(1+2+3+...+M)) which, according to 9-year-old Gauss, is O(P*M*(M+1)/2) or P/2*O(M^2+M), which is basically O(N^2).

Using seek and scan: each page is O(log N) to find the starting point and O(P) to produce the page. This sums up to O(M log N) + O(M*P), or O(N + N/P log N), which is basically O(N log N).

So where is the tradeoff? Using offset O and limit P in a table N rows long, O(O+P) <=> O(P + log N) determines the break-even, which reduces to O <=> log N. So if the offset is less than log N, displaying that page is probably faster, if all you want is a single page. Once the user starts paging through the results, the experience is going to get sluggish very fast.
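To make that concrete with made-up numbers: paging through N = 1,000,000 rows in chunks of P = 1,000 (so M = 1,000 chunks),

OFFSET/LIMIT:  P*M*(M+1)/2 = 1000*1000*1001/2 = 500,500,000 rows produced (and mostly discarded)
seek and scan: N + M*log2(N) ≈ 1,000,000 + 1000*20 ≈ 1,020,000 row/index operations

i.e. roughly a 500x difference for one full pass over the table.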
(30) By Ryan Smith (cuz) on 2022-10-20 09:44:27 in reply to 27 [link] [source]
In that case the answer is easy: No, it's not a problem.
TLDR: Collisions will exist, guaranteed, for ANY algorithm, but that does not imply insecurity.
What Keith is saying is that collisions are a fact of life for ANY hashing algorithm, but that does not imply vulnerability to attack, which is why Richard asked where the OP read this: whoever wrote it probably did so in service of click-baiting the reader. SHA1 is as strong a hash as is needed for standard secure data hashing.
Mathematically one can state that for an infinite set of random datasets with minimum length dL, there exists an infinite set of collision hashes for any hashing algorithm that produces a digest of length hL where hL < dL.
MD5, which is very fast, used to be the de facto hash when the need to hash online hosted files became apparent. When people with more modern, fast computers showed it was possible to engineer a file that would produce the same MD5 digest as another, i.e. force a collision on human timescales, we all collectively decided it was time to move on to something harder to force a collision on. SHA1 is/was one such option, but it came at the price of all the computers in the world now having to work slightly harder to compute the hashes. (This change alone probably upped the world's computing energy consumption by a MWh per year.)
Of course, since then some people have been able to force a file collision on SHA1 too, but with equipment worth millions and at considerable time and effort, and they promptly screamed that "SHA1 is now completely unsafe"[1], which is nonsense. Being able to produce two files with the same hash is certainly a security concern for someone hosting files, but there has never been a password broken by brute-forcing SHA1 (i.e. generating ranges of potential passwords until one of them produces the same SHA1 as another); in fact, it's nearly infinitely faster to just brute-force different passwords until one matches the original.
It's reminiscent of a situation in which everyone in the world used the same pick-proof lock mechanism on their doors, and then one day, after trying to pick it with extremely expensive lockpicks for more than a year, some people finally opened the one lock, and everyone in the world went "Amagad! anyone can open any door at any time now! we all gotta change locks now!". It's nonsense. This analogy even overstates what they actually achieved: they did not break any hash, they simply kept adjusting a file's content until they found file data that produced the same hash, an eventuality which, as pointed out by the math statement above, is a given. Note also that this only involves cases where the hashed file/data is directly accessible and someone is trying to craft data with a colliding hash; it says nothing about decrypting data or about any use case where access is protected by standard brute-force protection, such as rate limits or retry limits.
I will consider it compromised one day when they show it possible to decrypt data encrypted with a SHA1 coded key where the physical key is longer than the SHA1 code (i.e. brute-forcing the hash, not the password), like the sort of use-case we use it for in database applications. In fact I believe in such a case, brute-forcing the encrypted key digest bytes would actually be faster.
Certainly for everyday database hashing, SHA1 is perfectly safe, but SHA256 and 512 can be used (at considerably more computing cost) if you really feel uneasy.
My advice: Don't feel uneasy - it's just a feeling.
[1] See this Example article claiming "SHA1 is now completely unsafe", making such ignorant statements as "If a weakness is found in a hash function that allows for two files to have the same digest, the function is considered cryptographically broken". Such drivel - engineering two files/datasets with the same hash is an absolute certainty given a data length wider than the hash digest itself (the very principle that makes Bitcoin mining collisions a problem). None of that implies any hash algorithm insecurity or "brokenness". In fact, one could state the strength of a cryptographic algorithm by how long it takes on average (in CPU cycles) to engineer a longer dataset that produces a collision with another - but collisions WILL exist, guaranteed, for any algorithm.
(31) By Donal Fellows (dkfellows) on 2022-10-20 16:37:48 in reply to 30 [link] [source]
The problem with a hashing algorithm comes not when there are any collisions at all (hash algorithms have collisions by the pigeonhole principle), but rather when, given a particular hash value, you are able to generate a document with whatever contents you want in it that has that hash value. The task is much easier when the document format includes anything that isn't readily visible in the result (hidden attributes, comments, etc.) but should still be very difficult anyway.
MD5 is broken because that property has now been established. SHA-1 is also somewhat weak to it now, though it is more difficult to do there. SHA-512 is... really not likely to have many collisions in the first place, as it uses many more bits for the hash code and has a very strong algorithm. Collisions are possible there, but very very unlikely to be a problem for now; the nominated-text failure mode is nowhere close with it.
(32) By Ryan Smith (cuz) on 2022-10-20 18:10:04 in reply to 31 [link] [source]
The problem with a hashing algorithm comes not when there are any collisions at all (hash algorithms have collisions by the pigeonhole principle), but rather when, given a particular hash value, you are able to generate a document with whatever contents you want in it that has that hash value.
Yes, but that is just the problem, it is NOT realistically possible with SHA1 - even saying it is "somewhat vulnerable" is vastly overestimating the ability to engineer a document with a specific hash. The team at Google with their insane multi-million-dollar equipment took months and 9+ quintillion hashing operations to engineer 1 single PDF doc that matched the hash of another doc... and if you change 1 byte inside that original doc, you have to start the process all over - so no, it is not within the grasp of the average anyone to willy nilly pop out naughty files with designer hashes of the SHA1 variety. At least not for many years to come.
That said, as stated before, if you are going to host files for which you do care about veracity of content, absolutely please do use a hashing algorithm higher than SHA1 (Google may just hate you enough to put their resources behind faking one of your files), but as for your DB data hashes, you can sleep easy, it ain't about to be broken and calling for any DB engine to abandon/replace SHA1 hashing is premature.
It is worth noting that the "weakness" in MD5 was not that hashes could be brute-forced to find a duplicate (collision) - that is possible with all hashing algorithms given enough CPU cycles - but rather that an algorithm was found by which the CPU cycles required to find a duplicate hash were greatly reduced, bringing it within reach of even less powerful computers in realistic time-frames. Now Google demonstrated an algorithm that does reduce the number of hashes needed to find a duplicate for SHA1, but not enough to make an attack plausible by any average or even very-above-average computer. You still need those quintillions of hashes.
The only thing that happened is that we now know roughly how strong SHA1 is, requiring around 9 Quintillion hash operations (and data morphs) to find a set of bytes which produce the same SHA1 key as another, whereas we do not know the number for SHA3/512 yet, they're still behind the horizon, but they absolutely do all have a number.
The most important thing here, which I can't emphasize enough, is however hard it is to find a file-collision, even if it happens, it only affects data veracity checks, checking if a file or bag-of-bytes is the precise bag-of-bytes the creator produced and not some other bag-of-bytes. It has no bearing whatsoever on hashes/hashed data/keys etc. as you may typically use in a DB.
SHA1 isn't "broken", not yet at least.
(33) By Warren Young (wyoung) on 2022-10-20 18:12:17 in reply to 32 [link] [source]
The team at Google with their insane multi-million-dollar equipment…
…was the first attack. The second one was done with $45k in costs, 3 years ago.
(34) By Ryan Smith (cuz) on 2022-10-20 18:52:55 in reply to 33 [link] [source]
Those were chosen-prefix attacks, not pre-image attacks (which is what is required to successfully spoof an untouched file with a known hash, i.e. what Google achieved).
But sure, I will say that if the attacker has the original file (not just its hash) and can manipulate the data (prefixing it) for the original file before publishing the hash (let's assume the attacker found a way) and then create a bad file with the same prefix and manipulated hash, then sure, you can have it for $45K. I'm however not really counting that, since if an attacker has that much access to your pre-published file data, surely it's much cheaper for them to simply mess with the files directly than fork over $45K.
(35) By Donald Griggs (dfgriggs) on 2022-10-20 18:54:23 in reply to 9 [link] [source]
If I'm thinking clearly here (and correct me if I'm not), there's a big difference between the use of a hash as Adrian (OP) requires and the need for a cryptographically secure hash.
When you just need a hash to detect changes in data under your control, you only care that the algorithm is fast, that the hash bit size is sufficient, and that the hashes are not terribly lopsided, i.e. for two non-identical blocks of data, the chance of them returning the same hash is not much bigger than
1 / 2 ** hashBitsize
Even if the data is not always under one's control, if a bad actor could cause bad things by cracking the hash, yet that bad actor could more easily just change other things without cracking the hash, then again a secure hash is not necessary, and in fact might be harmful by providing a false impression of security.
So a generally-regarded-as-unsafe algorithm such as MD5 would be wholly suitable for Adrian's purpose, regardless of how easy to crack -- now or in the future.
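For a sense of scale (my arithmetic, not from the post above): even for the 160-bit SHA1,
1 / 2 ** 160 ≈ 6.8e-49
so the chance of two specific non-identical blocks accidentally sharing a hash is negligible for change-detection purposes.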
(36) By anonymous on 2022-10-20 19:00:38 in reply to 15 [link] [source]
Clarification to the above:
Yes to what Gunter suggests regarding the OFFSET clause, if you are going to have small chunks and many of them.
My advice: whatever method you use, if it's going to have skips/misses (not related to hash collisions) it will be less than confidence-inspiring. If it's going to be less than confidence-inspiring anyway, then code it up quickly and get on with solving the larger issue of managing the updates. In short, make it work correctly, then worry about making it faster. The worries about collisions and speed seem way premature (IMO).
I suspect a properly designed mechanism would make relative performance a non-issue.
(37.1) By Adrian Ho (lexfiend) on 2022-10-21 10:37:26 edited from 37.0 in reply to 25.2 [link] [source]
Regardless of what you expected, that's not how proper benchmarking is done. At the least, your first query primed your OS filesystem cache with your DB, so your second query actually had an "unfair advantage". Also, other stuff running on your machine could've slowed either or both tests.
Run each query multiple times, preferably with a tool like hyperfine that gives you warmups, statistical outlier detection and other Nice Things, to get a far better comparison.
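Something along these lines (a sketch using the file names from upthread):

hyperfine --warmup 3 \
  'cat use_where | sqlite3 /w/tmp/hdlm.sqlite' \
  'cat use_offset | sqlite3 /w/tmp/hdlm.sqlite'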
(38) By adrian on 2022-10-24 11:34:18 in reply to 35 [link] [source]
Yes, thx Donald. We are assuming that the update connection is secure; we just need a way of validating that the DB update we are sending will result in the DB that we have on our server.
(39) By anonymous on 2022-10-24 22:11:56 in reply to 38 [link] [source]
Given that you're essentially looking for a checksum rather than trying to detect malicious activity, have you considered creating the hashes as stored columns so you can just index them and hash the hashes?
Depending on where you want to spend the time to hash, update and check, doing the heavy lifting upfront during row creation/modification might work for you.
something like:
create table t(
  ID INT NOT NULL,
  SHA BLOB AS (sha3(CONTENT)) STORED, /* this could be CONTENT||other column.. */
  CONTENT ANY
) STRICT;
create index hashidx on t (ID, SHA); /* let's prepare an index to grab what we need from a covered index */
/* generate a few sample entries */
insert into t SELECT value as ID, randomblob(100) as CONTENT from generate_series(1,1000000,1); /* a million rows is a good start */
explain select ID, SHA from t order by ID; /* show that we use the shaidx */
select hex(sha3_query('select ID, SHA from t order by ID')) as 'SHA_sum'; /* do a hash with a query that uses a covered index */
Running the last select is fairly quick, yes? You can add a count of rows to mitigate the extra-malicious-row problem a bit as well.
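For instance, the row count can be returned alongside the digest in the same check (same table/column names as above):

SELECT (SELECT count(*) FROM t) AS row_count,
       hex(sha3_query('SELECT ID, SHA FROM t ORDER BY ID')) AS sha_sum;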
(40) By anonymous on 2022-10-24 23:09:34 in reply to 38 [link] [source]
And a follow-up with that structure: if your ID is UNIQUE (which it likely should be), then the select becomes:
explain select ID, SHA from t indexed by hashidx order by ID; /* give sqlite a hint it can use our covered index */
obviously, the 'explain' is just to show that it's using the desired query plan.
When run on a small sample of 1 million rows of small blobs (though the covered index won't care about the content size) the select query runs in the online fiddle in about 1 second (depending on your processor of course).
Most of the time is spent generating the million records. And again, if you are generating those records one at a time then you can pay the hashing cost at creation time, store the result, and do a hash of the hashes for a fairly quick check.
One drawback is you would have to know what columns you want to check (which arguably is not a drawback if one is from the camp that select * is prone to unexpected results from unexpected columns, column order, etc.).
You can do the whole thing, without relying on ROWID and without skips/misses, in perhaps a few seconds.
(41) By anonymous on 2022-10-24 23:14:27 in reply to 39 [link] [source]
A typo in the comment on the explain select line: it should read hashidx rather than shaidx.