sqlite3 comparing two databases

(1) By failingprovince on 2025-01-24 12:51:16 [link] [source]

Hi,

I'm currently writing a zig wrapper around the SQLite C library. At the moment I'm working on a snapshot testing harness.

The current implementation compares a correct snapshot on-disk against an in-memory database. Queries applied to the snapshot are applied to the in-memory database, resulting in the same database.

Since dumping the database with C APIs is not possible and sqlite3_snapshot_get requires a lot of precondition (which could translate in restrictions on the user-side) I decided to go with the sqlite3_serialize function, which should give me a general way of comparing byte content of the database file.

So I wrote the following function call:

sqlite3_serialize(snapshot_connection, 0, null, 0);

But the bytes returned are SQLite format 3. I figured out that this is the header of a database file, but I'm sure the snapshot database also contains tables and values, so this doesn't make sense to me.

So I tried to see the size of the write size, which matches the size of the database file but not the size of the returned bytes.

I also tried to play a bit with the params:

sqlite3_serialize(snapshot_connection, "main", null, 0);
sqlite3_serialize(snapshot_connection, "temp", null, 0);

But they return either a null pointer or SQLite format 3.

I can't figure out what I'm doing wrong.

If someone has suggestions on better techniques for testing statements execution by my wrapper or my approach is just wrong, do not hesitate to share it with me.

(2) By Stephan Beal (stephan) on 2025-01-24 13:02:30 in reply to 1 [link] [source]

So I tried to see the size of the write size, which matches the size of the database file but not the size of the returned bytes.

You are presumably trying to inspect the returned database image with a tool which is stopping at the first non-printable character (one of which immediately follows the "SQLite format 3" header).

The size of the returned blob is returned via the 3rd argument to serialize(), which you have left out. You need something like:

sqlite3_int64 sz = 0;
unsigned char * m = sqlite3_serialize(snapshot_connection, "main", &sz, 0);

Then check the value of sz.

Also, the docs do not state that a value of 0/null is legal for the 2nd argument. You need to provide the name of a schema to back up.

(3) By failingprovince on 2025-01-24 14:05:05 in reply to 2 [source]

You are presumably trying to inspect the returned database image with a tool which is stopping at the first non-printable character (one of which immediately follows the "SQLite format 3" header).

The issue was that I tried to span the pointer into a slice with a function that assumed C pointers are null-terminated.

What seemed weird to me was the written size not matching the actual pointer size, whose len function also assumed C pointers were null-terminated XD.

Thanks for the hint on the right direction!

Also, the docs do not state that a value of 0/null is legal for the 2nd argument. You need to provide the name of a schema to back up.

Btw I found this possibility in the source code, not being sure what schema to pass, I just decided to let the API chose for me.

[amalgamation, sqlite3.c:53774]
if( zSchema==0 ) zSchema = db->aDb[0].zDbSName;

This code should infer the schema from the first argument, right?

(4) By Stephan Beal (stephan) on 2025-01-24 14:12:07 in reply to 3 [link] [source]

This code should infer the schema from the first argument, right?

That's indeed the case. We'll get the docs updated to either reflect that or (if Richard prefers) declare that behavior as undefined. Nobody should need to go look at the code to figure out what a given interface is supposed to do.

(5.1) By failingprovince on 2025-01-24 15:59:14 edited from 5.0 in reply to 4 [link] [source]

Btw does it make sense from your point of view to test a snapshot like this? Because I'm not sure 100% that the same queries will give the same byte-to-byte result, do they? Or am I seeing it too much deterministically?

If that's the case, do you have any suggestions on how to tackle the problem?

Edit:

I checked out the hash of two databases where the same queries were ran, they match.

Nobody should need to go look at the code to figure out what a given interface is supposed to do.

Imho I think people should look at both documentation AND code when they have a doubt. I for sure do that all the time.

It helps understanding the tools used and also learn a trick or two from smarter and skillful people.

(6) By Stephan Beal (stephan) on 2025-01-24 16:37:13 in reply to 5.1 [link] [source]

Because I'm not sure 100% that the same queries will give the same byte-to-byte result, do they? ... I checked out the hash of two databases where the same queries were ran, they match.

Even when the client-side data being stored is byte-for-byte identical, there's no specific guaranty that two databases holding that same data will be byte-for-byte identical. Maybe they will, maybe they won't.

If that's the case, do you have any suggestions on how to tackle the problem?

Determining semantic equivalence requires traversing the client data (as opposed to SQLite's header, B-Tree links, and whatever other metadata is stored in the db's binary image) and performing "an appropriate comparison" on it. What that precisely entails depends on the application.

It's likely that other forum-goers have approached this problem before and can provide some concrete suggestions.

Imho I think people should look at both documentation AND code...

That's all fine and good, but they shouldn't need to in order to use this library. The doc shortcoming you've pointed out will be addressed later today (it's already edited locally, just not yet checked in).

(7) By failingprovince on 2025-01-24 16:42:17 in reply to 6 [link] [source]

Even when the client-side data being stored is byte-for-byte identical, ... Determining semantic equivalence requires traversing the client data ...

Thanks for the suggestion and the clarification.

That's all fine and good, but they shouldn't need to in order to use this library.

Yes, that makes absolutely sense.

Thanks a lot for the (really) fast help!

(8) By Roger Binns (rogerbinns) on 2025-01-24 16:44:43 in reply to 5.1 [link] [source]

Have a look at sqldiff with source here.