SQLite User Forum

Beginner questions: db in scm; avoiding a uuid primary key

(1) By stan423321 on 2022-02-01 15:47:43

Thanks for taking the time to read this.

I would like to develop, essentially, a simple graph editor. The use case involves diffing and merging graphs in a source control environment or equivalent. I cannot rely on users coming up with unique names for nodes while making their additions to a graph, because historically such names have caused collisions often enough. Based on my limited experience, a UUID-based node identifier seems adequate. But at 128 bits, a UUID is too long to fit in an SQLite rowid.

Q1. If one vacuums an SQLite database, is it suitable for source control, or should I consider a separate export/import format for that purpose, because of unreproducible layout or any other reason?

Q2. Is there a standard SQLite-community alternative to UUIDs that fits in a rowid while staying reasonably collision-free? It otherwise seems like the natural primary key pick. Some ideas and why I am not satisfied with them:

  • a Twitter-style Snowflake ID fits in 63 bits with a huge timestamp error margin, but it relies on central enumeration of users (machine IDs), which may be annoying in practice, and I don't see how to extend that field enough to mitigate the problem while keeping enough of the timestamp and still fitting into 64 bits (a rough sketch of what I mean follows this list)
  • WITHOUT ROWID tables may be problematic if nodes are big, and they probably will be
  • some sort of internal table remapping UUID prefixes to whatever is actually stored in the document sounds like an easy source of bugs
  • separating the node ids used for diff/merge from the database primary key sidesteps most of this, but with that done I cannot use e.g. the session extension to implement diff/merge
  • not caring, and deploying a rowid table with an implicit primary key, may not work with sessions either
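
To make the first bullet concrete, here is roughly the kind of 63-bit identifier I am picturing, with random bits standing in for Snowflake's machine id; the small random field is exactly the part I distrust:

    import os, time

    def snowflake_like_id() -> int:
        # 41-bit millisecond timestamp (custom epoch) in the high bits,
        # 22 random bits below it in place of the machine id + sequence.
        # 41 + 22 = 63 bits, so the result is a valid positive rowid.
        epoch_ms = 1640995200000                   # 2022-01-01, arbitrary
        ts = int(time.time() * 1000) - epoch_ms    # ~69 years of headroom
        rand = int.from_bytes(os.urandom(3), "big") & ((1 << 22) - 1)
        return (ts << 22) | rand                   # 22 bits is the weak point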

Any obvious criticism unrelated to the questions is also welcome.

Thanks in advance.

(2) By Richard Damon (RichardDamon) on 2022-02-01 16:38:53 in reply to 1

The first question is simple: an SQLite database will be a single amorphous blob to most source control systems, so merging of differences will not be available unless YOU do something to provide it.

For your second, what I have tended to do for this sort of thing is use a 'local' non-GUID for the key, but also include a GUID column in the base record. Then, when you go to import some other data into your set, you build a temporary map of that dataset's IDs to whatever you decide to map them to in your data set, translating all foreign key references as you go.

If the GUIDs match, you map the input to the same number; if not, you map the input to a new number.

The one difficulty is if two people might assign different GUIDs to something that should be the same; then either you need a way to broadcast that ID, or you need to keep a table of foreign equivalents.
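
In outline, something like this (a sketch in Python; the node/edge tables, the guid column, and the column names are all made up for illustration):

    import sqlite3

    def import_dataset(dst: sqlite3.Connection, src_path: str) -> None:
        # Assumed schema: node(id INTEGER PRIMARY KEY, guid BLOB UNIQUE, body TEXT)
        #                 edge(a INTEGER, b INTEGER)   -- both reference node.id
        dst.execute("ATTACH ? AS src", (src_path,))
        idmap = {}                                 # their id -> our id
        for their_id, guid, body in dst.execute(
                "SELECT id, guid, body FROM src.node").fetchall():
            row = dst.execute("SELECT id FROM main.node WHERE guid = ?",
                              (guid,)).fetchone()
            if row:                                # GUIDs match: same number
                idmap[their_id] = row[0]
            else:                                  # otherwise: a new number
                cur = dst.execute(
                    "INSERT INTO main.node(guid, body) VALUES (?, ?)",
                    (guid, body))
                idmap[their_id] = cur.lastrowid
        for a, b in dst.execute("SELECT a, b FROM src.edge").fetchall():
            # translate the foreign key references through the map
            dst.execute("INSERT INTO main.edge(a, b) VALUES (?, ?)",
                        (idmap[a], idmap[b]))
        dst.execute("DETACH src")
        dst.commit()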

(3) By stan423321 on 2022-02-01 21:27:47 in reply to 2

Thank you for your feedback.

I understand that I would need to support diffing and merging of database files myself, tools or no tools. Still, the question remains whether the likes of git and fossil can handle a "snapshot" database efficiently.

To the best of my understanding, this comes down to how different two databases with identical current content can be, and normally I know the answer to be "very much", since the SQLite developers felt the need to make the special content-hash tool (dbhash). But I was wondering how much VACUUM changes things here.
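
(For what it is worth, the comparison I have in mind is over dumps rather than bytes; a minimal sketch with Python's sqlite3 module, doing with full text roughly what dbhash does with a hash:)

    import sqlite3

    def dump_sql(path: str) -> str:
        # The SQL dump is layout-insensitive: two files that VACUUM left
        # byte-different can still produce identical dumps.
        con = sqlite3.connect(path)
        try:
            return "\n".join(con.iterdump())
        finally:
            con.close()

    # same_content = dump_sql("a.db") == dump_sql("b.db")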

As for the primary key remapping, the approach you described makes perfect sense. It is just a shame that the session extension seemingly cannot be used here as-is; it seems to do something really close to what I will need to do.

The case where two users add equivalent nodes is, indeed, where other identification schemes would prevail, but it is not that different from one user later deciding that some nodes are (or just may be, while experimenting) redundant. So I will need to do something about that regardless.

(4) By Richard Damon (RichardDamon) on 2022-02-01 22:20:06 in reply to 3

As far as the source control program is concerned, ANY change in the database will likely require a complete copy to be saved, and some 'changes' that don't nominally change what the database engine sees the data as (like a vacuum) will cause this saving. Because SQLite works on fixed side records, some special diff engines might be able to record just a limited difference, but most normal SOURCE control programs don't do that, as that isn't how source files normally vary.

(5) By Larry Brasfield (larrybr) on 2022-02-01 22:51:08 in reply to 4

SQLite works on fixed side records, ...

Given that "side" records are mentioned nowhere in the docs, I presume you meant "fixed size records". Such records are not used in SQLite for user data storage. If you look here, in the table for serial types greater than 9, you will see that the record size varies with the content size for BLOBs and strings on disk.

(6) By Richard Damon (RichardDamon) on 2022-02-02 13:00:31 in reply to 5

Yes, I meant SIZE. (mobile typing can be problematic at times)

While ROWS are stored in a variable-size format, multiple rows are stored in fixed-size 'pages' that are managed and updated as a unit. (If a row is bigger than a page, it uses multiple pages.)

That means that a difference engine could just check which pages changed and record only those, and adding a record in the middle of the database doesn't shift the rest of the data in the file, as would happen if you replaced a line of text with a longer one.
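
Something along these lines could find the changed pages (a sketch; the page size sits at byte offset 16 of the file header as a 2-byte big-endian value):

    def changed_pages(path_a: str, path_b: str) -> list[int]:
        # Compare two database files page by page and report the
        # 1-based numbers of the fixed-size pages that differ.
        with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
            a, b = fa.read(), fb.read()
        size = int.from_bytes(a[16:18], "big")   # page size from the header
        if size == 1:                            # the magic value 1 means 65536
            size = 65536
        count = (max(len(a), len(b)) + size - 1) // size
        return [n for n in range(1, count + 1)
                if a[(n - 1) * size : n * size] != b[(n - 1) * size : n * size]]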

(7) By Vadim Goncharov (nuclight) on 2022-02-05 16:57:37 in reply to 1

Do you really need to get rid of the rowid? Make your PRIMARY KEY a 16-byte blob, or even make it consist of two 64-bit columns (e.g. one for a stamp and the other for a user id), and do source merging not on the binary file but on a dump - then in SQL there will be no rowid.

Aside from that, you may try to use a reduced timestamp field in the Snowflake ID (do you really expect your program to live for decades?), or use e.g. SipHash as a hash function, in a manner similar to SHA-1 in distributed version control systems...
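
A sketch of both ideas (all names invented; Python's hashlib has no SipHash, so BLAKE2b truncated to 8 bytes stands in for it):

    import hashlib, sqlite3

    con = sqlite3.connect("graph.db")
    con.execute("""
        CREATE TABLE IF NOT EXISTS node (
            stamp  INTEGER NOT NULL,    -- e.g. creation time
            author INTEGER NOT NULL,    -- e.g. per-user value
            body   TEXT,
            PRIMARY KEY (stamp, author) -- two 64-bit halves, no 16-byte blob
        )
    """)
    # an ordinary rowid table: the internal rowid never appears in a dump,
    # so dump-based merging is free of it

    def content_id(body: bytes) -> int:
        # Content-derived 64-bit id, in the spirit of SHA-1 ids in a DVCS;
        # masked to 63 bits so it remains a valid positive rowid.
        h = hashlib.blake2b(body, digest_size=8).digest()
        return int.from_bytes(h, "big") & ((1 << 63) - 1)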

(8) By stan423321 on 2022-02-06 09:35:57 in reply to 7

The problem I see with Snowflake ID is not the "decades" by themselves. If it were, one could just chop off milliseconds instead, or even replace the timestamp entirely with sequence numbers. Rather, the trouble comes from the user identity, or, in Snowflake terms, the machine ID, which Twitter can arbitrarily assign to servers; for user applications that is not ideal. PRNGs occasionally get skewed, which, combined with the birthday paradox, makes it hard to estimate a safe length that doesn't end up close to a MAC address anyway.
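
For reference, the estimate under an ideal, unskewed PRNG is easy enough; it is the skew that makes me distrust it:

    import math

    def collision_probability(n: int, bits: int) -> float:
        # Birthday-paradox approximation: chance that n ids drawn uniformly
        # at random from a 2**bits space contain at least one duplicate.
        return 1.0 - math.exp(-n * (n - 1) / 2 / 2 ** bits)

    # a million ideally random 63-bit ids:
    # collision_probability(10**6, 63)   -> ~5.4e-8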