CBOR Support & challenging DB design problem for keeping history with big schema changes

(1) By Vadim Goncharov (nuclight) on 2021-11-24 15:01:16 [link] [source]

Hello!

CBOR in SQLite

First, as a preliminary to main question below, I would ask for CBOR support in SQLite (I'll need CBOR as part of solution to main problem). CBOR (Concise Binary Object Representation) is an Internet Standard, RFC 7049, purposed essentially as a "binary JSON" format, targeted for simplicity and compactness (e.g. in IoT world) and stable for decades, but yet extensible (there is RFC 8949 with clarififed spec / org / extensions reg. procedures etc., without changing actual format).

What is special about CBOR among all other "binary JSONs" is (apart from being Standard and being simple to implement - look at http://cbor.io how many implementations already available, it's really implementable in few hundred lines of C code) focus on extensibility: CBOR has concept of so called tags - a non-data integer prepended to data item (or another tag). Tags are usually used for hinting a type on a bare integer/float/string, e.g. if that is URL, a date or some class. One can look at https://www.iana.org/assignments/cbor-tags for how many tags were already registered by people in years.

Tags are also used to extend the format itself. For example, one of the first such registered extensions were tags 25 and 256 for compression of repeated strings (this is a typical in arrays of objects with the same keys, for example), see spec at http://cbor.schmorp.de/stringref - it's very simple and clever trick: we count encoded strings, keeping internal array/dict, and emit tag+integer with number of string which was already serialized earlier in subdocument. And yes, using this extension show significant space savings on some data.

The SQLite JSON1 page says that binary format is still reserved because no encoding was found to be significantly smaller or faster. Now you know that things has been changed since then.

Schema design questions for changing historical data

Second, and main, question, is for - how to design a database schema for keeping historical data (presumably forever) which fields change over the time.

I am going to write a messenger client application (in Perl and Tk) for Telegram, but probably other protocols may be supported later, too, especially Web ones (even Fossil forums, eh?) - if the problems described below are solved.

The main goal is to keep messages and other info in local databases forever. You may be know that such entities keep their info on server, either Web one or accessed via special API, as in Telegram, not n the user's machine - and user thus may suddenly loose their digital life or part of it. Some (notably Telegram) even allow one user to delete all messages in dialog with another user for that user - imagine that was something valuable, or opposite, bad, and now you lost everything and can't go to court with it. Thus I want to write client that saves everything (well, except of big files) and restores fundamental Netizen's right to retain every information received. This was modus vivendi in old days with logs, but for displaying today's complex data in GUI - plain text logs are not enough.

Data for such messages in such protocols can be roughly expressed as trees, JSON or like. I will use CBOR (essentially, I forced to use it instead of JSON) because of:

There are some fields that are binary - in JSON one need crutches such as base64.
The information in each object in Telegram is of some type, or class in OOP terms: CBOR allows to save it by means of e.g. tag 26, and in JSON one again need invent crutches like fields with reserved names - and hope that owners of API, which you do not control, will not clash with this name; this is of course unsuitable at "forever" distance.
Long term storage will require space, and CBOR allows aforementioned compression of strings, which I want to augment even further by storing once a set of data which is virtually prepended to each BLOB in DB to get compression from this start, not from scratch - like dictionary in zlib, but zlib allows only 32768 bytes dictionaries, too small.

Okay, the problem of SQLite not supporting CBOR is easy - may be solved by my application function. Now the real problem.

Suppose you have created tables/columns for current API version, and stored your message. Then, API is changed - Telegram do it once a month on average - and you have another fields now. Existing applications "solve" this either but not storing messages locally at all or by keeping only texts. Not storing is simple - messages are just re-requested each time, think of it like IMAP or forum's Web page. Suppose you update your DB schema according to protocol's API, and re-requesting messages from server. Now consider:

What happens if some message was deleted from server? The whole goal of my project is to keep them for such case.
Updating DB schema for complex things is a pain and requires a trained DBA - this is not what ordinary home user can or will do, and taking SQLite is also because of this unattended mode reason.

This means instead of columns may be JSON or similar documents, but here is next big problem - INDEXes. How to do this when your fields not only added or deleted (this could be with pain, but split to columns), but when a same-named field could change type and even became an object split to three fields instead of one integer?

And even if this split/rename solved somehow, then, how to index individual fields of JSON or result of application function ?

Next big problem, saving historical data means you also save edited versions of messages. Essentially, this becomes like a Fossil or specialized VCS for messaging data. But observe that not everything must be saved - an edited update on the API may be just for a typo or for one integer for votes. This opens a DoS vector if every such update is saved (servers of course keep only last version).

Also, consider we have to save and version not only messages, but some other objects, too - notably, information about user. And use of that time version when displaying messages of that time - for example, entire discussion (or trolling) may go about someone's avatar which was changed later; showing later version avatar won't help to understand it.

Again on compression, suppose you have several accounts on one protocols, or even several protocols, and want to forward messages between them, or observe other users doing such forwards or copy-paste. This is why I tend to do this multiprotocol - for keeping each piece of info on one DB only once. However, this immediately produce privacy/security/access problem - e.g. in Telegram, each link to other entity, user, attached file or such, is augmented with a token individual for this user (called access hash or file reference); this is also to help fight spam bots. So now you also need:

store most of fields of message identical on several accounts, thus in single copy
store some fields per-copy, and may pe probably unversioned

Given all problems above, and considering that even "overall" structure may change over time - e.g. a year ago Telegram added "comments" to messages, so if previously a "message" entity would be just contained in "chat" entity, now may be 3 levels, and also, same message could now belong to different "containers". This leads to thought that if all these were solved and make a little step further, scheme could be generalized not only to Telegram but many other protocols (this what really lead me to multiprotocol idea, which I already spoilered earlier).

The things are complicated with that, while I was taught some relational algebra and heard of 3rd normal form, I am more of system programmer and not an SQL expert :) E.g. I don't know how to (efficiently) express trees in SQL database...

I tried to think the following way: imagine, we are three decades ago, before SQL standard, and having only those Unix instruments. Then, we could create a structure in a filesystem, with directories as containers, meaning protocol/account/chat/forum/thread and files as individual objects - essentially, inodes, to allow "hard links". Each level in "path", that is directory name, is either numeric ID (enough in simple models like Telegram) or real name (think forums, Slashdot, Reddit). And file (inode data) would be RCS with just one branch - time. Then try to translate this "filesystem" to SQL tables - this may also help to solve indexing "inodes" by time issue. My thoughts stopped here as I don't know how to do such tree better, and what to do with individual fields, still.

I observed DB structure of Fossil repo, but this is probably too generic (also, why it uses strings like ticket attrs as strings and not IDs in other table to JOIN? Are there performance issues?). However, may be I'll need to use Fossil's Delta function, but don't know how to couple this with FTS3/FTS4 (may be FTS5?) for text of messages...

Any ideas? Help much appreciated.

WBR, nuclight

(2) By ddevienne on 2021-11-24 16:02:48 in reply to 1 [link] [source]

FWIW, regarding CBOR, this was brought up a long time ago when JSON1 was introduced,
and it's unlikely to come from the SQLite team itself, given the discussion at the time.
But nothing prevents you from doing it yourself, JSON1 is an extension,
and a CBOR extension can be similarly written, using the same techniques
(a bunch of set-returning functions, i.e. eponymous virtual tables, in SQLite speak :))

Sure, JSON1 comes from Richard, and is part of the amalgamation,
so it has 1st-party status. But since CBOR1 is unlikely to come
as 1st-party, then do it as 3rd-party in OSS, and if it's good,
and unlike the discussion at the time bring some speed-advantage
over JSON1, maybe it will pick-up a following. FWIW.

(3) By Vadim Goncharov (nuclight) on 2021-11-24 16:25:49 in reply to 2 [link] [source]

was brought up a long time ago when JSON1 was introduced, and it's unlikely to come from the SQLite team itself, given the discussion at the time [...] and unlike the discussion at the time bring some speed-advantage over JSON1

I presume a long time ago mentioned tag extensions did not exist yet, and it is what makes CBOR unique amongst all "binary JSON" variants, which are, in a nutshell, usually are not better much. So that discussion is most probably outdated, given that reason is not speed but features:

compression
ability to express things impossible in JSON (binary values, type info)
extensibility (JSON itself and all other binary JSONs are not extensible)

However, my main questions are not about CBOR support - if I could plug a function of my language's implementation into SQLite, this will be just fine.

The main question is how to build indexes on such, and AFAIK in JSON1 such indexes are also not possible.

(4) By ddevienne on 2021-11-24 17:29:27 in reply to 3 [source]

OK, I'll bite another time. No one said I was a reasonable man :)

You're going a bit all over the place Vadim, and most of what you write
is not specific to SQLite, so a bit Off Topic here IMHO. But in any case,
here is maybe a bit more information.

Regarding trees in SQLite, you need to look into CTEs, which allow to write recursive queries.
That's how one deals with filesystem-like hierarchies.

Regarding indexing CBOR, well, extensions can create shaddow tables,
which store custom preprocessed data used by custom MATCH operators.
That's how the various FTS implementations do it I believe, I'm not
very versed in those, I never delved into their code. But this is definitely
an advanced subject in SQLite, which very few people have experience with I believe.

I doubt this will help you much, sorry. Good luck. Over and out. --DD

(5) By Ryan Smith (cuz) on 2021-11-24 18:52:09 in reply to 3 [link] [source]

Sidestepping the CBOR request completely since I don't care one way or the other, but to comment on the secondary question:

if I could plug a function of my language's implementation into SQLite, this will be just fine.

This is pretty easy to do in SQLite, see here: sqlite.org/c3ref/create_function.html

The main question is how to build indexes on such...

Creating an Index on the result of a function has been possible in SQLite for quite some time now. So if you are successful at creating the functions you want to use (as above), then adding indexes based on them is trivial within the rules.

See here: sqlite.org/lang_createindex.html - Point 1.2 - Indexes on Expressions,
and here: sqlite.org/expridx.html - Note the section on restrictions.

If the intended functions are non-deterministic or violates any other of those rules, your only remedy will be, as Dominique suggested, making a virtual table module where you have control over indexing. This can be quite involved but is still very possible and example modules abound.

See here: sqlite.org/vtab.html
and here: sqlite.org/c3ref/module.html.