SQLite Forum

CBOR Support & challenging DB design problem for keeping history with big schema changes

### CBOR in SQLite

First, as a preliminary to main question below, I would ask for CBOR support in SQLite (I'll need CBOR as part of solution to main problem). CBOR (Concise Binary Object Representation) is an Internet Standard, RFC 7049, purposed essentially as a "binary JSON" format, targeted for simplicity and compactness (e.g. in IoT world) and **stable for decades**, but yet extensible (there is RFC 8949 with clarififed spec / org / extensions reg. procedures etc., without changing actual format).

What is special about CBOR among all other "binary JSONs" is (apart from being Standard and being simple to implement - look at http://cbor.io how many implementations already available, it's really implementable in few hundred lines of C code) focus on extensibility: CBOR has concept of so called _tags_ - a non-data integer prepended to data item (or another tag). Tags are usually used for hinting a type on a bare integer/float/string, e.g. if that is URL, a date or some class. One can look at https://www.iana.org/assignments/cbor-tags for how many tags were already registered by people in years.

Tags are also used to extend the format itself. For example, one of the first such registered extensions were tags 25 and 256 for compression of repeated strings (this is a typical in arrays of objects with the same keys, for example), see spec at http://cbor.schmorp.de/stringref - it's very simple and clever trick: we count encoded strings, keeping internal array/dict, and emit tag+integer with number of string which was already serialized earlier in subdocument. And yes, using this extension show significant space savings on some data.

The SQLite JSON1 page says that binary format is still reserved because no encoding was found to be significantly smaller or faster. Now you know that things has been changed since then.

### Schema design questions for changing historical data 

Second, and main, question, is for - how to design a database schema for keeping historical data (presumably forever) which fields change over the time.

I am going to write a messenger client application (in Perl and Tk) for Telegram, but probably other protocols may be supported later, too, especially Web ones (even Fossil forums, eh?) - if the problems described below are solved.

The main goal is to keep messages and other info in local databases forever. You may be know that such entities keep their info on server, either Web one or accessed via special API, as in Telegram, not n the user's machine - and user thus may suddenly loose their digital life or part of it. Some (notably Telegram) even allow one user to delete all messages in dialog with another user for that user - imagine that was something valuable, or opposite, bad, and now you lost everything and can't go to court with it. Thus I want to write client that saves everything (well, except of big files) and restores fundamental Netizen's right to retain every information received. This was _modus vivendi_ in old days with logs, but for displaying today's complex data in GUI - plain text logs are not enough.

Data for such messages in such protocols can be roughly expressed as trees, JSON or like. I will use CBOR (essentially, I forced to use it instead of JSON) because of:

1) There are some fields that are binary - in JSON one need crutches such as base64.
2) The information in each object in Telegram is of some type, or class in OOP terms: CBOR allows to save it by means of e.g. tag 26, and in JSON one again need invent crutches like fields with reserved names - and hope that owners of API, which you do not control, will not clash with this name; this is of course unsuitable at "forever" distance.
3) Long term storage will require space, and CBOR allows aforementioned compression of strings, which I want to augment even further by storing once a set of data which is virtually prepended to each BLOB in DB to get compression from this start, not from scratch - like dictionary in zlib, but zlib allows only 32768 bytes dictionaries, too small.

Okay, the problem of SQLite not supporting CBOR is easy - may be solved by my application function. Now the real problem.

Suppose you have created tables/columns for current API version, and stored your message. Then, API is changed - Telegram do it once a month on average - and you have another fields now. Existing applications "solve" this either but not storing messages locally at all or by keeping only texts. Not storing is simple - messages are just re-requested each time, think of it like IMAP or forum's Web page. Suppose you update your DB schema according to protocol's API, and re-requesting messages from server. Now consider:

1) What happens if some message was deleted from server? The whole goal of my project is to keep them for such case.
2) Updating DB schema for complex things is a pain and requires a trained DBA - this is **not** what ordinary home user can or will do, and taking SQLite is also because of this unattended mode reason.

This means instead of columns may be JSON or similar documents, but here is next big problem - **INDEXes**. How to do this when your fields not only added or deleted (this could be with pain, but split to columns), but when a same-named field could change type and even became an object split to three fields instead of one integer?

And even if this split/rename solved somehow, then, **how to index individual fields of JSON or result of application function** ?

Next big problem, saving historical data means you also save edited versions of messages. Essentially, this becomes **like a Fossil or specialized VCS for messaging data**. But observe that not everything must be saved - an edited update on the API may be just _for a typo_ or for _one integer for votes_. This opens a DoS vector if _every_ such update is saved (servers of course keep only last version).

Also, consider we have to save and version not only messages, but some other objects, too - notably, information about user. And use of that time version when displaying messages of that time - for example, entire discussion (or trolling) may go about someone's avatar which was changed later; showing later version avatar won't help to understand it.

Again on compression, suppose you have several accounts on one protocols, or even several protocols, and want to forward messages between them, or observe other users doing such forwards or copy-paste. This is why I tend to do this multiprotocol - for keeping each piece of info on one DB only once. However, this immediately produce **privacy/security/access** problem - e.g. in Telegram, each link to other entity, user, attached file or such, is augmented with a token individual for this user (called _access hash_ or _file reference_); this is also to help fight spam bots. So now you also need:

1) store most of fields of message identical on several accounts, thus in single copy
2) store _some_ fields per-copy, and may pe probably unversioned

Given all problems above, and considering that even "overall" structure may change over time - e.g. a year ago Telegram added "comments" to messages, so if previously a "message" entity would be just contained in "chat" entity, now may be 3 levels, and also, same message could now belong to different "containers". This leads to thought that if all these were solved and make a little step further, scheme could be generalized not only to Telegram but many other protocols (this what really lead me to multiprotocol idea, which I already spoilered earlier).

The things are complicated with that, while I was taught some relational algebra and heard of 3rd normal form, I am more of system programmer and not an SQL expert :) E.g. I don't know how to (efficiently) express trees in SQL database...

I tried to think the following way: imagine, we are three decades ago, before SQL standard, and having only those Unix instruments. Then, we could create a structure in a filesystem, with directories as containers, meaning protocol/account/chat/forum/thread and files as individual objects - essentially, inodes, to allow "hard links". Each level in "path", that is directory name, is either numeric ID (enough in simple models like Telegram) or real name (think forums, Slashdot, Reddit). And file (inode data) would be RCS with just one branch - time. Then try to translate this "filesystem" to SQL tables - this may also help to solve indexing "inodes" by time issue. My thoughts stopped here as I don't know how to do such tree better, and what to do with individual fields, still.

I observed DB structure of Fossil repo, but this is probably too generic (also, why it uses strings like ticket attrs as strings and not IDs in other table to JOIN? Are there performance issues?). However, may be I'll need to use Fossil's Delta function, but don't know how to couple this with FTS3/FTS4 (may be FTS5?) for text of messages...

Any ideas? Help much appreciated.

WBR, nuclight