JSONB - Request for evaluation and comment

(1) By Richard Hipp (drh) on 2023-10-10 19:25:43 [source]

JSONB (https://sqlite.org/draft/json1.html#jsonbx) is a new binary storage format for JSON that is proposed for the next release (3.44.0) sometime toward the end of the calendar year. The advantages of using JSONB rather than RFC-8259 text JSON are:

JSONB is slightly more compact, taking up about 5% or 10% less disk space, on average.
JSONB is faster. The json_extract() function and the -> and ->> operators are more than twice as fast when their left-hand operand is JSONB instead of text JSON.

The JSONB enhancement is mostly feature-complete in that all of the planned new interfaces have been implemented. You are encouraged to download a pre-release snapshot from the Download page (https://sqlite.org/download.html) and try it out, if you can. Report back on:

Size and performance improvements or regressions in your application
Suggestions for improvement to the interfaces
Suggestions for documentation improvements

Your assistance in testing this new feature, so as to make it better, is greatly appreciated.

Note that the JSONB enhancement is still under active development. The code has not yet landed on trunk. There is a lot of optimization left to be done. The on-disk JSONB format and details of the new interfaces are all subject to change, so do not do anything with this prototype that needs to be long-lasting, as anything might change out from under you.

(2) By anonymous on 2023-10-10 23:57:43 in reply to 1 [link] [source]

Will there be any improvement with indexing on json expressions?

(3) By Nuno Cruces (ncruces) on 2023-10-11 00:44:38 in reply to 1 [link] [source]

Can you expand a bit on element types 4 and 6?

In general, would it be expected that clients (might?) ingest JSONB directly, or is it expected/assumed that clients (always?) provide standard JSON, use the jsonb scalar function to convert, and consider the JSONB format an internal SQLite implementation detail?

(7) By Larry Brasfield (larrybr) on 2023-10-11 03:03:12 in reply to 3 [link] [source]

would it be expected that clients (might?) ingest JSONB directly
... consider the JSONB format an internal SQLite implementation detail?

The expectation is that JSONB is stored in blobs in users' databases. The JSONB format, once it is released, would be stable and used to attain both storage space improvement and element access time improvement.

(8) By Nuno Cruces (ncruces) on 2023-10-11 08:44:23 in reply to 7 [link] [source]

OK, so if the format is stable, I guess some element types could use further specification/clarification.

Element types 4 and 6, number formats not covered by the JSON RFC: are those intended for JSON5 literals?

Element type 7, string literals with no escapes, could maybe use specifying the SQL special characters.

All string element types could do with clarifying: “The payload does not include string delimiters.”

(4) By anonymous on 2023-10-11 01:49:08 in reply to 1 [link] [source]

Why create a new format? What's wrong with MsgPack?

(5) By Code Hz (codehz) on 2023-10-11 02:19:30 in reply to 1 [link] [source]

JSONB is faster

This is actually surprising to me in a way, since in PostgreSQL, jsonb usually SLOWER (update/insert/select) in postgresql (maybe it is caused by conversation from/to user input, or the hashmap is not suitable for those tasks (most of json data has fixed structure, the json object is not used as HashMap))

(9) By ddevienne on 2023-10-11 09:24:09 in reply to 5 [link] [source]

You are confusing queries / extracting values from JSON document, and input/output of the whole document.

JSONB is faster than JSON in PostgreSQL for navigating the JSON document.

It's of course slightly slower than JSON on input/output,
because parses and indexes in part the document objects on input,
and re-generates the text on output (but not syntactically identical
in whitespace or keys in case of duplicates, or even order of keys).

While JSON stores the text as-is (and just checks well-formed-ness).

The PostgreSQL JSONB format is completely an internal implementation detail, not part of the API boundary.
Even the binary representation seen by clients is the re-generated text form (with an extra version byte prepended).

While SQLite's variant of JSONB can't be an implementation detail I guess.

(6) By anonymous on 2023-10-11 02:49:39 in reply to 1 [link] [source]

Looks good, although the many string types show an interesting focus. Minor typo "then only some quick top-level checks are performance."

(10) By ddevienne on 2023-10-11 09:29:19 in reply to 1 [link] [source]

When JSON was introduced, CBOR was discussed.

Since CBOR is a superset of JSON, and an actual IETF standard, why invent a new JSONB format specific to SQLite, rather than use a standard for a JSON-like binary representation of JSON?

Even perhaps restricting CBOR support to the subset that losslessly maps to JSON?

I guess maybe because of built-in indexing structures part of the format itself (obviously haven't read more details yet).

(11) By ddevienne on 2023-10-11 10:02:32 in reply to 1 [link] [source]

Now that I have read the doc, how about wasting a version byte at the start of the encoding, to allow for even more changes in the future?

You already accounted for some future expansion, via the reserved type codes, but 100% hindsight is hard, so that version byte might be a nice insurance policy against wanting to change one's mind, no?

This would be similar to PostgreSQL's binary representation of JSONB values, visible to libpq client code, which has such a similar version byte. So in the future, it could be an actual publicly documented format (and why not your newly devised format, why not! ;)).

I don't think the code mentions the format is lossy in terms of insignificant whitespace. Maybe it should be added?

It appears the code maintains perfect syntactic fidelity (except for whitespaces mentioned above),
in terms of key order, duplication, etc... By virtue of storing numerics as the original text?
But maybe not when escaping in string values happens? Again, the doc might be more explicit about this,
so exact round-tripping characteristics are known and clear.

Another idea, regarding repeating key names, a common occurrence in JSON documents,
would be some kind of back references to keys seen earlier in the document, with a back offset to that earlier value.
This could make it even smaller in terms of storage. It's a low hanging fruit for an aspiring JSONB format IMHO.

About the variations of representing the same value. Are you planning having a function to return a canonical encoding that guarantees the lowest possibly encoding length?

These are my first thoughts after going quickly through the current docs.

(12) By Wout Mertens (wmertens) on 2023-11-07 11:47:25 in reply to 11 [link] [source]

Perhaps objects can be stored like in JS VMs.

I believe that roughly the way JS VMs store objects is, each stored object is connected to a shape that tells you at which offset which value is stored for a given key.

When you reuse the same keys in the same order with the same types, you get the same shape. The shape determines where all the data is.

When you add a key to the end of an object, you can create a new shape that references back to the original shape, much like the prototype chain.

This way you're not storing copies of key names all the time.

SQLite could do something similar with an internal helper table that stores shapes, perhaps indexed by the shape's hash?

The tradeoffs are of course non-local data and garbage collection of unused shapes.

(13.1) By Vadim Goncharov (nuclight) on 2023-12-21 08:36:15 edited from 13.0 in reply to 1 [link] [source]

Strongly disagree for this NIH zoo of formats - there is IETF standard RFC for CBOR, which is designed to be stable for decades yet extensible with tags, for example, allowing even more compression for repeated keys.

I remember I was told in forumpost/4052572ca313022e that was a discussion about CBOR a long time ago - see at this thread that mentioned discussion is outdated due to those new CBOR tags. Also, entire CBOR ecosystem now exists, not only in embedded world but nowadays CBOR is present in e.g. even OpenSSH (Fido).

At the very, very minimum BLOB storage format must have a version disambiguating bytes - CBOR has an optional 3-byte 0xd9d9f7 magic number (also a tag, of course) for this purpose.

Second, a separate internal format may have some justification to exist if it offers something not possible with a generic interchange format - e.g. optimizations for speed of work which may be really useful for DB engine. However, looking at encoding format, I see that it:

has no speed up using binary values instead of ASCII ones (actual type could be just a back-convert hint instead of storage type)
has no random access-allowing layout

For example, standard record (row) header in fileformat2 first has varints with types and lengths and then actual payloads - allowing fast access by calculated offset to particular field. The draft JSONB document mentions "parse tree" for JSON, and, as in typical C implementations of trees we usually have

char *data;

member for storage outside of struct, this could be mapped to storage - first parse tree like header varints, then actual values - with added possibility to compress repeated strings "for free".

However, in it's current form proposed JSONB is just 4-and-4 bits instead of 3-and-5 bits of CBOR, offering no advantages over CBOR and less extensibility points (e.g. too many type codepoints already taken for text, not binary, versions).

(14) By Bo Lindbergh (_blgl_) on 2023-12-21 17:24:21 in reply to 13.1 [link] [source]

However, in it's current form proposed JSONB is just 4-and-4 bits instead of 3-and-5 bits of CBOR, offering no advantages over CBOR

Wrong. A JSONB element has a byte size in its header, letting you skip over it in constant time, no matter how complex it is. CBOR has no such thing, so you can't skip over an element without recursively parsing it all the way down to the leaves.

(15) By Vadim Goncharov (nuclight) on 2024-02-17 18:40:19 in reply to 14 [link] [source]

Not quite. CBOR has byte size for leaf elements. However, current JSONB format does not allow truly random access, so it does not really earn much over CBOR (and given storing integers as TEXT - probably very little in terms of performance).

(16) By Nuno Cruces (ncruces) on 2024-02-17 19:17:06 in reply to 15 [link] [source]

On JSONB each element of the hierarchy has a length, so it's possible for selectors to skip over arbitrarily complex branches.

Yes, it's not random access, but with non-uniform elements, that's hard anyway, unless you add indexes, which will consume extra space.

JSONB is a decent compromise at making most things strictly faster. There's enough room to add binary numbers to the spec.