Base85 and base64, Yet more text/blob conversions

(1.3) By Larry Brasfield (larrybr) on 2022-11-19 03:28:34 edited from 1.2 [source]

(Edited to reflect sources going into the SQLite repo, and another converter.)

Sources for the new extensions mentioned below are in this directory. They will soon be well tested, and may undergo some improvement. However, the interfaces will be stable. A checking function for base85 might become pickier and one for base64 may appear.

After seeing Mr. Medcalf's base64 extension recently, and inspired by its automatic conversion direction in a single-argument function, I decided to resurrect some code I wrote about 38 years ago when I got frustrated having my files molested while passing through various "ASCII channels". (That was a time when uuencode and uudecode were popular.) A SQLite extension can be built from the source, base85.c, and loaded to provide a base85(x) function.

Using something like base85(someBlob), text can be written to CSV that represents the blob content without going outside of the assumptions underlying the CSV format.

Then, once that same CSV has been read in, to populate text fields of a (likely temporary) table, those fields can be made back into blobs using something like base85(someBase85).

By default, there is also a function, is_base85(allegedBase85), which returns 1 iff its text argument looks like bona fide base85 with whitespace, otherwise 0.

I do not claim that this compares well to sliced bread, but it can be used to solve certain problems that have brought some folks to this forum. It's free to use for any purpose.

Another source, base64.c, is a (now) public domain SQLite extension to create a function, base64(x), which converts between blob and base64¹ It behaves nearly as does the like-named extension Mr. Medcalf shared recently.² I make no claim that this new code is better -- only that it is free of licensing restrictions.³

^{^} This is base64 as specified in RFC 4648.
^{^} It may be slower, or faster, or more or less robust, as further testing will reveal.
^{^} It is an independent implementation, so there is no copy-left or requirement to propagate attributions, copyright messages, etc.

(2) By anonymous on 2022-11-17 12:05:01 in reply to 1.0 [link] [source]

Can't get it to download, something appears to be blocking something where I am currently at.

The closest I could get was for `page 1/1` to be shown

Will try again tomorrow, unless you have a copy somewhere else, thanks!

(3) By Larry Brasfield (larrybr) on 2022-11-17 15:01:44 in reply to 2 [link] [source]

Sorry about that. Should be fixed now with links to a different cloud service in edited post #1.1 .

(4) By Holger J (holgerj) on 2022-11-18 12:01:51 in reply to 1.1 [link] [source]

Since SQLite3 allows saving binary data to a column, it is not necessary to convert to any kind of text format.

(5) By Ryan Smith (cuz) on 2022-11-18 12:28:29 in reply to 4 [link] [source]

Sure, it's also not necessary to have a flushing toilet since you are born with one or more hands and an amazing ability to dig a hole in the ground...

But sometimes I want to have my cake and eat it, such as when exporting a column to CSV or other text format, or perhaps importing from it. Nice to have luxurious Base64 and Base85 options.

(6.1) By ddevienne on 2022-11-18 13:03:44 edited from 6.0 in reply to 5 [link] [source]

Of topic, but I like base62 myself :)

Stays pure alphanum, w/o needing other characters, which tend to vary. See all the variants of base64 that change the only two non-alphanum characters.

I prefer base62 mostly over base16 (i.e. hexa) for identifiers. That's almost 6-bits per-char over 4-bit per-char, yielding smaller idents.

E.g. a UUID is 22 chars, instead of 32 or 36 in hexa (w/o and w/ dashes). FWIW.

PS: Which BTW in most C++11 (and later) std::string implementations avoids heap-allocation.

(11) By Vadim Goncharov (nuclight) on 2023-01-29 14:30:42 in reply to 6.1 [link] [source]

But why? base62 requires harder arithmetic than base64 and gives longer strings. The wider charset is, the better, let's recall yEnc!

(12) By ddevienne on 2023-01-30 11:58:00 in reply to 11 [link] [source]

Sure, but we are talking peanuts though, since both base64 and base62 can encode millions of uuids per second. It is slower (I measured it), but not enough that it matters (to me at least).

The facts the encoding is pure alnum is the main benefit, very useful when used for file names, or RDBMS names. Ensures no special-char incompatibilities, or no quoting necessary in SQL, respectively.

But to each his/her own. I'm not forcing anyone.

(13) By Vadim Goncharov (nuclight) on 2023-03-05 15:00:25 in reply to 12 [link] [source]

Then it will be funny to know there exists base45 now, with all those problems again... https://datatracker.ietf.org/doc/html/rfc9285

(14) By Keith Medcalf (kmedcalf) on 2023-03-05 15:25:57 in reply to 13 [link] [source]

You could encode it in base256. Interestingly, the encoded and unencoded forms would be identical (on a binary computer using 8-bit bytes).

(7) By anonymous on 2022-11-20 09:42:01 in reply to 1.3 [link] [source]

I used to use base64 from cURL but will definitely switch to these, will be using both base64 and base85.

Much appreciated!!

Looked at https://en.wikipedia.org/wiki/Base85 and scrolled over uuencode, zmodem... memorable times...

Philip

(8.3) By Larry Brasfield (larrybr) on 2022-11-28 17:03:08 edited from 8.2 in reply to 1.3 [link] [source]

Just a warning: These sources are still in beta as I get them tested. This work has already found a failure of the base64 decode. I'll update this post when I am reasonably sure (after getting testing in place and the code well exercised) that these extensions are solid.

The encode/decode functions are now able to pass my 2-way translation tests with a wide variety of inputs, as of this checkin. I still intend to test a variety of malformed base{64,85}() inputs. But base{64,85}(blob) outputs are fully convertible back to the same blob.

(Added 22 Nov. 2022:) The encode/decode functions are now reasonably well tested, treating garbage in as promised in the source headers, round-tripping a variety of corner-case and random blobs, and verifying the handling of missing pad characters for base64() decode. These tests assure that base64 (per RFC 4648) is accepted and decoded correctly whether or not the extension did the encoding.

(Added 28 Nov. 2022:) These encode/decode functions are now available in a trunk check-out as ./ext/misc/base{64,85,xx}.c (3 files.) They are reasonably speedy and robust. They are not yet built into the CLI (or the library) and may never be. The basexx.c source can be used (in place, alongside base{64,85}.c), to build a single extension which contains both base64(x) and base85(x) UDFs. See its head comment for build tips.

(9) By Keith Medcalf (kmedcalf) on 2022-11-28 15:10:29 in reply to 8.2 [link] [source]

These are not going to be added in the wrong place are they? If they are going to be added by default they should be added to the core, not to the shell. There is a lot of crap that has been added in the wrong place and it is a pain to remove it from the wrong place and add it to the right place.

(10) By Larry Brasfield (larrybr) on 2022-11-28 16:05:19 in reply to 9 [link] [source]

That's an interesting question and issue, for which I will start a new thread.