Comment about application file format

(1) By anonymous on 2021-08-16 16:44:02 [link] [source]

The document about application file format says that Microsoft Office format is a "fully custom" format and is not a wrapped pile of files; actually, Microsoft Office format does use a wrapped pile of files, although it isn't ZIP (it uses "Microsoft Compound" format; 7-Zip can still read such files). The format of the pieces is still a custom format though (although that is true for many "pile-of-files" and "wrapped-pile-of-files" format, too)

There are other variants, too, such as "pile-of-wrapped-pile-of-files", "SQL-of-wrapped-pile-of-files", "wrapped-pile-of-wrapped-pile-of-files", etc. Also, for some uses, text file is useful that you can edit using a text editor. There is also other structured data formats that other programs might use, such as CSV, JSON, XML, RDF, RELOAD, PostScript tokens (which is available both as a text and as a binary format; there is another message in this forum with a SQLite extension to produce output in the PostScript text format), etc.

There are also other archive formats, including SQLite, tape archive, and Hamster archive (consists of a sequence of lumps, where each lump consists of the null-terminated ASCII file name, the 32-bit PDP-endian data size, and then the data; that is the entire specification). However, ZIP has the advantage that EPUB files can be displayed in web browsers with the jar: URI scheme (which Firefox does, although you will have to save the file and then write the URL by yourself; it will not do that automatically).

SQLite does have many advantages (as well as a few problems).

The use of SQL queries is helpful, even to allow it to be exposed to the user too, in addition to the internal use by the program (if the program uses it, then it can be available that the end user also can use it, without needing to add another programming language too). This can be useful even if the application file format isn't SQLite, or if it is but it also uses some data which isn't, by the use of virtual tables.

The 32-bit application ID number is too short than a UUID. (Also, I am not sure how to propose adding new things to the list of application ID numbers into the magic.txt; I and some others wrote on this forum and mail list before and had not gotten a response about such things.)

Using separate journal files, etc doesn't seem the best way to me (since if the application file is copied, it might be copied improperly if there is a hot journal, or if the database is currently in use). I do have some ideas about how to put it in a single file by storing changed stuff on free pages, and using a new write version number but keeping read version number 1 except while a transaction is being committed; recovery will only be needed if it crashes during a commit, otherwise it is already rolled back (the new data may be present in the database, but it is on free pages, so they will be overwritten by any new data that will be written), so a ROLLBACK command simply unlocks the database and discards the in memory cache data, making it very fast. The disadvantage of this is that the database file might be up to twice as big (depending on how much data is changed in a single transaction) unless it is vacuumed.

The use of Unicode text is problematic I think, since it is inappropriate for many applications (no single character set is good for all applications, even though they keep insisting Unicode; a few people prefer TRON, which has advantages and disadvantages compared with Unicode). Although you can store blobs, custom collation does not work with blobs, and the built-in SQLite text functions assume that you are working with Unicode, even if you aren't. So, this requires making a mess in order to get it to work properly with anything other than UTF-8, UTF-16, or ASCII text (which is a subset of UTF-8) that lacks null characters; sometimes a blob will work, and other times, "false encoding" (i.e. treat the input as ISO-8859-1 (even though it isn't) and convert it to UTF-8 except that nulls become Unicode code point 256 instead of 0) is what I have considered (although it is messy, it is simpler than trying to convert to Unicode, which is often impossible anyways (despite what they say)). So, if you want to use TRON character encoding, PC character encoding, UTF-32, APL, etc, it doesn't do. (I know, because I have experience in programming these things.)

Also, some programming languages don't have SQLite, such as PostScript (which doesn't have good FFI anyways, unfortunately; you would have to communicate with an external program (possibly by using printobject or writeobject, which is a structured format which is fortunately is simple to implement in other programming languages (perhaps 10-20 lines of C #defines) even without having to parse the entire structure (it can be used in memory as is), but hardly any program expects such a format)), and probably many others that I have forgotten or am unaware of (I am glad that it is implemented in C, since I like the C programming language; its worst feature is probably the confusing syntax for types, I think). (Although, this is not really a problem with SQLite itself.)

(2) By Warren Young (wyoung) on 2021-08-16 19:58:49 in reply to 1 [link] [source]

Microsoft Office format does use a wrapped pile of files

Perhaps drh was thinking of the pre-OOXML formats. It does say "DOC" rather than "DOCX" and so forth. That interpretation is somewhat untenable, though, given that the referenced document only goes back to 2014.

it isn't ZIP

OOXML most definitely is ZIP-based.

I just created a PPTX file in PowerPoint 2019, and "unzip -t" reads the resulting document successfully. From my reading about this CFBF format, a file in that format shouldn't pass this test. CFBF is reportedly the basis of the MSI format, for example, and the unzip test doesn't pass on an MSI file I created recently.

How do I get Office to produce a CFBF-based file, one that fails this unzip test?

The relevance, of course, is toward the referenced SQLite doc: until we know what it takes to produce a CFBF based doc, it doesn't seem useful to be talking about it in that doc, except perhaps in reference to MSI.

I am not sure how to propose adding new things to the list of application ID numbers into the magic.txt

It depends on which implementation of file(1) that comes from, but statistically, it's likely to be this one. You may find the core of this project referenced as "libmagic", since it has wrappers other than file(1).

Using separate journal files, etc doesn't seem the best way to me

Those are only created when the DB is in WAL mode. Solution: don't use WAL mode.

a few people prefer TRON

I hadn't even heard of TRON encoding before today.

Who are these "few people" that they have any hope of countering the three decades of inertia behind Unicode?

At minimum, I think it'd take a supremely popular new OS that used TRON by default to wag this dog, this late in the game.

PostScript…doesn't have good FFI…not really a problem with SQLite itself

Yes, so why is it any objection here?

Never mind the specifics: you seem to be using these extremely niche cases to argue against SQLite as an application format. So fine, it doesn't work in those niches. How does that argue the broader point?

I once participated in an Internet flame thread where someone tried to argue that the adage "there's always someone worse off than you" was wrong by pointing out that you could be the worst-off person in the world. And yes, for one in 7.5 billion people, that's true, but to take that position, you're arguing a one-in-7.5-billion-against position.

Is the point here to be "technically right", even if that means being wrong in almost every practical case?

(3) By RandomCoder on 2021-08-16 20:19:24 in reply to 2 [source]

How do I get Office to produce a CFBF-based file, one that fails this unzip test?

You can save in one of the older formats, for instance in PowerPoint, go to Save As, click Browse, and select PowerPoint 97-2003 Presentation (*.ppt) in the type selection dropdown in the file dialog.

I've seen this format used in some professional video production pipelines as well. Still, they have very specific requirements there, so it makes a little sense.

(4) By Warren Young (wyoung) on 2021-08-16 21:10:13 in reply to 3 [link] [source]

Are you saying that CFBF isn't a post-OOXML format, it's what preceded the ZIP-based OOXML formats?

(5) By RandomCoder on 2021-08-16 21:16:39 in reply to 4 [link] [source]

Correct. The history of it continues on for other use cases, but at least as far Office is concerned (excluding Access, I think), it's no longer used as the primary format.