SQLite Forum

Comment about application file format
Login
The [document about application file format](https://sqlite.org/appfileformat.html) says that Microsoft Office format is a "fully custom" format and is not a wrapped pile of files; actually, Microsoft Office format does use a wrapped pile of files, although it isn't ZIP (it uses "Microsoft Compound" format; 7-Zip can still read such files). The format of the pieces is still a custom format though (although that is true for many "pile-of-files" and "wrapped-pile-of-files" format, too)

There are other variants, too, such as "pile-of-wrapped-pile-of-files", "SQL-of-wrapped-pile-of-files", "wrapped-pile-of-wrapped-pile-of-files", etc. Also, for some uses, text file is useful that you can edit using a text editor. There is also other structured data formats that other programs might use, such as CSV, JSON, XML, RDF, RELOAD, PostScript tokens (which is available both as a text and as a binary format; there is another message in this forum with a SQLite extension to produce output in the PostScript text format), etc.

There are also other archive formats, including SQLite, tape archive, and Hamster archive (consists of a sequence of lumps, where each lump consists of the null-terminated ASCII file name, the 32-bit PDP-endian data size, and then the data; that is the entire specification). However, ZIP has the advantage that EPUB files can be displayed in web browsers with the `jar:` URI scheme (which Firefox does, although you will have to save the file and then write the URL by yourself; it will not do that automatically).

SQLite does have many advantages (as well as a few problems).

The use of SQL queries is helpful, even to allow it to be exposed to the user too, in addition to the internal use by the program (if the program uses it, then it can be available that the end user also can use it, without needing to add another programming language too). This can be useful even if the application file format isn't SQLite, or if it is but it also uses some data which isn't, by the use of virtual tables.

The 32-bit application ID number is too short than a UUID. (Also, I am not sure how to propose adding new things to the list of application ID numbers into the `magic.txt`; I and some others wrote on this forum and mail list before and had not gotten a response about such things.)

Using separate journal files, etc doesn't seem the best way to me (since if the application file is copied, it might be copied improperly if there is a hot journal, or if the database is currently in use). I do have some ideas about how to put it in a single file by storing changed stuff on free pages, and using a new write version number but keeping read version number 1 except while a transaction is being committed; recovery will only be needed if it crashes during a commit, otherwise it is already rolled back (the new data may be present in the database, but it is on free pages, so they will be overwritten by any new data that will be written), so a `ROLLBACK` command simply unlocks the database and discards the in memory cache data, making it very fast. The disadvantage of this is that the database file might be up to twice as big (depending on how much data is changed in a single transaction) unless it is vacuumed.

The use of Unicode text is problematic I think, since it is inappropriate for many applications (no single character set is good for all applications, even though they keep insisting Unicode; a few people prefer TRON, which has advantages and disadvantages compared with Unicode). Although you can store blobs, custom collation does not work with blobs, and the built-in SQLite text functions assume that you are working with Unicode, even if you aren't. So, this requires making a mess in order to get it to work properly with anything other than UTF-8, UTF-16, or ASCII text (which is a subset of UTF-8) that lacks null characters; sometimes a blob will work, and other times, "false encoding" (i.e. treat the input as ISO-8859-1 (even though it isn't) and convert it to UTF-8 except that nulls become Unicode code point 256 instead of 0) is what I have considered (although it is messy, it is simpler than trying to convert to Unicode, which is often impossible anyways (despite what they say)). So, if you want to use TRON character encoding, PC character encoding, UTF-32, APL, etc, it doesn't do. (I know, because I have experience in programming these things.)

Also, some programming languages don't have SQLite, such as PostScript (which doesn't have good FFI anyways, unfortunately; you would have to communicate with an external program (possibly by using `printobject` or `writeobject`, which is a structured format which is fortunately is simple to implement in other programming languages (perhaps 10-20 lines of C `#define`s) even without having to parse the entire structure (it can be used in memory as is), but hardly any program expects such a format)), and probably many others that I have forgotten or am unaware of (I am glad that it is implemented in C, since I like the C programming language; its worst feature is probably the confusing syntax for types, I think). (Although, this is not really a problem with SQLite itself.)