SQLite Forum

sqlite3_bind_text16 & weird AI
Login
Re Why is the BOM removed?

> ... The logic to do that BOM removal was added back in 2004 and hasn't been touched much since then, as it has just worked. So we've been scratching our heads trying to remember out why we put it there. Your most recent post might have hit upon the answer: because if we have to convert the text to UTF8, we don't want the BOM to persist.

It appears to me that there are more compelling reasons, and I would be surprised if they were not considered some 16 years ago, as the issue deserves very careful attention and care for a wide-ranging set of use cases.

One good reason is simply to unify the representation.  It would be odd to have BOMs in some TEXT data and not in others.

Another good reason is cross-platform consistency.  If the database was UTF16XE and asked to store UTF16YE (where X != Y), because of the platform where the would-be-text originated, the BOM would have to be changed or removed, whereas when X == Y, the BOM would be kept.

Storage conservation counts too, even if it is not compelling. If all TEXT has the same encoding, who needs a BOM?

If SQLite was willing to store BOMs with all TEXT data, that might be useful. But then there would be complaints such as "I gave it plain, unadorned text and see these extra bytes stored with it!"

What the OP wants is something that acts like a BLOB but with the convenience of NUL-termination.  (I doubt that could work unless the encoding was known, so there would have to be more interface complication for that.)

I saw no deficiency in the docs on sqlite3\_bind\_text*(...). However, the whole subject of text encoding, and where recoding is needed, is a source of confusion for many. So if the SQLite docs will attempt to alleviate that, it would be better done in a separate page than in the interface specs. Those need to be succinct but complete, a challenge by itself. In that aside page (which could be linked from the interface specs on text binding), it would be useful to see stated a way of considering what problem the design solves and how to conceptualize what it does.

For myself, I deduced that TEXT means "text stored in a known encoding that I select at database creation" and that getting various text data from/into the database and into/from whatever encoding the application demands is my problem as a programmer with implementation assists from the SQLite API.

Having wrestled with encoding issues often, and come to dread their appearance, I appreciate the simplicity of SQLite's approach.  Its encoding is a known, and where some other encoding is needed, on input or output, that issue can be dealt with in isolation as demanded by local circumstances.  To me, that is much to be preferred over having to possibly recode at every place text is taken from or put into the database.