SQLite Forum

sqlite3_bind_text16 & weird AI
> That is the whole purpose of a BOM, isn't it?

It definitely is, when we don't control the encoding and don't know what is there. For example, when we open a text file from the internet in a text editor.

However, here we do control the encoding - it's in the interface. The interface is a contract, and we can expect the caller to abide by that contract. The string doesn't come from an unknown host running an unknown OS on unknown hardware with an unknown byte order. It comes from the very same process, with the same byte order, which doesn't change randomly at runtime between calls.


> Are you saying that SQLite (or any other system for that matter) should take a string that self-identifies as UTF16BE but go ahead and treat it as if it were in native byte order (UTF16LE)?

Yes, that's what I'm saying, that's what your documentation is saying and that's what all the other systems do. Does, say, [wcscat](https://en.cppreference.com/w/c/string/wide/wcscat) try to determine the byte order in passed `wchar_t`s by the BOM?
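
To illustrate the `wcscat` point, here is a minimal sketch. The helper name `append_and_measure` is made up for this post; only `wcscat` and `wcslen` are real library calls. It shows that U+FEFF in the source string is copied and counted like any other wide character, with no byte-order guessing.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper (not a standard function): appends src to dst with
 * wcscat and returns the resulting length in wide characters.
 * dst must be large enough to hold the concatenated string. */
static size_t append_and_measure(wchar_t *dst, const wchar_t *src)
{
    assert(dst != NULL && src != NULL);
    wcscat(dst, src);   /* copies every wchar_t verbatim, U+FEFF included */
    return wcslen(dst);
}
```

Appending a source string that starts with `0xFEFF` simply makes the destination one character longer; the BOM is data, not metadata.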

> In other words, pretend the BOM doesn't exist, or that it is just another unicode character?

It totally depends on the context.

If I'm writing a text editor - of course I will inspect the BOM, determine the file encoding and won't show the BOM to the user, same as with `\r`, `\n`, `\t` etc.

However, if I'm implementing a low-level library, I know that any incoming strings are in the native byte order by default and by definition. If a string comes in the wrong byte order or encoding - it's the caller's problem, not mine. I may provide a way to specify the encoding explicitly (as you do with `sqlite3_bind_text64`) if there's demand, but I won't try to determine it from the string's content.


Speaking of other systems (and why I'm here at all): the Windows API, for example, natively works with UTF-16LE, but doesn't validate or give any special treatment to anything. Users can create files and directories whose names contain BOMs, ill-formed surrogate pairs and reserved code points, whatever. The situation is the same on Linux with UTF-8 and, I believe, on any other OS.
Obviously, any application that uses SQLite won't be able to work with such files properly.


> I will take the action to update the documentation to try to make these points clearer.

Thanks. I'm not convinced that the current behavior is reasonable, but backwards compatibility is a good reason to keep the status quo. I've already solved my issue (by not using any of the UTF-16 SQLite APIs and converting UTF-16 strings to UTF-8 manually). Hopefully a better description of the conversion logic and potential gotchas will save someone else's time in the future.
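
For anyone taking the same route, here is a rough sketch of such a manual conversion. It is not my actual code and not anything from SQLite: the function name `utf16_to_utf8` and its signature are assumptions made for illustration. It treats the input as native-byte-order UTF-16, encodes U+FEFF like any other character, and (as one possible policy, which is lossy) replaces unpaired surrogates with U+FFFD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: converts n UTF-16 code units (native byte order)
 * to UTF-8. Returns the number of bytes the full result needs and writes
 * at most cap of them to out (no terminator). Pass out = NULL, cap = 0
 * to measure only. */
static size_t utf16_to_utf8(const uint16_t *in, size_t n,
                            unsigned char *out, size_t cap)
{
    size_t len = 0;
    assert(in != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < n &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            /* Well-formed surrogate pair: combine into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00u);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; /* unpaired surrogate: assumed replacement policy */
        }
        /* No BOM check anywhere: U+FEFF falls through like any other cp. */
        unsigned char buf[4];
        size_t k;
        if (cp < 0x80) {
            buf[0] = (unsigned char)cp;
            k = 1;
        } else if (cp < 0x800) {
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            k = 2;
        } else if (cp < 0x10000) {
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            k = 3;
        } else {
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            k = 4;
        }
        for (size_t j = 0; j < k; j++) {
            if (out != NULL && len + j < cap) out[len + j] = buf[j];
        }
        len += k;
    }
    return len;
}
```

The resulting bytes can then be passed to the UTF-8 binding functions with an explicit length. Note that the U+FFFD substitution means names with ill-formed surrogates do not round-trip; an application that needs that would have to store the raw code units another way, for example as a blob.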