SQLite Forum

sqlite3_bind_text16 & weird AI
> Every SQLite database has a single encoding (UTF8, UTF16LE, or UTF16BE) in which it stores all text. If your database is a UTF8 database, then regardless of what the encoding is when you pass it in, that encoding will be converted into UTF8. If you want to store UTF16LE in a UTF8 database, you'll need to store your UTF16LE text as a BLOB, not as TEXT.

Thanks, all that is logical, but, again, it's not the issue.

I don't want to store UTF-16LE in the database.
I don't even care how exactly it is stored.

All I want is to pass a UTF-16 string in the native byte order to the UTF-16 API that accepts UTF-16 strings in the native byte order (according to the documentation) and get sensible results.
If you must convert it to UTF-8 behind the curtain - so be it, I don't care - [all UTF conversions are lossless by definition](http://www.unicode.org/faq/utf_bom.html#gen1).

The problem is, instead of being "not particular about the text" in my UTF-16 strings ([as promised](https://www.sqlite.org/version3.html)) SQLite does all sorts of weird things with them:

- it looks for a BOM and, if it finds one, treats the string as LE or BE *dynamically*, depending on that BOM, instead of using the static native byte order of the host. The resulting conversion to UTF-8 is invalid and corrupts the data.

- it removes the BOM, causing data loss.

- it validates the UTF-16 strings and tries to replace ill-formed sequences with the [REPLACEMENT CHARACTER](http://www.fileformat.info/info/unicode/char/fffd/index.htm), causing data corruption.
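
To show why the last point is corruption and not a harmless cleanup, here is a minimal sketch of that kind of replacement. It is not SQLite's code: `replace_unpaired_surrogates` is a hypothetical helper written only for illustration, and it works on native-order code units.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of SQLite: replaces every unpaired surrogate
   in a native-order UTF-16 buffer with U+FFFD, in place.
   Returns the number of code units that were overwritten. */
static size_t replace_unpaired_surrogates(uint16_t *s, size_t n) {
    size_t changed = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t c = s[i];
        if (c >= 0xD800 && c <= 0xDBFF) {
            /* High surrogate: fine only if a low surrogate follows. */
            if (i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
                i++;                /* well-formed pair, keep both units */
                continue;
            }
            s[i] = 0xFFFD;
            changed++;
        } else if (c >= 0xDC00 && c <= 0xDFFF) {
            /* Low surrogate with no preceding high surrogate. */
            s[i] = 0xFFFD;
            changed++;
        }
    }
    return changed;
}
```

With this sketch, the inputs `{0x0041, 0xD800, 0x0042}` and `{0x0041, 0xDC00, 0x0042}` both come out as `{0x0041, 0xFFFD, 0x0042}`, so the original code units cannot be recovered afterwards.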


> It will still strip off the Byte Order Mark (BOM). But it won't do any other conversion.

It does. Here it is step by step, if you don't believe me:

1. I invoke `sqlite3_bind_text16(..., zData="\xFFFE SomeText", ...)`.
2. `sqlite3_bind_text16` invokes `bindText(..., zData, encoding=SQLITE_UTF16NATIVE)`. `SQLITE_UTF16NATIVE = SQLITE_UTF16LE` in my case.
3. `bindText` invokes `sqlite3VdbeMemSetStr(..., zData, ..., encoding, ...)`.
4. `sqlite3VdbeMemSetStr` initialises `Mem` with the given string and then invokes `sqlite3VdbeMemHandleBom`.
5. `sqlite3VdbeMemHandleBom` finds the UTF-16BE BOM, removes it and sets `pMem->enc` to `SQLITE_UTF16BE`.
6. `bindText` invokes `sqlite3VdbeChangeEncoding(pVar, desiredEnc=ENC(p->db))`. `ENC(p->db)` is `SQLITE_UTF8`.
7. `sqlite3VdbeChangeEncoding` invokes `sqlite3VdbeMemTranslate(pMem, desiredEnc=SQLITE_UTF8)`.
8. `sqlite3VdbeMemTranslate` enters the `"/* UTF-16 Big-endian -> UTF-8 */"` block and invokes `READ_UTF16BE` + `WRITE_UTF8` in a loop, ultimately converting the UTF-16LE string to UTF-8 *as if it were UTF-16BE*.
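
The effect of steps 5 to 8 can be modelled without SQLite at all. The following is a simplified, self-contained sketch of the behaviour described in this post, not SQLite's actual code: `handle_bom`, `read_unit`, `utf16_to_utf8` and the `ENC_*` constants are hypothetical names made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for SQLite's internal encoding constants. */
enum { ENC_UTF16LE = 2, ENC_UTF16BE = 3 };

/* Step 5: a leading BOM overrides the encoding the caller declared.
   Returns the number of bytes to skip (0 or 2). */
static size_t handle_bom(const unsigned char *z, size_t n, int *enc) {
    if (n >= 2 && z[0] == 0xFE && z[1] == 0xFF) { *enc = ENC_UTF16BE; return 2; }
    if (n >= 2 && z[0] == 0xFF && z[1] == 0xFE) { *enc = ENC_UTF16LE; return 2; }
    return 0;
}

/* Read one 16-bit code unit in the given byte order. */
static uint32_t read_unit(const unsigned char *p, int enc) {
    return enc == ENC_UTF16BE ? ((uint32_t)p[0] << 8) | p[1]
                              : ((uint32_t)p[1] << 8) | p[0];
}

/* Steps 7-8: decode UTF-16 in byte order `enc` and write UTF-8 to `out`,
   which must hold at least 2*n bytes. Returns the number of bytes written. */
static size_t utf16_to_utf8(const unsigned char *z, size_t n, int enc,
                            unsigned char *out) {
    size_t i = 0, o = 0;
    while (i + 1 < n) {
        uint32_t c = read_unit(z + i, enc);
        i += 2;
        if (c >= 0xD800 && c <= 0xDBFF && i + 1 < n) {
            uint32_t lo = read_unit(z + i, enc);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {       /* valid surrogate pair */
                c = 0x10000 + ((c - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        if (c >= 0xD800 && c <= 0xDFFF) c = 0xFFFD;   /* unpaired surrogate */
        if (c < 0x80) {
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (c >> 18));
            out[o++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}
```

On a little-endian host, the code units `0xFFFE 0x0041` sit in memory as the bytes `FE FF 41 00`. Declared as `ENC_UTF16LE`, they should convert to `EF BF BE 41` (U+FFFE followed by `A`). In this model, `handle_bom` instead switches the encoding to `ENC_UTF16BE` and skips two bytes, so the remaining `41 00` is decoded as U+4100 and comes out as `E4 84 80`.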