SQLite Forum

sqlite3_bind_text16 & weird AI
Login

sqlite3_bind_text16 & weird AI

(1) By anonymous on 2020-04-14 18:05:40 [link] [source]

sqlite3_bind_text16 internally inspects the passed UTF16 string and modifies it if first bytes look like BOM:

** This routine checks for a byte-order mark at the beginning of the 
** UTF-16 string stored in *pMem. If one is present, it is removed and
** the encoding of the Mem adjusted. This routine does not do any
** byte-swapping, it just sets Mem.enc appropriately.
**
** The allocation (static, dynamic etc.) and encoding of the Mem may be
** changed by this function.
SQLITE_PRIVATE int sqlite3VdbeMemHandleBom(Mem *pMem){
...
}

Why? If I pass "FEFFsomething" I expect to see exactly that in the db and get exactly that back later.

This weird logic is not mentioned anywhere in the documentation and it's not possible to turn it off.

(2) By Larry Brasfield (LarryBrasfield) on 2020-04-14 18:55:33 in reply to 1 [link] [source]

It is unclear why you, as a SQLite user, are concerned with what a private function does.

As for your expectation, it would be better matched to the BLOB type. That is what will ensure that what you put in is exactly what you can later get out.

Once a (SQLite) is database created, its text encoding scheme is fixed. We should consider it a convenience that it is possible to present, for storage in the database's encoding, text with different encodings. The private function you found is part of a collection that allows that service, along with other coding changes that become necessary.

(3) By anonymous on 2020-04-15 12:44:57 in reply to 2 [link] [source]

It is unclear why you, as a SQLite user, are concerned with what a private function does.

Because the reality is different from my expectations, so I look into the implementation to understand what is going on and provide a better bug report than "HALP! SQLite corrupts my data".


We should consider it a convenience that it is possible to present, for storage in the database's encoding, text with different encodings.

I totally agree that the implementation should perform any necessary conversions between UTF-8, UTF-16LE, and UTF-16BE automatically when reading from the database or when writing into it.

However, why do you convert user-provided API strings if they contain particular byte sequences? It makes no sense. APIs work with the native byte order by definition.

Quoting the documentation:

The new API for SQLite 3.0 contains routines that accept text as both UTF-8 and UTF-16 in the native byte order of the host machine".

Note the "native byte order" part.

My host is Windows.

The native byte order is LE.

I pass LE const wchar_t* strings.

Why the implementation sometimes treats them as BE and corrupts my data?


it would be better matched to the BLOB type

Quoting the same article again:

SQLite is not particular about the text it receives and is more than happy to process text strings that are not normalized or even well-formed UTF-8 or UTF-16. ... the byte sequence of the text will not be modified in any way

As a SQLite user, I pass UTF-16LE string "\feff123" (FEFF 0031 0032 0033).

This is a well-formed UTF-16LE string.

SQLite sees the FEFF, thinks that it's the UTF-16LE BOM, removes the FEFF, treats the rest as UTF-16LE and converts to UTF-8 "123" (31 32 33).

The data is corrupted.


As a SQLite user, I pass UTF-16LE string "\fffe123" (FFFE 0031 0032 0033).

This is not a well-formed UTF-16LE string, but so what? The quote above explicitly says that it's ok.

SQLite sees the FFFE, thinks that it's the UTF-16BE BOM, removes the FFFE, treats the rest as UTF-16BE and converts to UTF8 "㄀㈀㌀" (E3 84 80 E3 88 80 E3 8C 80).

The data is corrupted.


As a SQLite user, I pass UTF-16LE string "\ffff" (FFFF).

This is not a well-formed UTF-16LE string, but again, so what?

SQLite converts it not to UTF8 "￿" (EF BF BF) as one might expect from the promise above, but to UTF8 "�" (EF BF BD), aka "REPLACEMENT CHARACTER".

The data is corrupted.


So, is it really "not particular" and the text will "not be modified in any way"?

(4) By Richard Hipp (drh) on 2020-04-15 13:38:10 in reply to 3 [link] [source]

Every SQLite database has a single encoding (UTF8, UTF16LE, or UTF16BE) in which it stores all text. If your database is a UTF8 database, then regardless of what the encoding is when you pass it in, that encoding will be converted into UTF8. If you want to store UTF16LE in a UTF8 database, you'll need to store your UTF16LE text as a BLOB, not as TEXT.

If you create your database as a UTF16LE database, then it will store your UTF16LE text as UTF16LE. It will still strip off the Byte Order Mark (BOM). But it won't do any other conversion. If you have a UTF16LE database, then all text will be stored as UTF16LE in that database, including any text that you try to insert as UTF8 and UTF16BE. All text is converted into UTF16LE for storage.

Why is the BOM removed?

Your original post asked why the BOM was being removed. The logic to do that BOM removal was added back in 2004 and hasn't been touched much since then, as it has just worked. So we've been scratching our heads trying to remember out why we put it there. Your most recent post might have hit upon the answer: because if we have to convert the text to UTF8, we don't want the BOM to persist.

(5) By anonymous on 2020-04-15 15:09:27 in reply to 4 [link] [source]

Every SQLite database has a single encoding (UTF8, UTF16LE, or UTF16BE) in which it stores all text. If your database is a UTF8 database, then regardless of what the encoding is when you pass it in, that encoding will be converted into UTF8. If you want to store UTF16LE in a UTF8 database, you'll need to store your UTF16LE text as a BLOB, not as TEXT.

Thanks, all that is logical, but, again, it's not the issue.

I don't want to store UTF-16LE in the database. I don't even care how exactly it is stored.

All I want is to pass a UTF-16 string in the native byte order to the UTF-16 API that accepts UTF-16 strings in the native byte order (according to the documentation) and get sensible results. If you must convert it to UTF-8 behind the curtain - so be it, I don't care - all UTF conversions are lossless by definition.

The problem is, instead of being "not particular about the text" in my UTF-16 strings (as promised) SQLite does all sorts of weird things with them:

  • it looks for the BOM and treats strings as LE or BE dynamically, depending on the BOM, instead of the static native byte order of the host, causing data corruption due to invalid conversions to UTF-8.

  • it removes the BOM, causing data loss.

  • it validates the UTF-16 strings and tries to replace ill-formed sequences with the REPLACEMENT CHARACTER, causing data corruption.

It will still strip off the Byte Order Mark (BOM). But it won't do any other conversion.

It does. Step-by-step if you don't believe me:

  1. I invoke sqlite3_bind_text16(..., zData="\xFFFE SomeText", ...).
  2. sqlite3_bind_text16 invokes bindText(..., zData, encoding=SQLITE_UTF16NATIVE). SQLITE_UTF16NATIVE = SQLITE_UTF16LE in my case.
  3. bindText invokes sqlite3VdbeMemSetStr(..., zData, ..., encoding, ...).
  4. sqlite3VdbeMemSetStr initialises Mem with the given string and then invokes sqlite3VdbeMemHandleBom.
  5. sqlite3VdbeMemHandleBom finds the UTF-16BE BOM, removes it and sets pMem->enc to SQLITE_UTF16BE.
  6. bindText invokes sqlite3VdbeChangeEncoding(pVar, desiredEnc=ENC(p->db)). ENC(p->db) is SQLITE_UTF8.
  7. sqlite3VdbeChangeEncoding invokes sqlite3VdbeMemTranslate(pMem, desiredEnc=SQLITE_UTF8).
  8. sqlite3VdbeChangeEncoding enters the "/* UTF-16 Big-endian -> UTF-8 */" block and invokes READ_UTF16BE + WRITE_UTF8 in a loop, ultimately converting the UTF-16LE string to UTF-8 as if it was UTF-16BE.

(6) By Richard Hipp (drh) on 2020-04-15 15:21:24 in reply to 5 [source]

it looks for the BOM and treats strings as LE or BE dynamically

That is the whole purpose of a BOM, isn't it? Are you saying that SQLite (or any other system for that matter) should take a string that self-identifies as UTF16BE but go ahead and treat it as if it where in native byte order (UTF16LE)? In other words, pretend the BOM doesn't exist, or that it is just another unicode character?

It don't think it is advisable to change SQLite to work that way. Even apart from the backwards-compatibility issues that such a change would provoke, it seems like the wrong behavior.

If you goal is to store something that you never want SQLite to reformat, then store that something as a BLOB, not as TEXT.

I will take the action to update the documentation to try to make these points clearer.

(8) By anonymous on 2020-04-15 16:36:43 in reply to 6 [link] [source]

That is the whole purpose of a BOM, isn't it?

It definitely is, when we don't control the encoding and don't know what is there. For example, when we open a text file from the internet in a text editor.

However, here we do control the encoding - it's in the interface. The interface is a contract and we can expect that the caller abides to that contract. The string doesn't come from an unknown host, working on unknown OS and unknown hardware with an unknown byte order. It comes from the very same process with the same byte order that doesn't change randomly in runtime between calls.

Are you saying that SQLite (or any other system for that matter) should take a string that self-identifies as UTF16BE but go ahead and treat it as if it where in native byte order (UTF16LE)?

Yes, that's what I'm saying, that's what your documentation is saying and that's what all the other systems do. Does, say, wcscat try to determine the byte order in passed wchar_ts by the BOM?

In other words, pretend the BOM doesn't exist, or that it is just another unicode character?

It totally depends on the context.

If I'm writing a text editor - of course I will inspect the BOM, determine the file encoding and won't show the BOM to the user, same as with \r, \n, \t etc.

However, if I'm implementing a low level library, I know that any incoming strings are in the native byte order by default and by definition. If a string comes in a wrong byte order or encoding - it's the caller's problem, not mine. I may provide a way to specify encoding explicitly (as you do with sqlite3_bind_text64) if there's a demand, but won't try to determine it from the string's content.

Speaking of other systems (and why I'm here at all): Windows API, for example, natively works with UTF-16LE, but doesn't validate or give any special treatment to anything. Users can create files and directories with BOMs, ill-formed surrogate pairs and reserved codepoints, whatever. The same situation in Linux with UTF-8, and, I believe, in any other OS. Obviously, any applications that use SQLite won't be able to work with such files properly.

I will take the action to update the documentation to try to make these points clearer.

Thanks. I'm not convinced that the current behavior is reasonable, but backwards compatibility is a good reason to keep the status quo. I've already solved my issue (by not using any UTF-16 SQLite APIs and converting UTF-16 strings to UTF-8 manually), hopefully a better description of the conversion logic and potential gotchas could save someone else's time in the future.

(7.1) By Larry Brasfield (LarryBrasfield) on 2020-04-15 16:16:30 edited from 7.0 in reply to 4 [link] [source]

Re Why is the BOM removed?

... The logic to do that BOM removal was added back in 2004 and hasn't been touched much since then, as it has just worked. So we've been scratching our heads trying to remember out why we put it there. Your most recent post might have hit upon the answer: because if we have to convert the text to UTF8, we don't want the BOM to persist.

It appears to me that there are more compelling reasons, and I would be surprised if they were not considered some 16 years ago, as the issue deserves very careful attention and care for a wide-ranging set of use cases.

One good reason is simply to unify the representation. It would be odd to have BOMs in some TEXT data and not in others.

Another good reason is cross-platform consistency. If the database was UTF16XE and asked to store UTF16YE (where X != Y), because of the platform where the would-be-text originated, the BOM would have to be changed or removed, whereas when X == Y, the BOM would be kept.

Storage conservation counts too, even if it is not compelling. If all TEXT has the same encoding, who needs a BOM?

If SQLite was willing to store BOMs with all TEXT data, that might be useful. But then there would be complaints such as "I gave it plain, unadorned text and see these extra bytes stored with it!"

What the OP wants is something that acts like a BLOB but with the convenience of NUL-termination. (I doubt that could work unless the encoding was known, so there would have to be more interface complication for that.)

I saw no deficiency in the docs on sqlite3_bind_text*(...). However, the whole subject of text encoding, and where recoding is needed, is a source of confusion for many. So if the SQLite docs will attempt to alleviate that, it would be better done in a separate page than in the interface specs. Those need to be succinct but complete, a challenge by itself. In that aside page (which could be linked from the interface specs on text binding), it would be useful to see stated a way of considering what problem the design solves and how to conceptualize what it does.

For myself, I deduced that TEXT means "text stored in a known encoding that I select at database creation" and that getting various text data from/into the database and into/from whatever encoding the application demands is my problem as a programmer with implementation assists from the SQLite API.

Having wrestled with encoding issues often, and come to dread their appearance, I appreciate the simplicity of SQLite's approach. Its encoding is a known, and where some other encoding is needed, on input or output, that issue can be dealt with in isolation as demanded by local circumstances. To me, that is much to be preferred over having to possibly recode at every place text is taken from or put into the database.

(9) By anonymous on 2020-10-28 07:02:21 in reply to 4 [link] [source]

Hi Richard,

I just came across this thread trying to figure out how to check if a string value is stored in UTF-8 or UTF-16, before accessing it with either sqlite3_column_text or sqlite3_column_text16, in order to avoid any conversions.

From your post I now know that all text in a database is stored in a single encoding. This is very useful information, and it could be documented more clearly, possibly in the pragma encoding section.