SQLite Forum

length function for BLOB out of sync with documentation

(1) By troy (troyski) on 2020-04-11 12:20:24

The documentation for the length function at https://www.sqlite.org/lang_corefunc.html#length says that "For a blob value X, length(X) returns the number of bytes in the blob."

However, when a BLOB contains valid UTF-8 sequences, length returns the number of characters, not bytes. This means that LENGTH(column) is less than the number of actual bytes returned from SELECT. This happens even if the value in the BLOB is mostly invalid UTF-8. In that case only the valid UTF-8 sequences get converted to characters before length is calculated.

Since BLOB is intended for arbitrary data, it would seem incongruous for the length function to process it as potential Unicode.

(2) By Stephan Beal (stephan) on 2020-04-11 12:31:01 in reply to 1

Remember that the type declaration for SQLite fields is not rigid. It sounds like you may have bound/saved the data as text, then tried to treat it like a blob.
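
For illustration, roughly what that looks like in the sqlite3 shell (a made-up table, assuming a UTF-8 database):

create table t(b blob);
insert into t values ('é');        -- a text value: stored as text despite the BLOB declaration
insert into t values (x'c3a9');    -- the same two bytes stored as an actual blob
select typeof(b), length(b) from t;
-- text|1    (length counts characters)
-- blob|2    (length counts bytes)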

(3.3) By Keith Medcalf (kmedcalf) on 2020-04-11 15:40:27 edited from 3.2 in reply to 1

"However, when a BLOB contains valid UTF-8 sequences, length returns the number of characters, not bytes."

This is incorrect. Each value has a type and you can find out the type of X by using the typeof(X) function.

If typeof(X) == 'blob' then length(X) is the number of bytes in X.

If typeof(X) == 'text' then length(X) is the number of characters in X.

If typeof(X) == 'null' then length(X) is null.

If typeof(X) in ('integer', 'real') then the value is cast to text (i.e., as if printed) and the number of characters in the resulting text is returned.

The declared affinity of X does not mean that any particular instance of X is of that preferred type. Each instance of X may have a different type, which you can determine using various API methods or the typeof(X) SQL function (which returns one of the text strings 'blob', 'text', 'integer', 'real', or 'null').

It is likely that the typeof(X) is not what you believe it to be.
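
For illustration, here is roughly what the sqlite3 shell shows for each case (results as comments):

select typeof(x'0102'), length(x'0102');   -- blob|2      (bytes)
select typeof('héllo'), length('héllo');   -- text|5      (characters, not bytes)
select typeof(NULL),    length(NULL);      -- null|       (the length of NULL is NULL)
select typeof(3.25),    length(3.25);      -- real|4      ('3.25' rendered as text)
select typeof(42),      length(42);        -- integer|2   ('42' rendered as text)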

(7) By troy (troyski) on 2020-04-15 09:20:28 in reply to 3.3

My data is binary, not characters, and I had declared the column type as BLOB; it only happened to contain some valid UTF-8 sequences. When I now look at the column's values with typeof(), I indeed get 'text' as the type.
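
(For reference, this is roughly the check I ran; the table and column names are just placeholders:)

select typeof(payload), count(*) from mytable group by 1;
-- every row reports 'text', not 'blob'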

Therefore the problem lies in the INSERT/UPDATE operations. In other words, you are right: the length() function works correctly; I had made bad assumptions.

So my real problem is inserting/updating data and ensuring it's processed as a BLOB. I access the db through ODBC, since it's used by different programs (python3+pyodbc and a set of C functions which use the unixodbc driver). I would prefer to use SQL to do the INSERTs and UPDATEs, but it is not out of the question to tweak the ODBC calls (SQLPutData et al.). If I write a function which makes sure that the ODBC driver gets the data as long data (BLOB), will sqlite3 then understand it to be of type BLOB? Or is there some other tip you could give me? Of course it all depends on how libsqliteodbc and unixodbc handle BLOBs.

I totally get that this question is now getting a bit out of hand. I greatly appreciate the help.

(8.1) By Keith Medcalf (kmedcalf) on 2020-04-15 12:21:19 edited from 8.0 in reply to 7

You will have to read the documentation for the wrapper (unixodbc) and see how you pass in a "binary" object. Or perhaps easier, use the sqlite3 CLI to create a database with a table and put a blob in it, select from it, and see what you get -- I would expect that the datatype would be bidirectional.

create table x(x);
insert into x values (x'0123456789');
select typeof(x), x from x;
-- blob|#Eg   (the raw blob bytes; the 0x01 and 0x89 bytes are not printable)

For example, python2+sqlite3/python2+apsw wrapper "blobs" go back and forth as "buffer" objects. python3+sqlite3/python3+apsw "blobs" go back and forth as byte strings (or can be sent as objects conforming to the buffer protocol).

In the IBM DB2 CLI (which became ODBC) this would be one of the SQL_BINARY types, not an SQL_CHAR type (SQL_CHAR types being text). Can't help more, as I haven't actually used the DB2 CLI for a long time ...

(4) By Bill (WHS) on 2020-04-11 16:57:00 in reply to 1

When a constraint on a TEXT column specifies a length value, is this bytes or characters? We've been assuming it's always bytes... but your clarifications about size make me wonder.

(5.1) By Keith Medcalf (kmedcalf) on 2020-04-11 17:24:12 edited from 5.0 in reply to 4

You would have to give an example of what you mean by "constraint on a TEXT column specifies a length value", because there is no way in SQLite3 to do that.

Unless of course you mean a CHECK constraint such as:

create table x
(
  x text not null check (typeof(x) == 'text' and length(x) <= 15)
);

which would check that the value stored in x is indeed text and that its length (in characters) is less than or equal to 15.

Simply having something like:

create table x
(
 x text not null check (length(x) <= 15)
);

would require that any particular value stored in x has a length less than or equal to 15 (characters if what is stored is text, bytes if what is stored is a blob, and characters in the printed text representation if what is stored is an integer or real).

If you always want to know "bytes" then you would have to specify that, as in:

create table x
(
  x text not null check (length(cast(x as blob)) <= 15)
);

which would check that whatever is stored in x is less than or equal to 15 bytes long. If x were text and the database encoding were UTF-16, this would be a maximum of 7 characters.
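
A quick way to see the two counts side by side (assuming a UTF-8 database):

select length('tschüß') as characters,
       length(cast('tschüß' as blob)) as bytes;
-- 6|8   ('ü' and 'ß' each take two bytes in UTF-8)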

(6) By Bill (WHS) on 2020-04-11 23:57:31 in reply to 5.1

Thanks Keith, that's exactly what I was asking.

Nice to know I can check for both objectives, although for the most part we use these kinds of checks as a double check, so that unacceptable values are caught by the database when programming errors are made elsewhere in the system. In this case we really did want characters, but the majority of our code was written for ASCII, and as we move to UTF-8 for storage we can no longer assume one byte equals one character.

It's going to take a lot of time for the development assumptions in a code base that's over 20 years old to be fully audited, as we don't have nearly the code testing and validation you have in SQLite.