How to use zipfile with archive in memory?
(1.1) By PaulK (paulclinger) on 2022-11-05 06:51:15 edited from 1.0 [link] [source]
I have a string that contains a zip archive that I'd like to pass to zipfile() to return archive content. What's the correct syntax for doing this? Using "select * from zipfile(?)"
expects a file name to be passed as a parameter, but I'd like to pass the archive content itself. The documentation seem to indicate that it should be possible ("A table-valued function, which provides read-only access to existing archives, either from the file-system or in-memory."), but how do I reference this "in-memory" location? Thank you.
(2) By Richard Hipp (drh) on 2022-11-05 12:01:46 in reply to 1.1 [link] [source]
If you bind the ? parameter using sqlite3_bind_text(), then that is interpreted as a filename, and content is ready from that file. However, if you bind using sqlite3_bind_blob() then the content of the blob as the ZIP archive itself.
More generally, if the datatype of the parameter to the zipfile table-valued-function is TEXT, then it is a filename. If the datatype is BLOB, then it is the ZIP archive content.
Suppose your ZIP archive is named 'xyz.zip'. Then the following two statements (in the CLI) give the same answer:
SELECT name, sz FROM zipfile('xyz.zip'); SELECT name, sz FROM zipfile(readfile('xyz.zip'));
The first statement accesses the ZIP archive directly, because it is provided with a TEXT filename. The second is given the BLOB value returned from readfile() which is the content of the ZIP archive.
(3) By PaulK (paulclinger) on 2022-11-05 20:59:01 in reply to 2 [source]
Thanks Richard! This works, but I was also interested if there is a function that could turn a string into a blob (somewhat similar to how zeroblob() returns a blob), so I can pass a string using regular bindings:
select name, sz from zipfile(somefunc(?)) -- and pass archive binary string as a parameter
(4) By Donal Fellows (dkfellows) on 2022-11-07 12:15:54 in reply to 3 [link] [source]
Not by default I think. You can always just add one to the connection (which would be both deterministic and innocuous, provided you don't mind the issues relating to how to handle large data).
(5) By ddevienne on 2022-11-07 12:30:59 in reply to 3 [link] [source]
I have trouble understanding your question, I'm afraid.
Using the C API, you can bind a string or a blob, so no problem, and no need to turn-string-into-a-blob.
If somehow your API (not in C I guess?) can only bind strings (surprising...), you can always cast(:1 as blob)
I guess.
Or perhaps you are not really binding, but using a textual substitution in your SQL, with a string literal?
If that's the case, you should be doing real binding :) If you persist, you can substitute blob literals too, of the form x'ab...'
in hexadecimal pairs (per-byte).
But I suspect I'm missing something, given your question.
(6) By PaulK (paulclinger) on 2022-11-08 03:56:03 in reply to 5 [link] [source]
Using the C API, you can bind a string or a blob, so no problem, and no need to turn-string-into-a-blob.
Correct, I can do that and it works for me, but I was also looking for a way to do this without blob binding, just using a regular string parameter.
If somehow your API (not in C I guess?) can only bind strings (surprising...), you can always cast(:1 as blob) I guess.
Yes, that's what I was looking for; thank you. Somehow I missed it in the CAST expressions section (https://www.sqlite.org/lang_expr.html#castexpr), as BLOB is not listed as one of the types in the Affinity column.
There is one small issue though: the casting does work, but I'm not getting any results back. Using fetch("select name, data from zipfile(cast(? as blob))", '123')
returns "cannot find end of central directory record" as expected (so it looks like the BLOB conversion works), but passing the result of "SELECT zipfile('foo', 'bar content') as archive"
instead of '123' doesn't trigger any error, but doesn't return anything either. I do get the correct result when I bind the content as a blob directly (without casting), so this is mostly a curiosity at this point. Let me know if there is any more information I can provide (this is running 3.35.5 version, which is fairly old).
(7) By Keith Medcalf (kmedcalf) on 2022-11-08 04:31:40 in reply to 6 [link] [source]
TEXT cannot contain embedded nulls. That is, the definition of "TEXT" is a sequence of non-zero words followed by a zero word -- and a word may be 1 or 2 bytes. Unless, of course, you take "special measures" to handle embeded zero words in the text -- and it sounds like you have failed to do so.
(8) By Keith Medcalf (kmedcalf) on 2022-11-08 04:55:08 in reply to 6 [link] [source]
Seems to work just peachy-keen:
sqlite> select * from zipfile((select zipfile('foo', 'bar content')));
┌───────┬───────┬────────────┬────┬───────────────────────────┬───────────────────────────┬────────┐
│ name │ mode │ mtime │ sz │ rawdata │ data │ method │
├───────┼───────┼────────────┼────┼───────────────────────────┼───────────────────────────┼────────┤
│ 'foo' │ 33188 │ 1667883111 │ 11 │ x'62617220636f6e74656e74' │ x'62617220636f6e74656e74' │ 0 │
└───────┴───────┴────────────┴────┴───────────────────────────┴───────────────────────────┴────────┘
sqlite> select cast(x'62617220636f6e74656e74' as text);
┌─────────────────────────────────────────┐
│ cast(x'62617220636f6e74656e74' as text) │
├─────────────────────────────────────────┤
│ 'bar content' │
└─────────────────────────────────────────┘
(9) By PaulK (paulclinger) on 2022-11-08 05:49:26 in reply to 7 [link] [source]
Thank you for the explanation; I suspect this may be exactly what's happening.
I wonder what these "special measures" may be though, as I do store generated changesets (which may have embedded zeros) as blobs using sqlite3_bind_text without issues, as I do provide the length for those fields and I thought that may be enough since the documentation states that "If any NUL characters occurs [sic] at byte offsets less than the value of the fourth parameter then the resulting string value will contain embedded NULs." (https://www.sqlite.org/c3ref/bind_blob.html).
In other words, do I always need to use bind_blob (instead of bind_text with explicit length) when storing bytes that may have embedded zeros in a field with type BLOB?
(10) By ddevienne on 2022-11-08 07:38:58 in reply to 9 [link] [source]
In other words, do I always need to use bind_blob
Yes and no...
Binding text values with embedded NULLs is fine, and works.
It's SQL functions operating on text that assume no embedded NULLs.
They could handle embedded NULLs fine, the same way a C++ std::string
can.
They choose not to... Possibly for historical reasons, which cannot be changed now, for BC.
SQLite values always know their sizes, be it fixed (null, integer, real), or varying (text, blob).
I'd be surprised if CAST'ing from text to blob was not supporting embedded NULLs though. I'll have to double-check.
So I'm not sure what you are doing exactly to run into that issue, to be honest.
(11.1) By ddevienne on 2022-11-08 10:21:57 edited from 11.0 in reply to 10 [link] [source]
Well, consider me surprised :)
First, there seems to be a small formatting bug in .mode box
linked to blob values.
I'm not using the latest, so maybe it's fixed with a newer version?
Update: I've built the draft "SQLite version 3.40.0 2022-11-06 17:19:34" and the same happens.
Second, CAST'ing a blob value to text stops at the first embedded NULL indeed!
I'm sure it's not a bug, yet I was NOT expecting that. CAST is not a string function in my mind.
And for what it's worth, I was ALSO NOT expecting a text-to-blob to change the bytes either.
CAST between text and blob should be idempotent on the values' bytes IMHO. Thus not affect its internal size.
This feels very wrong!!! Of well, I don't expect something as fundamental to be actually wrong, the problem is likely me :)
D:\pdgm\trunk\psc3>sqlite3
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .mode box
sqlite> create table t(v, s as (cast(v as text)), b as (cast(v as blob)));
sqlite> insert into t(v) values ('ab00cd'), (x'ab00cd');
sqlite> select v, typeof(v), length(v), s, typeof(s), length(s), hex(b), typeof(b), length(b) from t;
┌────────┬───────────┬───────────┬────────┬───────────┬───────────┬──────────────┬───────────┬───────────┐
│ v │ typeof(v) │ length(v) │ s │ typeof(s) │ length(s) │ hex(b) │ typeof(b) │ length(b) │
├────────┼───────────┼───────────┼────────┼───────────┼───────────┼──────────────┼───────────┼───────────┤
│ ab00cd │ text │ 6 │ ab00cd │ text │ 6 │ 616230306364 │ blob │ 6 │
│ ? │ blob │ 3 │ ? │ text │ 1 │ AB00CD │ blob │ 3 │
└────────┴───────────┴───────────┴────────┴───────────┴───────────┴──────────────┴───────────┴───────────┘
sqlite>
(12) By anonymous on 2022-11-08 11:58:21 in reply to 11.1 [link] [source]
Second, CAST'ing a blob value to text stops at the first embedded NULL indeed!
Nope. What your test shows is that the length
function stops at the first NUL
when given a text argument.
Try feeding this to the CLI instead:
create table t1(c blob not null); create table t2(c text not null); create table t3(c blob not null); insert into t1 values (x'616200'),(x'610062'),(x'006162'); insert into t2 select cast(c as text) from t1; insert into t3 select cast(c as blob) from t2; .headers off .mode insert t1 select * from t1; .mode insert t3 select * from t3;
I get this output:
INSERT INTO t1 VALUES(X'616200'); INSERT INTO t1 VALUES(X'610062'); INSERT INTO t1 VALUES(X'006162'); INSERT INTO t3 VALUES(X'616200'); INSERT INTO t3 VALUES(X'610062'); INSERT INTO t3 VALUES(X'006162');
In other words, x'00'
survives blob-to-text-to-blob casting just fine.
(13) By ddevienne on 2022-11-08 12:20:35 in reply to 12 [link] [source]
Thanks! I'm happy to be wrong on this one, since this is what I expected.
I wish there was a built-in sizeof
which returns the true / internal value byte size, because length
is indeed misleading for text with embedded nulls.
Not difficult to code, of course, but should be in the CORE, really.
sqlite> select typeof(1);
integer
sqlite> select sizeof(1);
Error: no such function: sizeof
(14) By PaulK (paulclinger) on 2022-11-09 04:05:22 in reply to 11.1 [link] [source]
CAST between text and blob should be idempotent on the values' bytes IMHO. Thus not affect its internal size.
That's my impression as well after reading the code (esp. applyAffinity function that explicitly states that conversion to BLOB is no-op).
So I'm not sure what you are doing exactly to run into that issue, to be honest.
Everything works for me (even with embedded NULLs), as in my Lua bindings I always know the exact length, except using "select name, data from zipfile(cast(? as blob))"
and passing zip content with sqlite3_bind_text
function (and just to be sure I'm passing the exact length as its 4th parameter). As I wrote, I don't get any error, but the select doesn't return any content either. When I use bind_blob
instead of bind_text
(with exactly the same query) I get the expected result, so there does seem to be some difference.
Thank you for the examples and the illuminating discussion!
(15) By Keith Medcalf (kmedcalf) on 2022-11-09 14:05:11 in reply to 13 [link] [source]
length is indeed misleading for text with embedded nulls
The definition of a C-Style text string is a sequence of non-zero "characters" teminated by a zero-character. Each "character" may be 1, 2, 4, 8, or 16 bytes long.
By this definition (which has been unchanged for at least half-a-century) embeded zero-characters cannot exist with a C-Style string and are the string terminator.
(16) By ddevienne on 2022-11-09 15:36:35 in reply to 15 [link] [source]
I guess you are being facetious.
I didn't write C-Style text string, I wrote text
, as in the SQLite value type.
Which clearly is a thing, and clearly has a payload size (pg_column_size()
in PostgreSQL-land)
that can be different from length()
, and which sqlite3_column_bytes()
returns.
The fact is that the SQL length()
function has different semantics for text
and blob
.
For text
, it is strlen(sqlite3_column_text(stmt, iCol))
.
While for blob
, it is sqlite3_column_bytes(stmt, iCol)
.
That's misleading to me, even though it is well documented.
I routinely use std::string
(or std::string_view
) values with embedded NULLs,
whose .length()
member returns the true length of the values, not strlen()
, FWIW.
The non-existent SQL sizeof()
I wish for would use the blob-semantic for BOTH text
or blob
values
(and for other types, the storage size in bytes, i.e. 1, 2, 4, 8 depending on the integer value itself,
exactly as specified in the record header, since those values don't use varints apparently.
Maybe payload_bytes()
is a better nane than sizeof()
then).