SQLite Forum

Suggestion to support gzipped text in vsv extension
Login

Suggestion to support gzipped text in vsv extension

(1) By Víctor (wzrlpy) on 2021-09-19 18:20:49 [link] [source]

Hello Keith,

Since SQLite has added zlib to its usual dependencies, I thought worth supporting gzipped text in your vsv extension.

It ended up being very simple:

--- ../KeithMedcalf/vsv.c       2021-07-07 17:44:36.000000000 +0200
+++ ./vsv.c     2021-09-19 20:13:32.281978471 +0200
@@ -198,6 +198,9 @@
 **    CREATE VIRTUAL TABLE temp.csv USING csv(filename=FILENAME);
 **    SELECT * FROM csv;
 **
+** The file content can be plain text, or text compressed with gzip (if this
+** code was compiled with "-DSQLITE_HAVE_ZLIB -lz").
+**
 ** The columns are named "c1", "c2", "c3", ... by default.  Or the
 ** application can define its own CREATE TABLE statement using the
 ** schema= parameter, like this:
@@ -226,6 +229,15 @@
 #include <stdio.h>
 #include <math.h>

+#ifdef SQLITE_HAVE_ZLIB
+#include <zlib.h>
+#define fopen  gzopen
+#define fclose gzclose
+#define fread  gzfread
+#define fseek  gzseek
+#define ftell  gztell
+#endif
+
 #ifndef SQLITE_OMIT_VIRTUALTABLE

 /*
@@ -257,7 +269,11 @@
 typedef struct VsvReader VsvReader;
 struct VsvReader
 {
+#ifdef SQLITE_HAVE_ZLIB
+    gzFile in;             /* Read the VSV text from this compressed input stream */
+#else
     FILE *in;              /* Read the VSV text from this input stream */
+#endif
     char *z;               /* Accumulated text for a field */
     int n;                 /* Number of bytes in z */
     int nAlloc;            /* Space allocated for z[] */

Would you please consider incorporating this patch (or variations of it) into your code? I of course give this patch a public domain license.

Thanks in advance, Víctor

(2) By Keith Medcalf (kmedcalf) on 2021-09-20 02:02:52 in reply to 1 [link] [source]

Added the patch and it appears to work correctly both when "in-line" and when used as a loadable extension. When building as a loadable extension you have to define SQLITE_HAVE_ZLIB and add the zlib library when linking.

The latest code is here: http://www.dessus.com/files/vsv.c
and the whole bunch are: http://www.dessus.com/files/sqlite3extensions.zip

(3) By Larry Brasfield (larrybr) on 2021-09-20 02:18:30 in reply to 1 [link] [source]

Victor, or Keith:

Could one of you please briefly describe, in black-box user's terms, what this enhancement to the vsv extension does?

I get from the posted patch that the input stream can be compressed. What I do not see without digging a lot further is whether ordinary uncompressed delimited files can still be used. (Does this depend on a magic header?) It is also unclear what compression utility produces the now-usable compressed input streams.

Given the ubiquity of delimited text data representation, this looks likely to be useful to others here if only they can see what it does.

Thanks.

(4) By Keith Medcalf (kmedcalf) on 2021-09-20 02:30:36 in reply to 3 [link] [source]

This change allows the vsv.c extension to read EITHER a plain-text file, OR a gzipped plain-text file as the filename= parameter automagically.

If you have a file called THEFILE.CSV then you can gzip that file using the command gzip THEFILE.CSV which will create a file called THEFILE.CSV.gz

See https://en.wikipedia.org/wiki/Gzip

For a file that is about 32 MB, the GZipped stream is about 5 MB.

(5) By Víctor (wzrlpy) on 2021-09-20 08:54:00 in reply to 4 [link] [source]

Sorry for being so succint :-) And thanks Keith for your kind clarification, and applying the patch.

I only entered a brief description as a comment in the patch:

+** The file content can be plain text, or text compressed with gzip (if this
+** code was compiled with "-DSQLITE_HAVE_ZLIB -lz").

Definitively the word "automagically" would have helped in the comment.

The gzopen() function detects the magic signature of gzip compression (1f 8b), at the beginning of the file, and later the other gz* functions in zlib know if they need to operate either on a compressed or on an uncompressed file. For our vsv.c purposes, the file content is always text (compressed or not), but in fact the gz* functions work as well on binary content.

(6) By anonymous on 2021-09-21 00:27:05 in reply to 5 [source]

Thanks for bringing this topic up. I have also tested it statically linked and everything works as expected.