SQLite Forum

Timeline
Login

16 forum posts by user mike.mcternan

2021-05-24
15:32 Reply: Segfault in memjrnlWrite() (artifact: 1aff495aca user: mike.mcternan)

Starting with a database that reliably segfaulted everytime, I updated and ran the following version and see the problem is fixed:

3.36.0 2021-05-24 14:35:19 17960165f5840cab45b7a8bb02779ebfb321c68f33ec6da9ab14063ccd134fa4

Many thanks for the fix and sorry I couldn't provide a small reproduction - thank you again for spending the time to figure it out, and your description on the change looks accurate to to the code I'm running too.

07:48 Reply: Proposed updates to https://sqlite.org/howtocorrupt.html (artifact: 98dd772fbd user: mike.mcternan)

Section 2.6 seems pretty clear on interaction with fork():

Do not open an SQLite database connection, then fork(), then try to use that database connection in the child process.

Perhaps code under -DSQLITE_ENABLE_API_ARMOR could use pthread_atfork() to mark as invalid the database connections in child processes.

2021-05-22
21:18 Reply: Segfault in memjrnlWrite() (artifact: c1de7e91cb user: mike.mcternan)

Apologies, the schema is quite large and there are lots of parameters in the query which I think are all mostly noise so I've not shared it. I've tried making a reduction to demonstrate / recreate the problem, but not been sucessful. The query itself is mostly in the stack dump, but I think it's unremarkable (though my assesment is fairly unqualified), though it may be relevant that I'm performing lots of inserts.

So I've been trying to bisect Fossil changes to see where things start to go wrong.

This has proved difficult as the crash doesn't always reliably happen. It's actually a bit strange, because if I run say the 3.35.0 release until it crashes, it will then crash every time I re-attempt the same transaction. Going down through Fossil revisions will show the crash until I hit a revision which doesn't crash. And then at that point, if I go back to 3.35.0 it won't immediately crash again and I have to run for 20 minutes or longer to get it back into the 'crash every time' state so I can try a new bisection point. If I backup the sqlite files (-wal and -shm) when it is in 'crash every time' mode and restore them, it will make make the 'faulty' versions crash every time, if that makes sense.

Having done this a lot, I believe this change is the culprit and the area of the change fits the area of the segfault:

https://www.sqlite.org/cgi/src/info/23ca23894af352ea

Specifically, sources here and later will segfault after some time and then segfault every time the transaction is retried:

3.35.0 2021-02-23 16:40:47 23ca23894af352ea351c9efcdd7d86b82455f4c81b6001052a6d13aa2d70alt2

Sources from here, and preceeding changes, do not segfault in these tests:

3.35.0 2021-02-23 15:53:22 20689468100aed264877111367b42837ca19e63e717fed2ebd4b20b908f13178

I am using PRAGMA journal_mode=WAL and PRAGMA temp_store=2, in case it is relevant.

2021-05-21
16:37 Reply: Segfault in memjrnlWrite() (artifact: 185d0d570a user: mike.mcternan)

From further testing, it looks like sqlite-amalgamation-3340100 doesn't crash, but sqlite-amalgamation-3350000 does segfault - in memjrnlWrite().

12:12 Edit: Segfault in memjrnlWrite() (artifact: 0b6fa95b73 user: mike.mcternan)

I'm seeing a segfault in memjrnlWrite(), which happens sometimes, but not every time, during a specific SQL sequence when running my application using sqlite-amalgamation-3350500. From what I can tell of my testing, it does not happen with sqlite-amalgamation-3320300 (testing a negative though, hence some caution).

If I build & run with asan / ubsan, this get segfault eventually picked up as an attempted NULL pointer access, but the application is otherwise clean and stable. I believe I've seen this both on a 32-bit ARMv7 build of my application, and on 64-bit x86_64.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:342
342             movl    %ecx, -4(%rdi,%rdx)
(gdb) ba
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:342
#1  0x00007f84e6ec6e5d in memjrnlWrite (pJfd=0x14cb568, zBuf=0x7ffcc7f221bc, iAmt=4, iOfst=6252500) at sqlite3.c:98184
#2  0x00007f84e6e7fd25 in sqlite3OsWrite (id=0x14cb568, pBuf=0x7ffcc7f221bc, amt=4, offset=6252500) at sqlite3.c:23339
#3  0x00007f84e6e904fa in write32bits (fd=0x14cb568, offset=6252500, val=73533) at sqlite3.c:53268
#4  0x00007f84e6e93e9a in subjournalPage (pPg=0x17bb690) at sqlite3.c:56634
#5  0x00007f84e6e93f28 in subjournalPageIfRequired (pPg=0x17bb690) at sqlite3.c:56649
#6  0x00007f84e6e95ca9 in sqlite3PagerWrite (pPg=0x17bb690) at sqlite3.c:58297
#7  0x00007f84e6ea3bfa in insertCell (pPage=0x17bb6d8, i=122, pCell=0x14fc40c "\t\003\003\003\003\242\023\020%3\001\001\001\001\b\t\b\b", sz=10, pTemp=0x0, iChild=0, pRC=0x7ffcc7f22394) at sqlite3.c:71755
#8  0x00007f84e6ea7adc in sqlite3BtreeInsert (pCur=0x1481ec0, pX=0x7ffcc7f223f0, flags=0, seekResult=0) at sqlite3.c:73862
#9  0x00007f84e6ebe993 in sqlite3VdbeExec (p=0x14e35b8) at sqlite3.c:91803
#10 0x00007f84e6eb5a68 in sqlite3Step (p=0x14e35b8) at sqlite3.c:84331
#11 0x00007f84e6eb5cbe in sqlite3_step (pStmt=0x14e35b8) at sqlite3.c:84388
#12 0x00007f84e6ef387a in sqlite3_exec (db=0x147a548, zSql=0x7ffcc7f22e60 "INSERT INTO envncellref(cellid, channelid, code) SELECT  58663, newchan.id, oldncr.code FROM  envncellref AS oldncr, envplmnchannel AS oldchan, envplmnchannel AS newchan, envplmn AS oldplmn, envplmn A"..., xCallback=0x0, pArg=0x0, pzErrMsg=0x0) at sqlite3.c:125293
...
#20 0x0000000000415c77 in main (argc=1, argv=0x7ffcc7f24618) at main.c:171
(gdb) frame 1
#1  0x00007f84e6ec6e5d in memjrnlWrite (pJfd=0x14cb568, zBuf=0x7ffcc7f221bc, iAmt=4, iOfst=6252500) at sqlite3.c:98184
98184           memcpy((u8*)p->endpoint.pChunk->zChunk + iChunkOffset, zWrite, iSpace);
(gdb) print p
$1 = (MemJournal *) 0x14cb568
(gdb) print *p
$2 = {pMethod = 0x7f84e6f58940 <MemJournalMethods>, nChunkSize = 1016, nSpill = -1, pFirst = 0x1d712f8, endpoint = {iOffset = 6252500, pChunk = 0x0}, readpoint = {iOffset = 0, pChunk = 0x0}, flags = 8222, pVfs = 0x7f84e6f5aea0 <aVfs.76>, zJournal = 0x0}
(gdb) print p->endpoint.pChunk 
$3 = (FileChunk *) 0x0

The INSERT is actually nested inside a SELECT from a temporary table, and there's some other INSERTs also happening as rows are copied and modified according to business logic. All the access is happening on the same connection and within a savepoint. There are no threads or other database connections within the segfaulting process, though there maybe other processes attempting to concurrently access the same database.

(Edit: Further testing shows the crash also happens when only one process is accessing the database and the others are stopped.)

My config is like this:

3.35.5 2021-04-19 18:32:05 1b256d97b553a9611efca188a3d995a2fff712759044ba480f9a0c9e98fae886
COMPILER=gcc-10.3.1 20210422 (Red Hat 10.3.1-1)
DEFAULT_FOREIGN_KEYS
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
HAVE_ISNAN
LIKE_DOESNT_MATCH_BLOBS
MAX_EXPR_DEPTH=0
OMIT_AUTHORIZATION
OMIT_DECLTYPE
OMIT_DEPRECATED
OMIT_LOAD_EXTENSION
OMIT_PROGRESS_CALLBACK
OMIT_SHARED_CACHE
OMIT_UTF16
REVERSE_UNORDERED_SELECTS
THREADSAFE=1

I'm very willing (hopeful even) to consider this is a bug in my application or API usage, but I'm struggling to see what could have gone wrong to cause this.

Any suggestions for things to try would be gratefully received, though I've not been able to make a simple reproduction of this case yet.

11:51 Post: Segfault in memjrnlWrite() (artifact: ce9ee17e07 user: mike.mcternan)

I'm seeing a segfault in memjrnlWrite(), which happens sometimes, but not every time, during a specific SQL sequence when running my application using sqlite-amalgamation-3350500. From what I can tell of my testing, it does not happen with sqlite-amalgamation-3320300 (testing a negative though, hence some caution).

If I build & run with asan / ubsan, this get segfault eventually picked up as an attempted NULL pointer access, but the application is otherwise clean and stable. I believe I've seen this both on a 32-bit ARMv7 build of my application, and on 64-bit x86_64.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:342
342             movl    %ecx, -4(%rdi,%rdx)
(gdb) ba
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:342
#1  0x00007f84e6ec6e5d in memjrnlWrite (pJfd=0x14cb568, zBuf=0x7ffcc7f221bc, iAmt=4, iOfst=6252500) at sqlite3.c:98184
#2  0x00007f84e6e7fd25 in sqlite3OsWrite (id=0x14cb568, pBuf=0x7ffcc7f221bc, amt=4, offset=6252500) at sqlite3.c:23339
#3  0x00007f84e6e904fa in write32bits (fd=0x14cb568, offset=6252500, val=73533) at sqlite3.c:53268
#4  0x00007f84e6e93e9a in subjournalPage (pPg=0x17bb690) at sqlite3.c:56634
#5  0x00007f84e6e93f28 in subjournalPageIfRequired (pPg=0x17bb690) at sqlite3.c:56649
#6  0x00007f84e6e95ca9 in sqlite3PagerWrite (pPg=0x17bb690) at sqlite3.c:58297
#7  0x00007f84e6ea3bfa in insertCell (pPage=0x17bb6d8, i=122, pCell=0x14fc40c "\t\003\003\003\003\242\023\020%3\001\001\001\001\b\t\b\b", sz=10, pTemp=0x0, iChild=0, pRC=0x7ffcc7f22394) at sqlite3.c:71755
#8  0x00007f84e6ea7adc in sqlite3BtreeInsert (pCur=0x1481ec0, pX=0x7ffcc7f223f0, flags=0, seekResult=0) at sqlite3.c:73862
#9  0x00007f84e6ebe993 in sqlite3VdbeExec (p=0x14e35b8) at sqlite3.c:91803
#10 0x00007f84e6eb5a68 in sqlite3Step (p=0x14e35b8) at sqlite3.c:84331
#11 0x00007f84e6eb5cbe in sqlite3_step (pStmt=0x14e35b8) at sqlite3.c:84388
#12 0x00007f84e6ef387a in sqlite3_exec (db=0x147a548, zSql=0x7ffcc7f22e60 "INSERT INTO envncellref(cellid, channelid, code) SELECT  58663, newchan.id, oldncr.code FROM  envncellref AS oldncr, envplmnchannel AS oldchan, envplmnchannel AS newchan, envplmn AS oldplmn, envplmn A"..., xCallback=0x0, pArg=0x0, pzErrMsg=0x0) at sqlite3.c:125293
...
#20 0x0000000000415c77 in main (argc=1, argv=0x7ffcc7f24618) at main.c:171
(gdb) frame 1
#1  0x00007f84e6ec6e5d in memjrnlWrite (pJfd=0x14cb568, zBuf=0x7ffcc7f221bc, iAmt=4, iOfst=6252500) at sqlite3.c:98184
98184           memcpy((u8*)p->endpoint.pChunk->zChunk + iChunkOffset, zWrite, iSpace);
(gdb) print p
$1 = (MemJournal *) 0x14cb568
(gdb) print *p
$2 = {pMethod = 0x7f84e6f58940 <MemJournalMethods>, nChunkSize = 1016, nSpill = -1, pFirst = 0x1d712f8, endpoint = {iOffset = 6252500, pChunk = 0x0}, readpoint = {iOffset = 0, pChunk = 0x0}, flags = 8222, pVfs = 0x7f84e6f5aea0 <aVfs.76>, zJournal = 0x0}
(gdb) print p->endpoint.pChunk 
$3 = (FileChunk *) 0x0

The INSERT is actually nested inside a SELECT from a temporary table, and there's some other INSERTs also happening as rows are copied and modified according to business logic. All the access is happening on the same connection and within a savepoint. There are no threads or other database connections within the segfaulting process, though there maybe other processes attempting to concurrently access the same database.

My config is like this:

3.35.5 2021-04-19 18:32:05 1b256d97b553a9611efca188a3d995a2fff712759044ba480f9a0c9e98fae886
COMPILER=gcc-10.3.1 20210422 (Red Hat 10.3.1-1)
DEFAULT_FOREIGN_KEYS
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
HAVE_ISNAN
LIKE_DOESNT_MATCH_BLOBS
MAX_EXPR_DEPTH=0
OMIT_AUTHORIZATION
OMIT_DECLTYPE
OMIT_DEPRECATED
OMIT_LOAD_EXTENSION
OMIT_PROGRESS_CALLBACK
OMIT_SHARED_CACHE
OMIT_UTF16
REVERSE_UNORDERED_SELECTS
THREADSAFE=1

I'm very willing (hopeful even) to consider this is a bug in my application or API usage, but I'm struggling to see what could have gone wrong to cause this.

Any suggestions for things to try would be gratefully received, though I've not been able to make a simple reproduction of this case yet.

2020-09-28
05:32 Reply: Nesting of statements (artifact: 1b17a447b7 user: mike.mcternan)

Interleaving the row fetching from a SELECT with concomitant UPDATE involving common tables in both is fraught with peril.

The proper solution is to not do that.

It would be nice if the sqlite manual said this somewhere, and if it could reliably detect and report API misuse in such cases (when SQLITE_ENABLE_API_ARMOR is set).

05:24 Reply: Nesting of statements (artifact: 07f6e9beac user: mike.mcternan)

Lots of if/else stuff, looking up some rules and limits in config files and then calling into some libraries.

I have used sqlite3_create_function() in the past for some stuff, but in this case SQL doesn't seem like the best language for the particular calculation/problem.

2020-09-25
16:30 Reply: Nesting of statements (artifact: b70c0333d2 user: mike.mcternan)

Using a temporary table of Ids is a pattern I've used before in this type of situation. It works well when the data set isn't too large - it's a good solution, thank you.

Alternatively using a temporary table for the calculated results is a different way I've done similar.

I guess what I was hoping for though were some rules on if UPDATEs during a SELECT are ever deterministic and safe or not. For example, adding an ORDER BY clause on the select could guarantee the output order with respect to some index, and possibly make things more predictable...

What I'm hearing though is that nesting UPDATES to a table while in a SELECT is unpredictably unsafe - though most of the time sqlite doesn't seem to notice and just skips rows or returns a subset. I'm not sure if it was me or sqlite, but I may have once got into an infinite loop too... :-/

Anyway, this seems like a bit of a trap in an otherwise impeccable bit of software. It would be nice if there were either documentation highlighting the weakness inherent in such a use case, or it was consistently trapped as API misuse.

I believe other databases handle such a use case as one may expect, but then they have MVCC or lots of other fancy stuff - the cost of which I don't really want in my embedded applications for which sqlite is perfect!

Thank you all!

16:20 Reply: Nesting of statements (artifact: 7b90b4d631 user: mike.mcternan)

Correct, it's one of my helper functions like:

sqlite3_stmt *sqlite3_preparef_v2(sqlite3 *db, const char *fmt, ...) __attribute__ ((sentinel));
bool          sqlite3_queryf(sqlite3 *db, const char *fmt, ...) __attribute__ ((sentinel)); 

They take multiple arguments and bind the parameters using a printf-style format string. It's super convenient when you only need to bind parameters once, and check all the return codes. The queryf function is handy for updates or deletes.

16:14 Reply: Nesting of statements (artifact: b6c9570cb4 user: mike.mcternan)

Sorry - yes it is just example code to illustrate the nesting of an update within a select, which for reasons of the application logic, is hard to avoid.

You are right about there always being a transaction.

08:54 Post: Nesting of statements (artifact: 7ba40bf60b user: mike.mcternan)

I'm looking for rules on nesting SQL statements at the C API level and can't find anything definitive on this site.

Here's an example of the sort of think I'm trying to do:

sqlite3_prepare_v2(db, "SELECT id FROM table", -1, ps, NULL);

while(sqlite3_step(ps) == SQLITE_ROW)
{
  int id = sqlite3_column_int(ps, 0);

  sqlite3_execf(db, "UPDATE table SET x=1 WHERE id=?", id);
}

sqlite3_finalize(db);

This example is very simple and contrived - it's easily converted into a single SQL statement which is better in every way. However, in the real cases there is some complex business logic between the select and the update with features not easily replicated in SQL.

If the above isn't in a transaction, it looks like we get API misuse returned when trying the update. If in a transaction, it is okay (except the risk of SQLITE_BUSY due to upgrade from reader to writer - using an immediate transaction avoids that).

So I think I've worked out the rules, but it would nice to have a canonical reference in case things change in the future!

2020-05-31
03:03 Reply: warning: function may return address of local variable -Wreturn-local-addr (artifact: 4b3c201b25 user: mike.mcternan)

Okay, last one on this topic for me, but I'm getting better mileage with -Os and the fix:

gcc version 10.1.1 20200507 (Red Hat 10.1.1-1) (GCC) -Os

text data bss dec hex filename
530810 8448 1784 541042 84172 sqlite-amalgamation-3310100-fixed.o
530826 8448 1784 541058 84182 sqlite-amalgamation-3310100.o
534732 8512 1784 545028 85104 sqlite-amalgamation-3320100-fixed.o
534748 8512 1784 545044 85114 sqlite-amalgamation-3320100.o

Thank you for your work on this, and sqlite as a whole.

2020-05-30
07:32 Reply: warning: function may return address of local variable -Wreturn-local-addr (artifact: bec41351e7 user: mike.mcternan)

I see this made it back to trunk already, but just as an addendum, I roughly checked the overhead of the fix, using gcc (GCC) 10.1.1 20200507 (Red Hat 10.1.1-1) at -O2:

text data bss dec hex filename
798015 8448 1784 808247 c5537 sqlite-amalgamation-3310100-fixed.o
798015 8448 1784 808247 c5537 sqlite-amalgamation-3310100.o
802126 8512 1784 812422 c6586 sqlite-amalgamation-3320100-fixed.o
802126 8512 1784 812422 c6586 sqlite-amalgamation-3320100.o

It doesn't look any different with out without the workaround applied to the last pair of releases. Phew!

2020-05-26
06:29 Post: test_vfstrace.c missing wrappers for xFetch() and xUnfetch() (artifact: 41744e64c1 user: mike.mcternan)

Really minor point for a rainy day, but test_vfstrace.c is missing wrappers for xFetch and xUnfetch.

When using this in WAL mode, it caused a segfault as these functions could be called (xUnfetch in my test), but are not initialised in the sqlite3_io_methods structure.

06:20 Reply: warning: function may return address of local variable -Wreturn-local-addr (artifact: e39020f82d user: mike.mcternan)

Your work-around works for me too.

I see this with gcc 10.1, as shipped with Fedora 32, when compiling at -O2 and above:

$ gcc -v Using built-in specs. COLLECT_GCC=/bin/gcc COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-redhat-linux/10/lto-wrapper OFFLOAD_TARGET_NAMES=nvptx-none OFFLOAD_TARGET_DEFAULT=1 Target: x86_64-redhat-linux Configured with: ../configure --enable-bootstrap --enable-languages=c,c++,fortran,objc,obj-c++,ada,go,d,lto --prefix=/usr --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=http://bugzilla.redhat.com/bugzilla --enable-shared --enable-threads=posix --enable-checking=release --enable-multilib --with-system-zlib --enable-__cxa_atexit --disable-libunwind-exceptions --enable-gnu-unique-object --enable-linker-build-id --with-gcc-major-version-only --with-linker-hash-style=gnu --enable-plugin --enable-initfini-array --with-isl --enable-offload-targets=nvptx-none --without-cuda-driver --enable-gnu-indirect-function --enable-cet --with-tune=generic --with-arch_32=i686 --build=x86_64-redhat-linux Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 10.1.1 20200507 (Red Hat 10.1.1-1) (GCC) $ $ gcc -O2 -c sqlite3.c sqlite3.c: In function ‘sqlite3SelectNew’: sqlite3.c:128048:10: warning: function may return address of local variable [-Wreturn-local-addr] 128048 | return pNew; | ^~~~ sqlite3.c:128008:10: note: declared here 128008 | Select standin; | ^~~~~~~

This was with sqlite-amalgamation-3310100, but fixed by your work-around at -O2 and -Wall -O3.

Note that I did try other workarounds, but couldn't get any to work.

Specifically I think the problem in the original code is that gcc can't track pParse->db->mallocFailed, either having to assume it could be aliased or modified by intermediate function calls.

Replacing the condition above the call to clearSelect() with if ( pNew==&standin || pParse->db->mallocFailed ) should be robust (and cheap), but still doesn't fix the warning .

I'm not sure how much overhead your workaround adds (it looks fairly cheap to my eyes, but I am not a compiler!) but I'd guess the warning will become a lot more prevalent now GCC 10 is shipping with distros.