Segfault in memjrnlWrite()

(1.1) By Mike (mike.mcternan) on 2021-05-21 12:12:52 edited from 1.0

I'm seeing a segfault in memjrnlWrite(), which happens sometimes, but not every time, during a specific SQL sequence when running my application using sqlite-amalgamation-3350500. As far as I can tell from my testing, it does not happen with sqlite-amalgamation-3320300 (though that is testing a negative, hence some caution).

If I build & run with asan / ubsan, this segfault eventually gets picked up as an attempted NULL pointer access, but the application is otherwise clean and stable. I believe I've seen this both on a 32-bit ARMv7 build of my application, and on 64-bit x86_64.

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:342
342             movl    %ecx, -4(%rdi,%rdx)
(gdb) ba
#0  __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:342
#1  0x00007f84e6ec6e5d in memjrnlWrite (pJfd=0x14cb568, zBuf=0x7ffcc7f221bc, iAmt=4, iOfst=6252500) at sqlite3.c:98184
#2  0x00007f84e6e7fd25 in sqlite3OsWrite (id=0x14cb568, pBuf=0x7ffcc7f221bc, amt=4, offset=6252500) at sqlite3.c:23339
#3  0x00007f84e6e904fa in write32bits (fd=0x14cb568, offset=6252500, val=73533) at sqlite3.c:53268
#4  0x00007f84e6e93e9a in subjournalPage (pPg=0x17bb690) at sqlite3.c:56634
#5  0x00007f84e6e93f28 in subjournalPageIfRequired (pPg=0x17bb690) at sqlite3.c:56649
#6  0x00007f84e6e95ca9 in sqlite3PagerWrite (pPg=0x17bb690) at sqlite3.c:58297
#7  0x00007f84e6ea3bfa in insertCell (pPage=0x17bb6d8, i=122, pCell=0x14fc40c "\t\003\003\003\003\242\023\020%3\001\001\001\001\b\t\b\b", sz=10, pTemp=0x0, iChild=0, pRC=0x7ffcc7f22394) at sqlite3.c:71755
#8  0x00007f84e6ea7adc in sqlite3BtreeInsert (pCur=0x1481ec0, pX=0x7ffcc7f223f0, flags=0, seekResult=0) at sqlite3.c:73862
#9  0x00007f84e6ebe993 in sqlite3VdbeExec (p=0x14e35b8) at sqlite3.c:91803
#10 0x00007f84e6eb5a68 in sqlite3Step (p=0x14e35b8) at sqlite3.c:84331
#11 0x00007f84e6eb5cbe in sqlite3_step (pStmt=0x14e35b8) at sqlite3.c:84388
#12 0x00007f84e6ef387a in sqlite3_exec (db=0x147a548, zSql=0x7ffcc7f22e60 "INSERT INTO envncellref(cellid, channelid, code) SELECT  58663, newchan.id, oldncr.code FROM  envncellref AS oldncr, envplmnchannel AS oldchan, envplmnchannel AS newchan, envplmn AS oldplmn, envplmn A"..., xCallback=0x0, pArg=0x0, pzErrMsg=0x0) at sqlite3.c:125293
...
#20 0x0000000000415c77 in main (argc=1, argv=0x7ffcc7f24618) at main.c:171
(gdb) frame 1
#1  0x00007f84e6ec6e5d in memjrnlWrite (pJfd=0x14cb568, zBuf=0x7ffcc7f221bc, iAmt=4, iOfst=6252500) at sqlite3.c:98184
98184           memcpy((u8*)p->endpoint.pChunk->zChunk + iChunkOffset, zWrite, iSpace);
(gdb) print p
$1 = (MemJournal *) 0x14cb568
(gdb) print *p
$2 = {pMethod = 0x7f84e6f58940 <MemJournalMethods>, nChunkSize = 1016, nSpill = -1, pFirst = 0x1d712f8, endpoint = {iOffset = 6252500, pChunk = 0x0}, readpoint = {iOffset = 0, pChunk = 0x0}, flags = 8222, pVfs = 0x7f84e6f5aea0 <aVfs.76>, zJournal = 0x0}
(gdb) print p->endpoint.pChunk 
$3 = (FileChunk *) 0x0
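
To make the state in frame 1 concrete, here is a minimal C sketch of a chunked in-memory journal. It is an illustrative model only, not SQLite's source: the ModelChunk and ModelJournal types and all model* functions are hypothetical names, and it assumes (as the memcpy line in frame 1 suggests) that a write lands at zChunk plus the journal offset modulo the chunk size. It also assumes that a new chunk is only ever needed when that in-chunk offset is zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins for the structures seen in the gdb dump.
** Field names mirror the dump; everything else is hypothetical. */
typedef struct ModelChunk ModelChunk;
struct ModelChunk {
  ModelChunk *pNext;           /* Next chunk in the list */
  unsigned char zChunk[1016];  /* Payload; 1016 matches nChunkSize in the dump */
};

typedef struct ModelJournal {
  int nChunkSize;              /* Payload bytes per chunk */
  int64_t iOffset;             /* Models endpoint.iOffset (end of journal) */
  ModelChunk *pChunk;          /* Models endpoint.pChunk */
} ModelJournal;

/* Position of the journal end inside its chunk. */
static int modelChunkOffset(const ModelJournal *p){
  assert( p->nChunkSize>0 && p->nChunkSize<=1016 );
  return (int)(p->iOffset % p->nChunkSize);
}

/* Under the assumptions above, a missing chunk is only plausible when
** the journal end sits exactly on a chunk boundary. */
static int modelEndpointIsConsistent(const ModelJournal *p){
  return p->pChunk!=NULL || modelChunkOffset(p)==0;
}

/* Append at most up to the end of the current chunk. Returns the number
** of bytes copied, or -1 instead of dereferencing a NULL chunk. This
** sketch never allocates or advances to a following chunk. */
static int modelAppend(ModelJournal *p, const void *zBuf, int nByte){
  int iChunkOffset = modelChunkOffset(p);
  int nSpace = p->nChunkSize - iChunkOffset;
  assert( nByte>=0 );
  if( p->pChunk==NULL ) return -1;
  if( nByte<nSpace ) nSpace = nByte;
  memcpy(p->pChunk->zChunk + iChunkOffset, zBuf, (size_t)nSpace);
  p->iOffset += nSpace;
  return nSpace;
}
```

With the values from the dump (nChunkSize = 1016, iOffset = 6252500, pChunk = 0x0), the in-chunk offset is 6252500 mod 1016 = 36. That is mid-chunk, so under this model a chunk should already be attached, and modelEndpointIsConsistent() reports the state as inconsistent.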

The INSERT is actually nested inside a SELECT from a temporary table, and there are some other INSERTs also happening as rows are copied and modified according to business logic. All the access is happening on the same connection and within a savepoint. There are no threads or other database connections within the segfaulting process, though there may be other processes attempting to concurrently access the same database.

(Edit: Further testing shows the crash also happens when only one process is accessing the database and the others are stopped.)

My config is like this:

3.35.5 2021-04-19 18:32:05 1b256d97b553a9611efca188a3d995a2fff712759044ba480f9a0c9e98fae886
COMPILER=gcc-10.3.1 20210422 (Red Hat 10.3.1-1)
DEFAULT_FOREIGN_KEYS
DEFAULT_WAL_SYNCHRONOUS=1
ENABLE_API_ARMOR
HAVE_ISNAN
LIKE_DOESNT_MATCH_BLOBS
MAX_EXPR_DEPTH=0
OMIT_AUTHORIZATION
OMIT_DECLTYPE
OMIT_DEPRECATED
OMIT_LOAD_EXTENSION
OMIT_PROGRESS_CALLBACK
OMIT_SHARED_CACHE
OMIT_UTF16
REVERSE_UNORDERED_SELECTS
THREADSAFE=1

I'm very willing (hopeful even) to consider this is a bug in my application or API usage, but I'm struggling to see what could have gone wrong to cause this.

Any suggestions for things to try would be gratefully received, though I've not been able to make a simple reproduction of this case yet.

(2) By Mike (mike.mcternan) on 2021-05-21 16:37:38 in reply to 1.1

From further testing, it looks like sqlite-amalgamation-3340100 doesn't crash, but sqlite-amalgamation-3350000 does segfault - in memjrnlWrite().

(3) By Richard Hipp (drh) on 2021-05-21 16:41:37 in reply to 1.1

How can we recreate this problem? What queries are you running? What is your schema?

(4) By Mike (mike.mcternan) on 2021-05-22 21:18:49 in reply to 3

Apologies, the schema is quite large and there are lots of parameters in the query which I think are all mostly noise, so I've not shared it. I've tried making a reduction to demonstrate / recreate the problem, but not been successful. The query itself is mostly in the stack dump, and I think it's unremarkable (though my assessment is fairly unqualified), though it may be relevant that I'm performing lots of inserts.

So I've been trying to bisect Fossil changes to see where things start to go wrong.

This has proved difficult as the crash doesn't always reliably happen. It's actually a bit strange, because if I run, say, the 3.35.0 release until it crashes, it will then crash every time I re-attempt the same transaction. Going down through Fossil revisions will show the crash until I hit a revision which doesn't crash. And then at that point, if I go back to 3.35.0 it won't immediately crash again and I have to run for 20 minutes or longer to get it back into the 'crash every time' state so I can try a new bisection point. If I back up the sqlite files (-wal and -shm) when it is in 'crash every time' mode and restore them, it will make the 'faulty' versions crash every time, if that makes sense.

Having done this a lot, I believe this change is the culprit and the area of the change fits the area of the segfault:

https://www.sqlite.org/cgi/src/info/23ca23894af352ea

Specifically, sources here and later will segfault after some time and then segfault every time the transaction is retried:

3.35.0 2021-02-23 16:40:47 23ca23894af352ea351c9efcdd7d86b82455f4c81b6001052a6d13aa2d70alt2

Sources from here, and preceding changes, do not segfault in these tests:

3.35.0 2021-02-23 15:53:22 20689468100aed264877111367b42837ca19e63e717fed2ebd4b20b908f13178
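
The search for the first crashing revision described above can be sketched as an ordinary bisection. This is a minimal C sketch with hypothetical names (CrashTest, firstBadRevision, crashesAtOrAfter), not a tool used in this thread. It assumes the revisions are ordered oldest to newest, that the oldest is known good and the newest known bad, and that the crash test gives a repeatable answer for each revision, which here depends on first restoring the saved 'crash every time' database files before each run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: returns nonzero if the build made from the
** revision at index iRev crashes on the saved database state. */
typedef int (*CrashTest)(size_t iRev, void *pCtx);

/* Return the index of the first crashing revision among nRev revisions
** ordered oldest (index 0) to newest (index nRev-1). Index 0 is assumed
** good and index nRev-1 is assumed bad; neither is re-tested. */
static size_t firstBadRevision(size_t nRev, CrashTest xCrashes, void *pCtx){
  size_t iGood = 0;
  size_t iBad = nRev - 1;
  assert( nRev>=2 );
  while( iBad - iGood > 1 ){
    size_t iMid = iGood + (iBad - iGood)/2;
    if( xCrashes(iMid, pCtx) ){
      iBad = iMid;     /* First bad revision is at or before iMid */
    }else{
      iGood = iMid;    /* First bad revision is after iMid */
    }
  }
  return iBad;
}

/* Stand-in crash test for demonstration only: every revision at or
** after the index stored in *pCtx is treated as crashing. */
static int crashesAtOrAfter(size_t iRev, void *pCtx){
  return iRev >= *(size_t*)pCtx;
}
```

A real crash test would build the given revision and re-run the failing transaction. Because a single clean run does not prove a revision is good when the crash is intermittent, the result is only as reliable as that test.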

I am using PRAGMA journal_mode=WAL and PRAGMA temp_store=2, in case it is relevant.

(5) By Dan Kennedy (dan) on 2021-05-24 15:00:18 in reply to 4

Thanks for reporting this. If you get the chance, can you confirm that it has been fixed on trunk?

https://sqlite.org/src/info/17960165f5840cab

Thanks,

Dan.

(6) By Mike (mike.mcternan) on 2021-05-24 15:32:38 in reply to 5

Starting with a database that reliably segfaulted every time, I updated and ran the following version and see the problem is fixed:

3.36.0 2021-05-24 14:35:19 17960165f5840cab45b7a8bb02779ebfb321c68f33ec6da9ab14063ccd134fa4

Many thanks for the fix and sorry I couldn't provide a small reproduction - thank you again for spending the time to figure it out, and your description on the change looks accurate to the code I'm running too.