SIGBUS errors in HHVM
(1) By Jeff Hemphil (jhemphill) on 2021-06-23 00:09:30
Hi all. I work on HHVM, Facebook's PHP JIT compiler. We're currently on SQLite 3.30, and we've been noticing rare SIGBUS errors whose stack traces point into SQLite code. The stack trace tends to look like this:
sqlite3WalFindFrame.constprop.0
readDbPage
getPageNormal
moveToLeftmost
sqlite3VdbeExec
sqlite3_step
sqlite3_exec
HPHP::Facts::(anonymous namespace)::AutoloadDBImpl::analyze()
I'm simply executing a prepared ANALYZE statement. The DB is stored on a disk that has plenty of free space left.
I don't have a reliable repro yet, so I'm mostly asking exploratory questions at this stage.
How can we prevent these SIGBUS errors in our application? I saw that, in 2012, this ticket led to this bugfix in SQLite's code. Could this potentially be a bug in SQLite that we can help triage?
Thank you!
Jeff
(2) By Dan Kennedy (dan) on 2021-06-23 18:27:42 in reply to 1
sqlite3WalFindFrame() accesses the *-shm file via an mmap() mapping. So one way this could happen is if some external program or malfunctioning SQLite library is truncating the *-shm file while it is being used.
In 2012, it looks like the SIGBUS happened because there was insufficient disk space, and Linux doesn't actually allocate the page until the first time it is accessed. So the first time SQLite accessed the page - SIGBUS. This can't happen if you use fallocate() (or a series of write() calls) to preallocate the pages.
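As a rough illustration of the preallocation idea (not SQLite's actual code), the sketch below reserves real disk blocks before mapping a file, so an out-of-space condition shows up as an error return rather than as a SIGBUS on first access. The helper name map_preallocated is hypothetical, and posix_fallocate() is used here as a portable stand-in for the fallocate() call mentioned above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper, not part of SQLite: open `path`, reserve `len`
 * bytes of real storage, then map the file shared and read/write.
 * Returns NULL on any failure. */
void *map_preallocated(const char *path, size_t len)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return NULL;

    /* posix_fallocate() returns an error number directly (it does not
     * set errno). A full disk is reported here, not on first touch. */
    if (posix_fallocate(fd, 0, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}
```

Note that preallocation only covers the out-of-space case. It does nothing if another process truncates the file after it has been mapped.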
After the crash happens, is the *-shm file left on disk zero bytes in size?
Dan.
(5) By Jeff Hemphil (jhemphill) on 2021-06-25 01:20:59 in reply to 2
I'll get back to you when I have a reliable answer to this question 😅
(6) By Jeff Hemphil (jhemphill) on 2021-07-12 20:08:05 in reply to 2
Yes, the *-shm file left on disk is zero bytes in size after a SIGBUS error.
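For anyone wanting to automate this check in a crash handler or post-mortem script, a minimal sketch follows. It assumes the usual WAL naming, where the shared-memory file is the database path with "-shm" appended. The helper name shm_file_size is hypothetical.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical diagnostic helper: returns the size in bytes of the
 * "<db_path>-shm" file, or -1 if the file is missing or the combined
 * path does not fit in the local buffer. */
long long shm_file_size(const char *db_path)
{
    char shm_path[4096];
    int n = snprintf(shm_path, sizeof shm_path, "%s-shm", db_path);
    if (n < 0 || (size_t)n >= sizeof shm_path) return -1;

    struct stat st;
    if (stat(shm_path, &st) != 0) return -1;
    return (long long)st.st_size;
}
```

A return value of 0 while a connection still has the database open in WAL mode would match the truncation scenario described in post (2).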
(3) By Rowan Worth (sqweek) on 2021-06-24 09:26:10 in reply to 1
The DB is stored on a disk that has plenty of free space left.
Is it a local or network disk? File I/O via mmap can result in SIGBUS if accessing a network filesystem that is experiencing problems.
(Possibly this is a reflection of the more general issue that the mmap API does not provide any way for I/O errors to be reported.)
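To make the failure mode concrete, here is a small sketch, assuming Linux, that maps a file, truncates it to zero bytes, and then reads through the stale mapping. Since a memory access has no return value to carry an error, the kernel delivers SIGBUS instead. The function name truncated_read_raises_sigbus and the handler are hypothetical, and jumping out of a signal handler like this is only for demonstration, not a recommended recovery strategy.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static sigjmp_buf bus_jmp;

/* Demo-only handler: jump back to the sigsetjmp() point. */
static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(bus_jmp, 1);
}

/* Returns 1 if reading the mapping after truncation raised SIGBUS,
 * 0 if the read succeeded, -1 on setup failure. Removes `path`. */
int truncated_read_raises_sigbus(const char *path)
{
    const size_t len = 4096;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return -1; }

    volatile unsigned char *p =
        mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) { close(fd); return -1; }

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, &old);

    volatile int rc = -1;
    if (sigsetjmp(bus_jmp, 1) == 0) {
        /* Shrink the file underneath the live mapping, then touch it. */
        if (ftruncate(fd, 0) == 0) {
            unsigned char byte = p[0]; /* page is now past end of file */
            (void)byte;
            rc = 0;
        }
    } else {
        rc = 1; /* arrived here via siglongjmp() from the handler */
    }

    sigaction(SIGBUS, &old, NULL); /* restore the previous handler */
    munmap((void *)p, len);
    close(fd);
    unlink(path);
    return rc;
}
```

The same signal is what a process sees when the backing storage fails during a page fault, which is why an mmap-based reader cannot distinguish these cases through ordinary error codes.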
(4) By Jeff Hemphil (jhemphill) on 2021-06-25 01:20:19 in reply to 3
Our developers have their SQLite DBs on local disks running Btrfs. Developers do occasionally run into SQLITE_IOERR errors when writing to disk.