SQLite User Forum

[ENOTABUG] Bug report: default journal_mode seems to delete journal without checking status

(1.3) By Gravis on 2022-11-11 23:03:12 edited from 1.2

Platform: Linux (Debian)
SQLite version: 3.39.4

From what I have deduced, the default journal_mode (DELETE?) will delete the journal while it's still needed if the status is SQLITE_BUSY (perhaps other statuses as well).

When in synchronous mode (the default), sqlite3_step() periodically results in SQLITE_BUSY (even if the file is stored on a RAM filesystem; I tried both tmpfs and ramfs). With the default journal_mode it then returns SQLITE_CANTOPEN; presumably being busy delays the read, resulting in the journal file being read after it has been deleted. At this point, even if you call sqlite3_step() again, it will continue to return SQLITE_CANTOPEN.

It's plausible that this is merely a bug in the unix vfs backend.

EDIT:
Turns out I was assuming that sqlite3_close() would never fail. It was in fact returning SQLITE_BUSY. While the first connection was still busy, I opened the database again. sqlite3_open_v2() and sqlite3_prepare_v2() both succeed, but upon calling sqlite3_step() it attempts to create an on-disk journal file (which already exists), so it returns SQLITE_CANTOPEN.

(2) By Keith Medcalf (kmedcalf) on 2022-11-09 19:42:11 in reply to 1.0

Cannot reproduce.

(3) By Dan Kennedy (dan) on 2022-11-09 19:50:34 in reply to 1.0

The journal file is only needed for the duration of a transaction. If the journal mode is DELETE, then the journal should be deleted at the end of every transaction.

SQLITE_CANTOPEN usually occurs either because SQLite is failing to create a new file, or the permissions on an existing one preclude opening it. You can probably figure out which by running under strace and looking for a failed open(2) call.

But I don't think I completely understand the middle paragraph. Can you post source code or pseudo code for us?

Dan.

(4) By Rowan Worth (sqweek) on 2022-11-10 05:05:39 in reply to 1.0

Having used SQLite over NFS with the default journal_mode for over a decade in production, with many concurrent users and lots of SQLITE_BUSY juggling, I'm highly suspicious of the claim that such a fundamental bug exists in the software.

To be fair, our deployment is quite old (SQLite 3.8.7.4), so it's not impossible that a new bug has been introduced. But I think you need to track down the actual cause of SQLITE_CANTOPEN to back up your claim rather than presuming an explanation that fits your hypothesis :)

SQLITE_BUSY represents (a) database contention ie. the file is locked by another process, OR (b) an attempt to COMMIT while a write-operation is in progress eg. an sqlite3_stmt representing an UPDATE/INSERT/DELETE has not been finalized or run to completion (ie. sqlite3_step() returning SQLITE_DONE).

The main time I've seen SQLITE_CANTOPEN on the journal file during COMMIT is when a database has been deleted and recreated but something still has the deleted DB open. Then you end up with two instances of the journal file with the same path on disk but which belong to different databases -- one refers to the newly created DB and the other refers to the deleted inode, and they interfere with each other's journals.

(5) By anonymous on 2022-11-10 14:56:34 in reply to 2

Not surprising.

I couldn't for the life of me make a minimal demo to show this flaw. Here is the whole project, which is a website scraper; it'll assert in a few minutes.

Commands to build:

git clone --recurse-submodules -b sqlitebug git@github.com:GravisZro/gpsscraper.git sqlitebug
cd sqlitebug
make

(6) By anonymous on 2022-11-10 16:33:31 in reply to 3

details in this post: https://sqlite.org/forum/forumpost/2a291fb19d2894fb

(7) By Rowan Worth (sqweek) on 2022-11-11 02:43:55 in reply to 5

From a logical perspective, your access pattern here appears to be repeated DB connections which are opened for a short period and then closed. In this mode of operation there's no reason you should ever hit SQLITE_BUSY (unless you are running multiple scraper processes in parallel, which I can't tell from the available information).

The fact that you are likely points to an object-lifetime bug in your custom bindings. I note that you rely on destructors to close/finalize/clean up as your queries/DB instances go out of scope. That makes sense, but it also means there's no error handling on these pathways: eg. if sqlite3_close returns SQLITE_BUSY during the destructor call, this will manifest as a silent failure from the application's perspective and the relevant connection will never be closed/cleaned up. I suspect some logging around these destructors will reveal something is not operating as you expect.

(8) By Gravis on 2022-11-11 14:34:38 in reply to 7

"there's no error handling on these pathways eg. if sqlite3_close returns SQLITE_BUSY during the destructor call this will manifest as a silent failure"

This was it! sqlite3_close was returning SQLITE_BUSY (a lot). The bug was mine all along! (Thanks!)

"In this mode of operation there's no reason you should ever hit SQLITE_BUSY (unless you are running multiple scraper processes in parallel"

There are no parallel operations in the scraper but I still get SQLITE_BUSY.

"your access pattern here appears to be repeated DB connections which are opened for a short period and then closed."

A poor design based on assumptions. It has been rectified.