SQLite Forum

Unexepected commit result with two threads application
Login

Unexepected commit result with two threads application

(1) By Max (Maxulite) on 2020-06-02 09:59:02 [link] [source]

Hi

I met an unexpected Sqlite behavior in my note-taking application on Windows. But I may have wrong assumptions about how locking is working so maybe a fix in the code is needed.

The application basically uses the same db in two threads. But each one opens a separate db connection (the same db physical path). 
- Thread 1 is about all general information retrieval and change (UI thread).
- Thread 2 is for quick search in the search box in order to show a popup list of the phrases met a search criteria while the user is typing. 

The problem is when the application want to commit in the main thread (after several Update statements), but quick type search is still searching, the overall consequence looks like a rollback effect: commit returns SQLITE_OK, no new changes in the data, the rollback file is gone and the date of the main db file is the one that was before this change was tried.  

I checked this with 3.32.01 version, Windows 7, 32 bit app 

Do I expect too much about auto-locking handling and probably still need some inter-thread locking mechanism despite the fact that the connections are different? 

Max

(2) By Rowan Worth (sqweek) on 2020-06-02 10:36:15 in reply to 1 [link] [source]

Hm nope your expectations are reasonable -- that should work fine using a separate connection for each thread.

How are you handling errors during COMMIT? Do you have a busy handler or a timeout configured?

In the default rollback journal mode, once the updates have run and the change is ready to commit¹, sqlite acquires the PENDING lock to indicate it is ready to modify the main db file, but it has to wait until all readers finish and relinquish their SHARED locks before it can acquire the EXCLUSIVE lock and actually write the new data.

If the busy timeout expires before the readers finish, then you will get an SQLITE_BUSY error. So if your policy on errors is to rollback then that would explain the behaviour you're seeing. The default configuration for a connection is no timeout, ie. you will immediately get SQLITE_BUSY if any readers are active. See https://www.sqlite.org/pragma.html#pragma_busy_timeout

¹ note this can also happen before commit time if your transaction is large enough to spill sqlite's memory cache; see https://www.sqlite.org/pragma.html#pragma_cache_spill

(3) By Max (Maxulite) on 2020-06-02 11:14:38 in reply to 2 [link] [source]

Thanks,

It looks like I have no busy handler, only greedy waiter (no sleeping), and I just looked several times, the main thread tried to do commit for update operations and got SQLITE_BUSY two times (sqlite3_step calls), after that SQLITE_DONE with the consequences I described. I checked this several times, no exception, every time the counter of SQLITE_BUSY results is always 2 before SQLITE_DONE. And very fast (subjectively)

I know that it's better to do a graceful waiting with my own sleep or little help from Sqlite, but I suppose the greedy waiter can't be the reason behind this issue.

(6.1) By Simon Slavin (slavin) on 2020-06-02 15:18:21 edited from 6.0 in reply to 3 [link] [source]

You should set a timeout on each connection. This will allow SQLite to handle SQLITE_BUSY internally, not bothering your software but handling the wait-and-retry loop itself. Two threads accessing the same database very quickly will definitely block each other occasionally and setting a timeout is a fast and simple way to solve the problem.

For testing purposes I recommend you set a timeout on each connection to a minute. This will allow you to figure out what your software is doing. Once you have a working system you might bring the value down to five seconds or so.

https://sqlite.org/c3ref/busy_timeout.html

(4) By Max (Maxulite) on 2020-06-02 13:36:51 in reply to 2 [link] [source]

Well, it looks like I need to require some additional knowledge about locking, transaction kinds and deadlocks when db is busy. But after some reading I have a question regarding SQLITE_BUSY. Is it safe to assume that any Sqlite function that returned SQLITE_BUSY should not be called second time?

(5.1) By Larry Brasfield (LarryBrasfield) on 2020-06-02 14:53:50 edited from 5.0 in reply to 4 [link] [source]

If the call which returned SQLITE_BUSY is from a different thread than the one which is using the DB exclusively, then retrying is indicated. Presumably, the busy state will end at some time. If not, or if the thread which creates the busy state is also making the call which returns SQLITE_BUSY, then you have a design flaw to correct.

(7) By Max (Maxulite) on 2020-06-02 15:30:05 in reply to 5.1 [link] [source]

Thanks, indeed according to lang_transaction.html

"SQLite attempts to undo just the one statement it was working on and leave changes from prior statements within the same transaction intact ... However, ... it might be necessary for SQLite to rollback. An application can tell which course of action SQLite took by using the sqlite3_get_autocommit()... "

It's interesting that in my case the codes and states are strange or at least make no sense to me.

COMMIT sqlite3_step in the main thread when supposedly another thread is still reading.

  • SQLITE_BUSY, sqlite3_get_autocommit(): 0
  • SQLITE_BUSY, sqlite3_get_autocommit(): 0
  • SQLITE_DONE, sqlite3_get_autocommit(): 1

I could stop trying if the step 2 gave me autocommit: 1, but there's no one, so the next step looked like success, but actually according to the data, led to a rollback

I still suspect that there might be something wrong with my code.

(9) By anonymous on 2020-06-03 08:13:13 in reply to 7 [link] [source]

SQLITE_BUSY, sqlite3_get_autocommit(): 0

It looks like you have explicitly started a transaction (which disables autocommit). The documentation for sqlite3_step() says:

SQLITE_BUSY means that the database engine was unable to acquire the database locks it needs to do its job. <...> If the statement is not a COMMIT and occurs within an explicit transaction then you should rollback the transaction before continuing.

Perhaps this SQLITE_BUSY shouldn't be retried?

(8) By Max (Maxulite) on 2020-06-02 16:50:01 in reply to 1 [link] [source]

As a last resort, I tried to make tests with different Sqlite versions (dll). And pinpointed the change to the version 3.17.0, so 3.16.2 and below correctly waits for the reading thread to finish, but 3.17.0 and above shows the symptoms I described elsewhere, including the win32 dll from the site.

The history information about 3.17.0 doesn't show anything interesting that might be related to locks or something, only "Perform some UPDATE operations in a single pass instead of in two passes" might be related. Also quickly scrolled through WinMerge difference between amalgamation sources, also didn't notice anything suspected given that I'm not the developer.

I wonder is there an easy way to download amalgamation betas chronologically between the releases? My "binary" search then will be able to find the change in question.

(10) By anonymous on 2020-06-03 08:14:42 in reply to 8 [link] [source]

I wonder is there an easy way to download amalgamation betas chronologically between the releases? My "binary" search then will be able to find the change in question.

fossil bisect, then?

(11.1) By Warren Young (wyoung) on 2020-06-03 11:49:57 edited from 11.0 in reply to 8 [source]

Steps:

  1. Clone SQLite using Fossil and open it into a check-out directory.

  2. Say:

    fossil bisect good version-3.16.2
    fossil bisect bad  version-3.17.0
    ./configure --with-whatever-args-you-need
    
  3. make sqlite3.c

  4. Copy sqlite3.c (the amalgamation) to wherever you build from normally, build and test your project, and make a verdict.

  5. Either say fossil bisect good or fossil bisect bad to record your verdict and get a new version to test.

  6. If Fossil reports that the bisect is complete, skip to next step; else goto 3.

  7. Say fossil status and send the result here.

However, realize that there's a fair chance that the behavior was changed on purpose and won't be changed back. Examining the commit comment for the version reported in step 7 or the commit diff might help you to understand which is the case.

(12) By Max (Maxulite) on 2020-06-03 15:05:50 in reply to 11.1 [link] [source]

Thanks, eventually not aware of the replies, made an inquiry and ended up with "manual" bisect by splitting sections at the timeline.

The change that triggers the difference is https://www.sqlite.org/cgi/src/info/2361b03b61311aab (Remove another unnecessary local variable initialization from sqlite3VdbeExec())

The code fragment in question is (the difference between versions is the line with my comment "THIS IS THE LINE ...". I looked at the latest sqlite, basically these particular lines are the same, only isInterrupted wrapped in AtomicLoad)

assert( p->rc==SQLITE_OK || (p->rc&0xff)==SQLITE_BUSY );
assert( p->bIsReader || p->readOnly!=0 );
p->rc = SQLITE_OK; // THIS IS THE LINE THAT TRIGGERS THE CHANGE
p->iCurrentTime = 0;
assert( p->explain==0 );
p->pResultSet = 0;
db->busyHandler.nBusy = 0;
if( db->u1.isInterrupted ) goto abort_due_to_interrupt;

I also noticed isInterrupted field here so I looked once more at my code and saw also an interrupt action so in order to clarify, the steps are

  • Thread1 (Main UI) asks the Thread2 Connection to sqlite3_interrupt, so the call is from Thread1, but for the connection of Thread2. This is because a new text is being entered so the previous query is no longer needed
  • Thread2 by reacting to an event triggered by the previous step starts a read-only query
  • Thread 1 begins deferred transaction, updates and calls Commit. If the line with p->rc = SQLITE_OK is omitted then sqlite3_step returns SQLITE_BUSY twice and returns SQLITE_DONE but failing to write. If the line is present then sqlite3_step allows calling it multiply times with SQLITE_BUSY result and correctly writes.

I'm not the developer so this is only my speculation. According to the first assert the p->rc can have different values (not only SQLITE_OK), but if the interruption (last line) is triggered its final state might be different in two Sqlite versions so this change may be an optimization with unnoticed side effects.

(13.2) By Richard Hipp (drh) on 2020-06-03 18:49:29 edited from 13.1 in reply to 12 [link] [source]

I have worked all day trying to reproduce this, but without success. Can you provide more detail, or perhaps a test case?

Never mind. I found a test case. Testing the patch now.

(14) By Max (Maxulite) on 2020-06-03 19:28:03 in reply to 13.0 [link] [source]

Thanks,
I will try to find something additional, but at the current moment I made a following test with logging
  ....
  if (p->rc != SQLITE_OK)
    sqlite3_log(..., "dif code: %u", p->rc);
  p->iCurrentTime = 0;
  assert( p->explain==0 );
  p->pResultSet = 0;
  db->busyHandler.nBusy = 0;
  if( AtomicLoad(&db->u1.isInterrupted) ) {
    sqlite3_log(..., "to interrupt: %u", p->rc);
    goto abort_due_to_interrupt;
  }
  ....

and the first log call (dif code) is triggered in the Thread1 (MainUI, on Commit sqlite3_step, I see this at the stack trace) with p->rc:SQLITE_BUSY and the second one (to interrupt) never called. So it's probably somewhere in the following code when the p->rc values difference might matter

(15) By Richard Hipp (drh) on 2020-06-03 19:33:39 in reply to 13.2 [link] [source]

(16) By Max (Maxulite) on 2020-06-03 20:58:35 in reply to 15 [link] [source]

The Check-in seems to work as expected (as a double check, I see version 3.33.0)

After reading the ticket, I suspect that this issue is probably non-actual for those who does waiting the right way, so without draining the CPU by multiply calls to sqlite3_step, and by using Sqlite recommended features like busy handler.

Although I noticed that for my long query that triggered this, the usual time was about 10 seconds and if the sqlite3_busy_timeout was above this, everything was ok, but if sqlite3_busy_timeout was 3 seconds, and I tried to repeat the step, the issue seemed to return. So my guess is the issue could affect ones who does the waiting the right way, but rely on small sleeping (either by Sqlite means or not), waking multiply times up to check or make the interface responsive.

(19) By Chris Brody (brodybits) on 2020-06-11 22:23:57 in reply to 16 [link] [source]

but if sqlite3_busy_timeout was 3 seconds, and I tried to repeat the step, the issue seemed to return

Do you mean that "the issue seemed to return" even with the check-in applied, or only without the bug fix?

(17) By Chris Brody (brodybits) on 2020-06-09 18:23:34 in reply to 15 [link] [source]

The following mutation does not seem to trigger any test failures for me (I tried running make test twice with this change):

Index: src/vdbe.c
==================================================================
--- src/vdbe.c
+++ src/vdbe.c
@@ -701,11 +701,10 @@
     ** sqlite3_column_text16() failed.  */
     goto no_mem;
   }
   assert( p->rc==SQLITE_OK || (p->rc&0xff)==SQLITE_BUSY );
   testcase( p->rc!=SQLITE_OK );
-  p->rc = SQLITE_OK;
   assert( p->bIsReader || p->readOnly!=0 );
   p->iCurrentTime = 0;
   assert( p->explain==0 );
   p->pResultSet = 0;
   db->busyHandler.nBusy = 0;

The test program in the ticket does seem to reproduce the issue with the above mutation.

(18) By Richard Hipp (drh) on 2020-06-09 18:37:08 in reply to 17 [link] [source]

Thanks. The public TCL tests to not verify this condition, but the proprietary TH3 tests do. So that should prevent a regression.

This particular problem is easier expressed in C code than in TCL code. And TH3 test cases can (optionally) be written in C, which is why I put the test case there.

Dan is a genius at writing innovative TCL test cases. He might go back in and add a case to the TCL test suite at some point. But even if he does not, the TH3 cases should prevent the problem from recurring.

(20) By Chris Brody (brodybits) on 2020-06-11 22:45:48 in reply to 18 [link] [source]

I made the test case below in TCL, which is partially based on the reproduction program in the ticket.

The test case demonstrates that if the bug fix is removed, the COMMIT would prematurely report SQLITE_OK, with the change not really committed.

I would love to see this or a similar test case added to the test suite and would be happy to sign a public domain statement, if needed.

# partially based on: test/lock.test

set testdir [file dirname $argv0]
source $testdir/tester.tcl

# Create an alternative connection to the database
#
do_test repeated-commit-busy-0 {
  sqlite3 db2 test.db
} {}

do_test repeated-commit-busy-1 {
  execsql {
    CREATE TABLE t1(a PRIMARY KEY, b);
  }

  execsql {
    INSERT INTO t1 VALUES(1, 'one');
    INSERT INTO t1 VALUES(2, 'two');
    INSERT INTO t1 VALUES(3, 'three');
  }

  execsql { SELECT * FROM t1 }
} {1 one 2 two 3 three}

do_test repeated-commit-busy-2 {
  set STMT2 [sqlite3_prepare db2 "SELECT * FROM t1" -1 TAIL]
  execsql { SELECT NULL } db2
} {{}}

do_test repeated-commit-busy-3 {
  set STMT1 [sqlite3_prepare db "COMMIT" -1 TAIL]
  execsql { SELECT NULL }
} {{}}

# start reading as if in thread 2
do_test repeated-commit-busy-4 {
  sqlite3_step $STMT2
} {SQLITE_ROW}

# update from thread 1,
# should be blocked as busy
do_test repeated-commit-busy-5 {
  execsql {
    BEGIN;
    INSERT INTO t1 VALUES(4, 'four');
  }
  sqlite3_step $STMT1
} {SQLITE_BUSY}

# should be blocked as busy over and over
for {set i 0} {$i < 10} {incr i} {
  do_test repeated-commit-busy-repeat-$i {
    sqlite3_step $STMT1
  } {SQLITE_BUSY}
}

# release the read lock
do_test repeated-commit-busy-finalize-1 {
  sqlite3_finalize $STMT2
} {SQLITE_OK}

# should now be able to commit the update
do_test repeated-commit-busy-finalize-2 {
  sqlite3_step $STMT1
} {SQLITE_DONE}

# and see the update
do_test repeated-commit-busy-final-check-1 {
  execsql { SELECT * FROM t1 }
} {1 one 2 two 3 three 4 four}