SQLite Forum

Runtime of and opcode counts of ANALYZE

(1) By sqlighter (macsql20) on 2021-03-14 17:30:39 [link] [source]

Platform: Windows, SQLite 3.31.0, 64 Bit

What I do

My software has used SQLite for a long time to manage databases of all sizes.

One feature in my software is a diagnosis routine which performs tons of logical checks on the data stored in the SQLite tables. And in a final step, the diagnosis code runs these two SQLite commands:

PRAGMA integrity_check(100);

followed by:

ANALYZE;

to let SQLite check the consistency of the database.

When it works correctly

My software runs the diagnosis in a background thread while showing a progress dialog to the user. To enable the user to abort the lengthy integrity_check / ANALYZE, I set a callback using sqlite3_progress_handler() with N=50,000 opcodes.

For smaller databases, these two commands take a few seconds. For larger databases, they can take several minutes.

For example, a database with 20 GB on disk needs about 7 minutes to complete and the callback function is called 28,000 times. Which is fine.

The Problem

A customer reported that this step never finishes on his PC. He sent me the database today. It is 18 GB and I had to abort the ANALYZE step after waiting for more than one hour.

According to Windows Task Manager, the "bytes read / bytes written" counters no longer increased, or only extremely slowly (~5% CPU load for the process, on a 12-core system). It looked like SQLite was running in a tight loop or something.

After about one hour, the callback function reported a call count of a whopping 971,855,609,341 calls! So, this seemed to be the problem.

Solution / Work-around

When I no longer set a callback function, the two commands take about 22 minutes to complete. Still a lot slower, but at least they finish successfully. On the other database, the time for the two steps stays roughly the same with or without a callback function set.

My Question

Is there anything I can do to figure out why ANALYZE would take such a massive amount of time and produce such a massive number of calls to the callback function for this particular database? I have never experienced anything similar, and I have used SQLite for many years.

(2.1) By Larry Brasfield (larrybr) on 2021-03-14 18:41:55 edited from 2.0 in reply to 1 [link] [source]

Please accept my apology for stating the obvious here (if it is obvious to you).

If we assume, (as I do), that the progress handler should have no side effects insofar as the SQLite library's operation is concerned, and takes negligible time compared to the work done by the library, then (it stands to reason that) we must focus on side-effects and execution time of your progress handler.1 Hence, I would suggest an incremental approach to figuring out where your handler goes awry, starting with an empty, do-nothing-quickly version and progressing (so to speak) to your fully fledged version that seems to misbehave. I would expect, as your progress handler fledges, that it will suddenly exhibit the problematic behavior you report, and at that increment of fledging you will gain a pretty good idea as to why, or at least of what needs intensive study for side-effects or pokiness.

Even if you dislike this trouble-shooting approach, something like it will be necessary to gain serious attention to your problem. As things stand now, your report can be paraphrased, "When I install my mystery progress callback, things go wrong such that my DB operation seems to never terminate." The natural question then is, "Well, what does that handler do?"

After about one hour, the callback function reported a call count of a whopping 971,855,609,341 calls! So, this seemed to be the problem.

I suggest a slight perspective shift: That (whopping count) seems to be a manifestation of the problem. Excessive calls may be "the problem" as far as you care, but if your code is the cause of that, then treating it as mere library misbehavior is going to unduly compromise your debugging progress.


1. I read your post as saying that the library operations proceed as you expect and wish when no progress callback is in place.

(3) By sqlighter (macsql20) on 2021-03-14 21:05:29 in reply to 2.1 [link] [source]

This is a callback function that is called by SQLite (!) while doing the ANALYZE operation.

The callback I've created for this test just increments a counter variable. SQLite calls the callback 28,000 times for database A. And for the database B with the problem 971,855,609,341 times! The databases are about the same size and with the same schema.

My customer reports that he let it run over night, without ANALYZE ever coming to an end. For this particular database, the runtime drops to 22 minutes without a callback. For other databases (even twice the size), the runtime is identical whether or not a callback is set. This is what puzzles me. I need to figure out what makes this database so special.

(4) By Joshua Randall (jrandall) on 2021-03-14 21:44:46 in reply to 3 [link] [source]

Please forgive me for essentially reiterating what larrybr said, but if your function does anything at all, I would also recommend his approach of incrementally adding in functionality.

It sounds like you are saying that it is very simple, which is great as that process should be pretty quick.

You've said that setting the handler to NULL makes the runtime much faster, and that your handler does have some functionality (it is doing something to increment the counter).

What happens if you register a minimal handler that literally has no body (i.e. all it does is return 0; - no variable declarations or actions of any kind)?

If you do that test and it is still slow, that would provide good evidence that we should start looking in SQLite itself. If not, it would be helpful to have you share the minimal handler body that reproduces the problem.

(5) By Larry Brasfield (larrybr) on 2021-03-14 23:06:37 in reply to 3 [link] [source]

What you are describing is inexplicable under the assumption that the same SQLite library code runs the same way with or without calling the callback, except for calling the callback. That's the way it ought to work, as long as there is no heap corruption, outside interference with SQLite's data structures that it maintains for connections, or stack corruption or unbounded growth. None of those preconditions can be safely assumed to exist when the behavior you describe occurs.

There need be nothing "special" about your unfortunate customer's database other than that, by being different, it helps expose a bug in somebody's code, either yours or SQLite's. It is a fact of programming in near-machine-level environments that programming errors can produce undefined behavior, and such behavior can include an absence of externally visible deleterious effects. Or, with a slight change to where a bad pointer write lands, the undefined behavior can become highly visible and clearly wrong. This is the perspective shift I am urging upon you. Do not insist that, just because you see one component acting badly, there must be a flaw in that component.

You say your "callback ... just increments a counter". I dare say it does some other things that you do not see in your C or C++ code (or whatever it is, as yet unstated.) For example, it returns, and that will involve stack cleanup on one or the other end of the call, determined by calling convention. If you misinform the compiler about that, (with a C-style cast or a reinterpret_cast in C++), you may get away with it sometimes and sometimes not.

You reiterate what is going wrong, without showing the only code which you claim makes all the difference in whether misbehavior occurs. Given that your callback is so simple, I suggest you share the code for it, all of it, including any macros that affect its definition and the statement(s) you use to install the progress callback.

It would be interesting to see what a debug build of SQLite does with your database under the same sequence of operations. Also interesting would be a build of your application that does occasional heap integrity checking, particularly around the time that the misbehaving ANALYZE operation is done.

I reiterate my advice to make the callback truly do nothing, and gradually make it do everything you want it to do. (That should not take long given its simplicity.) Even a do-nothing callback may expose a calling-convention mismatch that is blowing up in your problematic case.

And please, do not just repeat your evidence, as if "971,855,609,341 times!" were the problem. It is a symptom of a problem you have not yet really begun to hunt down, and you have not yet provided any clues with which anybody else can guess where that hunt should proceed.

(8) By sqlighter (macsql20) on 2021-03-15 17:51:27 in reply to 5 [link] [source]

Thank you for taking the time to respond. Much appreciated. This is the callback function used for the test:

static uint64_t _phcount = 0;

static int _cdecl ProgressHandler_(void* pArgument) {
    _phcount++;
    return 0;
}

When the test is over, my application just logs the value of that counter to a log file. The callback is always the same, it is just called much more often for database B.

(13) By Larry Brasfield (larrybr) on 2021-03-15 19:38:19 in reply to 8 [link] [source]

One perhaps problematic issue I can see in this code would apply if you are compiling the SQLite library (the sqlite3.c amalgamation) with a default calling convention other than _cdecl. As you can see from the documented function signature and in sqlite3.h, the calling convention is not specified as far as the library is concerned, and so it follows the default for that compilation. Yet you are making your callback conform to the _cdecl (or old, C-style) calling convention. As I mentioned earlier, this differing calling convention can lead to problems.

I would expect this problem to have shown up at compilation time, unless you used a cast of some kind to tell the compiler to pretend the function pointer you give to sqlite3_progress_handler() is just what it expects. Given that I see no other reason your (extremely simple) callback should cause trouble, I would be inclined to see your call to sqlite3_progress_handler() and the compilation options used when compiling sqlite3.c.

(14) By sqlighter (macsql20) on 2021-03-17 08:11:35 in reply to 13 [link] [source]

The problem is that the callback is called 30,000 times for database A but billions of times for database B. Calling conventions or compiler flags don't change much in this scenario.

(15) By Larry Brasfield (larrybr) on 2021-03-17 11:10:25 in reply to 14 [link] [source]

Of course the calling convention does not change across those scenarios. But the effect of a calling convention mismatch could change. If neither caller nor callee remove the stack frame, as would happen with a _stdcall-assuming caller and a _cdecl-assuming callee, the resulting stack growth could be within its runtime limit in one case and not in the other.

I find it strange that you even need the _cdecl attribute. It's not in the published signature for the progress interface. And it's not necessary unless the attribute overrides the default for a compilation. That is what leads me to suspect that you are compiling with _stdcall (or something other than _cdecl) as the default.

(6) By Rowan Worth (sqweek) on 2021-03-15 05:38:24 in reply to 1 [link] [source]

What's the difference between the output if you run these commands in the sqlite3 shell on (a) the 20GB/7 minutes to analyze DB and (b) the 18GB/22 minutes to analyze DB?

.stats on
.timer on
ANALYZE;

(I guess also repeat this for PRAGMA integrity_check(100))

(9) By sqlighter (macsql20) on 2021-03-15 17:52:13 in reply to 6 [link] [source]

Thanks. Good tip! When I have time tonight I will run these and also sqlite3_analyzer to figure out what the difference is between the two databases.

(7) By Richard Hipp (drh) on 2021-03-15 12:05:04 in reply to 1 [link] [source]

Is the excess time being used by integrity_check or ANALYZE?

My guess would be integrity_check, as that is an O(NlogN) operation whereas ANALYZE is O(N).

Have you considered running "PRAGMA quick_check(100)" instead of "PRAGMA integrity_check(100)"? The difference is that quick_check will not cross-check tables and indexes. In other words, quick_check does not verify that every table entry has all the correct index entries, and that every index entry corresponds to exactly one table entry. Quick_check verifies that each table and index is self-consistent, but it does not verify cross-consistency. The lack of cross-checking means that quick_check is able to run in O(N) time.

(10) By sqlighter (macsql20) on 2021-03-15 18:31:10 in reply to 7 [source]

Yes, you are correct.

I did the tests with sqlite3 for both databases and also used sqlite3_analyzer. I'm not sure which info is really relevant, so:

For PRAGMA integrity_check(100);

Database A:

Run Time: real 425.265 user 69.515625 sys 78.218750

Database B:

Run Time: real 716.021 user 352.218750 sys 132.031250

For ANALYZE;

Database A:

Run Time: real 75.830 user 6.234375 sys 12.093750

Database B:

Run Time: real 44.232 user 39.859375 sys 4.312500

sqlite3_analyzer:

NOTE: Two large tables were removed from the problem database B to make it transferable. It has over 50 GB in its original state.

Database A:

Pages in the whole file (measured)................ 20739558  
Pages in the whole file (calculated).............. 20739557  
Pages that store data............................. 20739557   100.000% 
Pages on the freelist (per header)................ 0            0.0% 
Pages on the freelist (calculated)................ 1            0.0% 
Pages of auto-vacuum overhead..................... 0            0.0% 
Number of tables in the database.................. 67        
Number of indices................................. 65        
Number of defined indices......................... 64        
Number of implied indices......................... 1         
Size of the file in bytes......................... 21237307392
Bytes of user payload stored...................... 18278789695  86.1% 

Database B:

Pages in the whole file (measured)................ 4285025   
Pages in the whole file (calculated).............. 4285024   
Pages that store data............................. 4284283     99.983% 
Pages on the freelist (per header)................ 741          0.017% 
Pages on the freelist (calculated)................ 742          0.017% 
Pages of auto-vacuum overhead..................... 0            0.0% 
Number of tables in the database.................. 62        
Number of indices................................. 63        
Number of defined indices......................... 62        
Number of implied indices......................... 1         
Size of the file in bytes......................... 17551462400
Bytes of user payload stored...................... 8154369696  46.5%

(12) By sqlighter (macsql20) on 2021-03-15 18:34:17 in reply to 10 [link] [source]

Deleted. Missed some info.

(11) By sqlighter (macsql20) on 2021-03-15 18:32:57 in reply to 7 [link] [source]

Yes, you are correct.

I did the tests with sqlite3 for both databases and also used sqlite3_analyzer. I'm not sure which info is really relevant, so:

For PRAGMA integrity_check(100);

Database A:

Virtual Machine Steps:               583825680
Run Time: real 425.265 user 69.515625 sys 78.218750

Database B:

Virtual Machine Steps:               -1358706478 <=========== ?
Run Time: real 716.021 user 352.218750 sys 132.031250

For ANALYZE;

Database A:

Virtual Machine Steps:               865130611
Run Time: real 75.830 user 6.234375 sys 12.093750

Database B:

Virtual Machine Steps:               327623562
Run Time: real 44.232 user 39.859375 sys 4.312500

sqlite3_analyzer:

NOTE: Two large tables were removed from the problem database B to make it transferable. It has over 50 GB in its original state.

Database A:

Pages in the whole file (measured)................ 20739558  
Pages in the whole file (calculated).............. 20739557  
Pages that store data............................. 20739557   100.000% 
Pages on the freelist (per header)................ 0            0.0% 
Pages on the freelist (calculated)................ 1            0.0% 
Pages of auto-vacuum overhead..................... 0            0.0% 
Number of tables in the database.................. 67        
Number of indices................................. 65        
Number of defined indices......................... 64        
Number of implied indices......................... 1         
Size of the file in bytes......................... 21237307392
Bytes of user payload stored...................... 18278789695  86.1% 

Database B:

Pages in the whole file (measured)................ 4285025   
Pages in the whole file (calculated).............. 4285024   
Pages that store data............................. 4284283     99.983% 
Pages on the freelist (per header)................ 741          0.017% 
Pages on the freelist (calculated)................ 742          0.017% 
Pages of auto-vacuum overhead..................... 0            0.0% 
Number of tables in the database.................. 62        
Number of indices................................. 63        
Number of defined indices......................... 62        
Number of implied indices......................... 1         
Size of the file in bytes......................... 17551462400
Bytes of user payload stored...................... 8154369696  46.5%

(16) By sqlighter (macsql20) on 2021-03-17 18:31:06 in reply to 7 [link] [source]

I'm using integrity_check(100) to ensure that the database is undamaged. The quick check runs faster, but does not cross-check tables and indices, so it may miss database file damage in those areas.

I have added the check because users do all kinds of crazy things, from wobbly external hard disks with crappy cables, to NAS storage for the database, to Wi-Fi access to the database NAS, to abused USB sticks for database transfer...

The integrity_check is the best measure to check a database and inform users about issues before the "malformed on disk" panic message shows up suddenly when accessing less often used sections of the database.

I have included the stats for the good and super-slow database below. Sorry for the duplicate post. I missed some info in my first post and the "Delete" button for a post seems to do nothing?

Another finding: Running integrity_check in sqlite3 with .progress 50000 seems to call the callback only half as often as the callback is called in my code...

Maybe there is some setting or #pragma my software uses that causes this? I could not find anything in the docs, though.