SQLite

Timeline
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

About branch approximate-analyze

This branch adds a new pragma, "PRAGMA analysis_limit=N", where N is a non-negative 32-bit integer. If N is zero, everything works as it always has. But if N is postive, then the ANALYZE command omits computing sqlite_stat4 and uses a sample of each index to compute sqlite_stat1. The size of the sample is usually N rows or less, but can be as large as 2*N in some cases.

Setting N to 100 or 200 will compute sqlite_stat1 much faster with results that are nearly as good. The idea here is that developers who do not run "PRAGMA optimize" on their databases because "it is too expensive" now have less of an excuse. Simply set "PRAGMA analysis_limit=200" prior to running "PRAGMA optimize" and then even hundred-megabyte databases with lots of indexes will run in milliseconds.

The analysis_limit works by only scanning a prefix of each index. The size of the prefix is initially N rows. If the left-most column of the index has not changed in the first N rows, then it skips ahead to the first row with a different left-most column and scans N more rows. The number of rows in the index is estimated by positioning the cursor on the first row and then taking the product of the number of cells on each page starting with the page on which the cursor is positioned and going back up the b-tree to the root page.

The result of an approximate analysis is usually pretty good - certainly good enough for query planning. But it can be off for misshapen indexes. For example, if the left most column of the first 100 rows of an index (with N=100) holds consecutive integers, but the remaining 1,000,000 rows are all 9999, then the approximate scan will think that most entries in the index are unique and that it is a very selective index, when in fact the index his horrid and should be avoided. This error could have been avoided, perhaps, by sampling multiple bands along the whole length of the index. But there could still be cases where the selective parts of the index happen to be sampled while the non-selective parts were not. The only certain way to know that you haven't missed a particular index value with many repeats is to read the entire index. For most cases, using N=100 or N=200 yields excellent statistics. For the rare counter-examples, one can always set N=0 and do a complete scan.

The computation of sqlite_stat4 requires reading the entire index, and so stat4 statistics gathering is disabled when N is positive.

24 check-ins related to "approximate-analyze"
2020-05-02
03:46
Add support for PRAGMA analyze_limit and approximate analysis. (check-in: 81254752 user: drh tags: trunk)
03:29
Add a missing VdbeCoverage() macro. (Closed-Leaf check-in: 77a55c39 user: drh tags: approximate-analyze)
2020-05-01
15:04
Merge recent trunk enhancements. (check-in: 2100b2c8 user: drh tags: approximate-analyze)
13:45
Clarification to the sqlite3_uri() family of interfaces. Documentation enhancement only - no changes to code. (check-in: 853703cd user: drh tags: trunk)
2020-04-09
15:01
Merge trunk enhancements into the approximate-analyze branch. (check-in: 17901ea6 user: drh tags: approximate-analyze)
2020-04-07
15:07
Limit LIKE/GLOB pattern length to 100 bytes (default is 50K) when running dbsql cases in the fuzzcheck utility. (check-in: 10306118 user: drh tags: trunk)
2020-04-01
17:10
Add support for the "ANALYZE WITH LIMIT=N" syntax. (Leaf check-in: 7e666b60 user: drh tags: analyze-with-limit)
01:15
Simple fixes to PRAGMA analysis_limit. (check-in: c20d4fde user: drh tags: approximate-analyze)
2020-03-31
20:57
Add "PRAGMA analysis_limit=N;" to limit the number of rows visited by ANALYZE when N is positive. Positive N also disables collecting stat4. (check-in: a279b151 user: drh tags: approximate-analyze)
18:41
Merge recent enhancements from trunk. (check-in: c705ce26 user: drh tags: approximate-analyze)
2020-03-30
13:35
Use __atomic_load_n() and __atomic_store_n() for a few more things where they are available. (check-in: a49f8ec5 user: dan tags: trunk)
2020-03-18
15:58
Fix the build for when SQLITE_ENABLE_STAT4 is defined. (check-in: 8f0a8c2a user: drh tags: approximate-analyze)
14:43
Add the SQLITE_ANALYZE_LIMIT compile-time option (expected to be temporary) that sets a threshold at which ANALYZE starts to use approximations during the analysis process. (check-in: a773fd46 user: drh tags: approximate-analyze)
2020-03-17
17:11
Provide an estimated row count to stat_init() for STAT1 analysis. (check-in: 714419fe user: drh tags: approximate-analyze)
13:41
Remove the SQLITE_OMIT_BTREECOUNT option. Btree count is required. (check-in: a9bfa47a user: drh tags: approximate-analyze)
13:07
Improved bytecode comments for the ANALYZE command. (check-in: c38ea413 user: drh tags: approximate-analyze)
12:37
Merge changes from trunk. (check-in: 93d71026 user: drh tags: approximate-analyze)
2020-03-16
18:52
Fix handling of window functions in aggregate queries that have no GROUP BY clause. Also remove a faulty assert causing the error reported in [618156e3]. (check-in: 38e3dd38 user: dan tags: trunk)
2020-03-11
20:03
Merge enhancements from trunk. (check-in: 4945a662 user: drh tags: approximate-analyze)
19:56
Do not factor out constant functions into the initialization section of a prepared statement, because even though they are constant, they can still throw exceptions. Instead, put such functions in an OP_Once block. This fixes ticket [3c9eadd2a6ba0aa5] and causes COALESCE() and CASE...END to be short-circuit. (check-in: c5f96a08 user: drh tags: trunk)
2020-03-10
22:55
Merge updates from trunk. (check-in: e05f743f user: drh tags: approximate-analyze)
19:24
Enhanced detection logic for preventing the use of static schema expressions by code generating routines. (check-in: 5f60b527 user: drh tags: trunk)
01:24
Background work for experiments trying to enhance ANALYZE so that it runs off of samples of the entire index and does not need to read the entire index. (check-in: 29d1cc5c user: drh tags: approximate-analyze)
2020-03-09
18:26
Cleaner separation of the STAT4-specific logic in the implementation of ANALYZE. (check-in: 3df07e5a user: drh tags: trunk)