SQLite

Timeline
Login

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

About branch approximate-analyze

This branch adds a new pragma, "PRAGMA analysis_limit=N", where N is a non-negative 32-bit integer. If N is zero, everything works as it always has. But if N is postive, then the ANALYZE command omits computing sqlite_stat4 and uses a sample of each index to compute sqlite_stat1. The size of the sample is usually N rows or less, but can be as large as 2*N in some cases.

Setting N to 100 or 200 will compute sqlite_stat1 much faster with results that are nearly as good. The idea here is that developers who do not run "PRAGMA optimize" on their databases because "it is too expensive" now have less of an excuse. Simply set "PRAGMA analysis_limit=200" prior to running "PRAGMA optimize" and then even hundred-megabyte databases with lots of indexes will run in milliseconds.

The analysis_limit works by only scanning a prefix of each index. The size of the prefix is initially N rows. If the left-most column of the index has not changed in the first N rows, then it skips ahead to the first row with a different left-most column and scans N more rows. The number of rows in the index is estimated by positioning the cursor on the first row and then taking the product of the number of cells on each page starting with the page on which the cursor is positioned and going back up the b-tree to the root page.

The result of an approximate analysis is usually pretty good - certainly good enough for query planning. But it can be off for misshapen indexes. For example, if the left most column of the first 100 rows of an index (with N=100) holds consecutive integers, but the remaining 1,000,000 rows are all 9999, then the approximate scan will think that most entries in the index are unique and that it is a very selective index, when in fact the index his horrid and should be avoided. This error could have been avoided, perhaps, by sampling multiple bands along the whole length of the index. But there could still be cases where the selective parts of the index happen to be sampled while the non-selective parts were not. The only certain way to know that you haven't missed a particular index value with many repeats is to read the entire index. For most cases, using N=100 or N=200 yields excellent statistics. For the rare counter-examples, one can always set N=0 and do a complete scan.

The computation of sqlite_stat4 requires reading the entire index, and so stat4 statistics gathering is disabled when N is positive.

24 check-ins related to "approximate-analyze"
2020-05-02
03:46
Add support for PRAGMA analyze_limit and approximate analysis. (check-in: 812547525d user: drh tags: trunk)
03:29
Add a missing VdbeCoverage() macro. (Closed-Leaf check-in: 77a55c394d user: drh tags: approximate-analyze)
2020-05-01
15:04
Merge recent trunk enhancements. (check-in: 2100b2c8f3 user: drh tags: approximate-analyze)
13:45
Clarification to the sqlite3_uri() family of interfaces. Documentation enhancement only - no changes to code. (check-in: 853703cd6d user: drh tags: trunk)
2020-04-09
15:01
Merge trunk enhancements into the approximate-analyze branch. (check-in: 17901ea6a1 user: drh tags: approximate-analyze)
2020-04-07
15:07
Limit LIKE/GLOB pattern length to 100 bytes (default is 50K) when running dbsql cases in the fuzzcheck utility. (check-in: 10306118e8 user: drh tags: trunk)
2020-04-01
17:10
Add support for the "ANALYZE WITH LIMIT=N" syntax. (Leaf check-in: 7e666b6022 user: drh tags: analyze-with-limit)
01:15
Simple fixes to PRAGMA analysis_limit. (check-in: c20d4fdee2 user: drh tags: approximate-analyze)
2020-03-31
20:57
Add "PRAGMA analysis_limit=N;" to limit the number of rows visited by ANALYZE when N is positive. Positive N also disables collecting stat4. (check-in: a279b151c1 user: drh tags: approximate-analyze)
18:41
Merge recent enhancements from trunk. (check-in: c705ce266a user: drh tags: approximate-analyze)
2020-03-30
13:35
Use __atomic_load_n() and __atomic_store_n() for a few more things where they are available. (check-in: a49f8ec552 user: dan tags: trunk)
2020-03-18
15:58
Fix the build for when SQLITE_ENABLE_STAT4 is defined. (check-in: 8f0a8c2aa4 user: drh tags: approximate-analyze)
14:43
Add the SQLITE_ANALYZE_LIMIT compile-time option (expected to be temporary) that sets a threshold at which ANALYZE starts to use approximations during the analysis process. (check-in: a773fd4698 user: drh tags: approximate-analyze)
2020-03-17
17:11
Provide an estimated row count to stat_init() for STAT1 analysis. (check-in: 714419fe85 user: drh tags: approximate-analyze)
13:41
Remove the SQLITE_OMIT_BTREECOUNT option. Btree count is required. (check-in: a9bfa47aee user: drh tags: approximate-analyze)
13:07
Improved bytecode comments for the ANALYZE command. (check-in: c38ea4139d user: drh tags: approximate-analyze)
12:37
Merge changes from trunk. (check-in: 93d710262e user: drh tags: approximate-analyze)
2020-03-16
18:52
Fix handling of window functions in aggregate queries that have no GROUP BY clause. Also remove a faulty assert causing the error reported in [618156e3]. (check-in: 38e3dd389d user: dan tags: trunk)
2020-03-11
20:03
Merge enhancements from trunk. (check-in: 4945a66237 user: drh tags: approximate-analyze)
19:56
Do not factor out constant functions into the initialization section of a prepared statement, because even though they are constant, they can still throw exceptions. Instead, put such functions in an OP_Once block. This fixes ticket [3c9eadd2a6ba0aa5] and causes COALESCE() and CASE...END to be short-circuit. (check-in: c5f96a085d user: drh tags: trunk)
2020-03-10
22:55
Merge updates from trunk. (check-in: e05f743f61 user: drh tags: approximate-analyze)
19:24
Enhanced detection logic for preventing the use of static schema expressions by code generating routines. (check-in: 5f60b527b9 user: drh tags: trunk)
01:24
Background work for experiments trying to enhance ANALYZE so that it runs off of samples of the entire index and does not need to read the entire index. (check-in: 29d1cc5c36 user: drh tags: approximate-analyze)
2020-03-09
18:26
Cleaner separation of the STAT4-specific logic in the implementation of ANALYZE. (check-in: 3df07e5a9a user: drh tags: trunk)