/ Check-in [21bfd47c]
Login
SQLite training in Houston TX on 2019-11-05 (details)
Part of the 2019 Tcl Conference

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Change the estimated row counts in stat1 to be one-third worst-case and two-threads average case.
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | analyze-worst-case
Files: files | file ages | folders
SHA1: 21bfd47c4245846b12aeeb7cf0212529e300b878
User & Date: drh 2016-03-01 14:31:59
Context
2016-03-04
18:45
Merge changes from trunk. check-in: 5294c977 user: drh tags: analyze-worst-case
2016-03-01
14:31
Change the estimated row counts in stat1 to be one-third worst-case and two-threads average case. check-in: 21bfd47c user: drh tags: analyze-worst-case
12:45
Fix test cases to align with the improved stats computation. check-in: 810967bf user: drh tags: analyze-worst-case
Changes
Hide Diffs Side-by-Side Diffs Ignore Whitespace Patch

Changes to src/analyze.c.

   824    824          || eCall==STAT_GET_NDLT 
   825    825     );
   826    826     if( eCall==STAT_GET_STAT1 )
   827    827   #else
   828    828     assert( argc==1 );
   829    829   #endif
   830    830     {
   831         -    /* Return the value to store in the "stat" column of the sqlite_stat1
          831  +    /* Return a value for the "stat" column of the sqlite_stat1
   832    832       ** table for this index.
   833    833       **
   834    834       ** The value is a string composed of a list of integers describing 
   835         -    ** the index. The first integer in the list is the total number of 
   836         -    ** entries in the index. There is one additional integer in the list 
   837         -    ** for each indexed column. This additional integer is an estimate of
   838         -    ** the number of rows matched by a query on the index using
   839         -    ** a key with the corresponding number of fields. In other words,
   840         -    ** if the index is on columns (a,b) and the sqlite_stat1 value is 
   841         -    ** "100 10 2", then SQLite estimates that:
          835  +    ** the index. The first integer is the (estimated) total number
          836  +    ** entries in the index. There is one additional integer for each
          837  +    ** column in the index.  The first added integer is an estimate of
          838  +    ** the number of rows that match a single key in the first column of
          839  +    ** the index.  The second added integer is an estimate on of the number
          840  +    ** of rows that match a single key consisting of the first two columns
          841  +    ** of the index.  And so forth.
          842  +    **
          843  +    ** For example, for an index on columns (a,b), if the sqlite_stat1.stat
          844  +    ** values is "100 10 2", that means there are about 100 rows in the
          845  +    ** index, and that a query against a=$key1 will match about 10 rows
          846  +    ** and a query against "a=$key1 AND b=$key2" will match about 2 rows.
          847  +    **
          848  +    ** Let V be the average number of rows that match a key, and let M
          849  +    ** be the most number of rows that match the key for any possible value
          850  +    ** of that key.  The estimate is computed as:
          851  +    **
          852  +    **     E = (2*V + M)/3
   842    853       **
   843         -    **   * the index contains 100 rows,
   844         -    **   * "WHERE a=?" matches 10 rows, and
   845         -    **   * "WHERE a=? AND b=?" matches 2 rows.
   846         -    **
   847         -    ** A worst-case estimate is used:  the maximum number of rows that
   848         -    ** could be select for any set of query parameters.  The worst case
   849         -    ** is the estimate we want for choosing indexes.
          854  +    ** Consider two indexes.  Index X has with 100 values of exactly 0 and 
          855  +    ** 100 singleton values between 1 and 100.  Index Y has 200 values
          856  +    ** evenly distributed between 1 and 20.  If only the average (V) is
          857  +    ** used in the estimate, X would have "200 2" and Y would have "200 10"
          858  +    ** and so the planner would think X is the more selective index.  And
          859  +    ** X often would be more selective.  But when searching for 0, index X
          860  +    ** would perform badly.  To avoid this problem, the M is added into the
          861  +    ** estimate so that the stat for X is "200 34" and Y is still "200 10".
          862  +    ** In this way, Y is the preferred index (all else being equal) and
          863  +    ** the pathological case is avoided.
   850    864       **
   851    865       ** For deciding whether or not to do a skip-scan, we want to know the
   852         -    ** average number of rows with the same key.  We can approximate this
   853         -    ** using the (worst case) most number of rows with the same key.  But
   854         -    ** sometimes that approximation can be badly off.  In those cases,
   855         -    ** mark the index as "noskipscan" to avoid suboptimal skip-scan plans.
          866  +    ** average number of rows (V) with the same key, not the mixed estimate
          867  +    ** E shown above.  Usually E will be close enough.  However, if E is
          868  +    ** large but V is small, that could trick the query planner into thinking
          869  +    ** that a skip-scan might work well on this index.  To avoid that, the
          870  +    ** "noskipscan" flag is added in cases where the divergence between E
          871  +    ** and V might mislead the query planner.
   856    872       */
   857    873       char *z;
   858    874       int i;
   859    875       int noSkipScan = 0;
   860    876   
   861    877       char *zRet = sqlite3MallocZero( (p->nKeyCol+2)*25 );
   862    878       if( zRet==0 ){
................................................................................
   863    879         sqlite3_result_error_nomem(context);
   864    880         return;
   865    881       }
   866    882   
   867    883       sqlite3_snprintf(24, zRet, "%llu", (u64)p->nRow);
   868    884       z = zRet + sqlite3Strlen30(zRet);
   869    885       for(i=0; i<p->nKeyCol; i++){
   870         -      u64 iVal = p->current.amxEq[i];
          886  +      u64 nDistinct = p->current.anDLt[i];
          887  +      u64 iMx = p->current.amxEq[i];                /* M: Most rows per key */
          888  +      u64 iAvg = (p->nRow+nDistinct)/(nDistinct+1); /* V: Average per key */
          889  +      u64 iVal = (iMx+iAvg*2)/3;                    /* E: The estimate */
   871    890         sqlite3_snprintf(24, z, " %llu", iVal);
   872    891         z += sqlite3Strlen30(z);
   873    892         assert( p->current.anEq[i] );
   874         -      if( iVal>=WHERE_SKIPSCAN_ONSET
   875         -       && p->current.anDLt[i] > p->nRow/(WHERE_SKIPSCAN_ONSET*2/3)
   876         -      ){
          893  +      if( iVal>=WHERE_SKIPSCAN_ONSET && iAvg<(WHERE_SKIPSCAN_ONSET*2/3) ){
   877    894           noSkipScan = 1;
   878    895         }
   879    896       }
   880    897       if( noSkipScan ) sqlite3_snprintf(14, z, " noskipscan");
   881    898       sqlite3_result_text(context, zRet, -1, sqlite3_free);
   882    899     }
   883    900   #ifdef SQLITE_ENABLE_STAT3_OR_STAT4

Changes to test/analyze.test.

   154    154   } {t1i1 {4 4} t1i2 {4 1} t1i3 {4 4 1}}
   155    155   do_test analyze-3.3 {
   156    156     execsql {
   157    157       INSERT INTO t1 VALUES(2,5);
   158    158       ANALYZE main;
   159    159       SELECT idx, stat FROM sqlite_stat1 ORDER BY idx;
   160    160     }
   161         -} {t1i1 {5 4} t1i2 {5 2} t1i3 {5 4 1}}
          161  +} {t1i1 {5 3} t1i2 {5 2} t1i3 {5 3 1}}
   162    162   do_test analyze-3.4 {
   163    163     execsql {
   164    164       CREATE TABLE t2 AS SELECT * FROM t1;
   165    165       CREATE INDEX t2i1 ON t2(a);
   166    166       CREATE INDEX t2i2 ON t2(b);
   167    167       CREATE INDEX t2i3 ON t2(a,b);
   168    168       ANALYZE;
   169    169       SELECT idx, stat FROM sqlite_stat1 ORDER BY idx;
   170    170     }
   171         -} {t1i1 {5 4} t1i2 {5 2} t1i3 {5 4 1} t2i1 {5 4} t2i2 {5 2} t2i3 {5 4 1}}
          171  +} {t1i1 {5 3} t1i2 {5 2} t1i3 {5 3 1} t2i1 {5 3} t2i2 {5 2} t2i3 {5 3 1}}
   172    172   do_test analyze-3.5 {
   173    173     execsql {
   174    174       DROP INDEX t2i3;
   175    175       ANALYZE t1;
   176    176       SELECT idx, stat FROM sqlite_stat1 ORDER BY idx;
   177    177     }
   178         -} {t1i1 {5 4} t1i2 {5 2} t1i3 {5 4 1} t2i1 {5 4} t2i2 {5 2}}
          178  +} {t1i1 {5 3} t1i2 {5 2} t1i3 {5 3 1} t2i1 {5 3} t2i2 {5 2}}
   179    179   do_test analyze-3.6 {
   180    180     execsql {
   181    181       ANALYZE t2;
   182    182       SELECT idx, stat FROM sqlite_stat1 ORDER BY idx;
   183    183     }
   184         -} {t1i1 {5 4} t1i2 {5 2} t1i3 {5 4 1} t2i1 {5 4} t2i2 {5 2}}
          184  +} {t1i1 {5 3} t1i2 {5 2} t1i3 {5 3 1} t2i1 {5 3} t2i2 {5 2}}
   185    185   do_test analyze-3.7 {
   186    186     execsql {
   187    187       DROP INDEX t2i2;
   188    188       ANALYZE t2;
   189    189       SELECT idx, stat FROM sqlite_stat1 ORDER BY idx;
   190    190     }
   191         -} {t1i1 {5 4} t1i2 {5 2} t1i3 {5 4 1} t2i1 {5 4}}
          191  +} {t1i1 {5 3} t1i2 {5 2} t1i3 {5 3 1} t2i1 {5 3}}
   192    192   do_test analyze-3.8 {
   193    193     execsql {
   194    194       CREATE TABLE t3 AS SELECT a, b, rowid AS c, 'hi' AS d FROM t1;
   195    195       CREATE INDEX t3i1 ON t3(a);
   196    196       CREATE INDEX t3i2 ON t3(a,b,c,d);
   197    197       CREATE INDEX t3i3 ON t3(d,b,c,a);
   198    198       DROP TABLE t1;
................................................................................
   201    201     }
   202    202   } {}
   203    203   do_test analyze-3.9 {
   204    204     execsql {
   205    205       ANALYZE;
   206    206       SELECT idx, stat FROM sqlite_stat1 ORDER BY idx;
   207    207     }
   208         -} {t3i1 {5 4} t3i2 {5 4 1 1 1} t3i3 {5 5 2 1 1}}
          208  +} {t3i1 {5 3} t3i2 {5 3 1 1 1} t3i3 {5 5 2 1 1}}
   209    209   
   210    210   do_test analyze-3.10 {
   211    211     execsql {
   212    212       CREATE TABLE [silly " name](a, b, c);
   213    213       CREATE INDEX 'foolish '' name' ON [silly " name](a, b);
   214    214       CREATE INDEX 'another foolish '' name' ON [silly " name](c);
   215    215       INSERT INTO [silly " name] VALUES(1, 2, 3);
   216    216       INSERT INTO [silly " name] VALUES(4, 5, 6);
   217    217       ANALYZE;
   218    218       SELECT idx, stat FROM sqlite_stat1 ORDER BY idx;
   219    219     }
   220         -} {{another foolish ' name} {2 1} {foolish ' name} {2 1 1} t3i1 {5 4} t3i2 {5 4 1 1 1} t3i3 {5 5 2 1 1}}
          220  +} {{another foolish ' name} {2 1} {foolish ' name} {2 1 1} t3i1 {5 3} t3i2 {5 3 1 1 1} t3i3 {5 5 2 1 1}}
   221    221   do_test analyze-3.11 {
   222    222     execsql {
   223    223       DROP INDEX "foolish ' name";
   224    224       SELECT idx, stat FROM sqlite_stat1 ORDER BY idx;
   225    225     }
   226         -} {{another foolish ' name} {2 1} t3i1 {5 4} t3i2 {5 4 1 1 1} t3i3 {5 5 2 1 1}}
          226  +} {{another foolish ' name} {2 1} t3i1 {5 3} t3i2 {5 3 1 1 1} t3i3 {5 5 2 1 1}}
   227    227   do_test analyze-3.11 {
   228    228     execsql {
   229    229       DROP TABLE "silly "" name";
   230    230       SELECT idx, stat FROM sqlite_stat1 ORDER BY idx;
   231    231     }
   232         -} {t3i1 {5 4} t3i2 {5 4 1 1 1} t3i3 {5 5 2 1 1}}
          232  +} {t3i1 {5 3} t3i2 {5 3 1 1 1} t3i3 {5 5 2 1 1}}
   233    233   
   234    234   # Try corrupting the sqlite_stat1 table and make sure the
   235    235   # database is still able to function.
   236    236   #
   237    237   do_test analyze-4.0 {
   238    238     sqlite3 db2 test.db
   239    239     db2 eval {
................................................................................
   245    245     db2 close
   246    246     db close
   247    247     sqlite3 db test.db
   248    248     execsql {
   249    249       ANALYZE;
   250    250       SELECT idx, stat FROM sqlite_stat1 ORDER BY idx;
   251    251     }
   252         -} {t3i1 {5 4} t3i2 {5 4 1 1 1} t3i3 {5 5 2 1 1} t4i1 {5 4} t4i2 {5 2}}
          252  +} {t3i1 {5 3} t3i2 {5 3 1 1 1} t3i3 {5 5 2 1 1} t4i1 {5 3} t4i2 {5 2}}
   253    253   do_test analyze-4.1 {
   254    254     execsql {
   255    255       PRAGMA writable_schema=on;
   256    256       INSERT INTO sqlite_stat1 VALUES(null,null,null);
   257    257       PRAGMA writable_schema=off;
   258    258     }
   259    259     db close