Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Fix up obsolete comments in FTS3 to conform to the latest nomenclature. Add new comments to better explain FTS3 operation. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
3e4a0082170155b5b779afd075a3ee65 |
User & Date: | drh 2010-03-23 15:46:41.000 |
Context
2010-03-23
| ||
18:24 | More commenting and documentation enhancements in FTS3. (check-in: 892e286709 user: drh tags: trunk) | |
15:46 | Fix up obsolete comments in FTS3 to conform to the latest nomenclature. Add new comments to better explain FTS3 operation. (check-in: 3e4a008217 user: drh tags: trunk) | |
15:29 | Close the auxiliary database db2 at the end of the crash8.test script. (check-in: 0fbdc431e8 user: drh tags: trunk) | |
Changes
Changes to ext/fts3/fts3.c.
︙ | ︙ | |||
44 45 46 47 48 49 50 | ** 14 bits - BA ** 21 bits - BBA ** and so on. ** ** This is similar in concept to how sqlite encodes "varints" but ** the encoding is not the same. SQLite varints are big-endian ** are are limited to 9 bytes in length whereas FTS3 varints are | | | | | | | | | | | | | > > < < | | | 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 | ** 14 bits - BA ** 21 bits - BBA ** and so on. ** ** This is similar in concept to how sqlite encodes "varints" but ** the encoding is not the same. SQLite varints are big-endian ** are are limited to 9 bytes in length whereas FTS3 varints are ** little-endian and can be up to 10 bytes in length (in theory). ** ** Example encodings: ** ** 1: 0x01 ** 127: 0x7f ** 128: 0x81 0x00 ** ** **** Document lists **** ** A doclist (document list) holds a docid-sorted list of hits for a ** given term. Doclists hold docids and associated token positions. ** A docid is the unique integer identifier for a single document. ** A position is the index of a word within the document. The first ** word of the document has a position of 0. ** ** FTS3 used to optionally store character offsets using a compile-time ** option. But that functionality is no longer supported. ** ** A doclist is stored like this: ** ** array { ** varint docid; ** array { (position list for column 0) ** varint position; (2 more than the delta from previous position) ** } ** array { ** varint POS_COLUMN; (marks start of position list for new column) ** varint column; (index of new column) ** array { ** varint position; (2 more than the delta from previous position) ** } ** } ** varint POS_END; (marks end of positions for this document. ** } ** ** Here, array { X } means zero or more occurrences of X, adjacent in ** memory. A "position" is an index of a token in the token stream ** generated by the tokenizer. Note that POS_END and POS_COLUMN occur ** in the same logical place as the position element, and act as sentinals ** ending a position list array. POS_END is 0. POS_COLUMN is 1. ** The positions numbers are not stored literally but rather as two more ** than the difference from the prior position, or the just the position plus ** 2 for the first position. Example: ** ** label: A B C D E F G H I J K ** value: 123 5 9 1 1 14 35 0 234 72 0 ** ** The 123 value is the first docid. For column zero in this document ** there are two matches at positions 3 and 10 (5-2 and 9-2+3). The 1 ** at D signals the start of a new column; the 1 at E indicates that the ** new column is column number 1. There are two positions at 12 and 45 ** (14-2 and 35-2+12). The 0 at H indicate the end-of-document. The ** 234 at I is the next docid. It has one position 72 (72-2) and then ** terminates with the 0 at K. ** ** A "position-list" is the list of positions for multiple columns for ** a single docid. A "column-list" is the set of positions for a single ** column. Hence, a position-list consists of one or more column-lists, ** a document record consists of a docid followed by a position-list and ** a doclist consists of one or more document records. ** ** A bare doclist omits the position information, becoming an ** array of varint-encoded docids. ** **** Segment leaf nodes **** ** Segment leaf nodes store terms and doclists, ordered by term. Leaf ** nodes are written using LeafWriter, and read using LeafReader (to ** iterate through a single leaf node's data) and LeavesReader (to ** iterate through a segment's entire leaf layer). Leaf nodes have ** the format: |
︙ | ︙ | |||
309 310 311 312 313 314 315 316 317 318 319 320 321 322 | #include "fts3.h" #ifndef SQLITE_CORE # include "sqlite3ext.h" SQLITE_EXTENSION_INIT1 #endif /* ** Write a 64-bit variable-length integer to memory starting at p[0]. ** The length of data written will be between 1 and FTS3_VARINT_MAX bytes. ** The number of bytes written is returned. */ int sqlite3Fts3PutVarint(char *p, sqlite_int64 v){ unsigned char *q = (unsigned char *) p; | > > > > > > > > > > > > > > | 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 | #include "fts3.h" #ifndef SQLITE_CORE # include "sqlite3ext.h" SQLITE_EXTENSION_INIT1 #endif /* ** The testcase() macro is only used by the amalgamation. If undefined, ** make it a no-op. */ #ifndef testcase # define testcase(X) #endif /* ** Terminator values for position-lists and column-lists. */ #define POS_COLUMN (1) /* Column-list terminator */ #define POS_END (0) /* Position-list terminator */ /* ** Write a 64-bit variable-length integer to memory starting at p[0]. ** The length of data written will be between 1 and FTS3_VARINT_MAX bytes. ** The number of bytes written is returned. */ int sqlite3Fts3PutVarint(char *p, sqlite_int64 v){ unsigned char *q = (unsigned char *) p; |
︙ | ︙ | |||
355 356 357 358 359 360 361 | sqlite_int64 i; int ret = sqlite3Fts3GetVarint(p, &i); *pi = (int) i; return ret; } /* | | < | 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 | sqlite_int64 i; int ret = sqlite3Fts3GetVarint(p, &i); *pi = (int) i; return ret; } /* ** Return the number of bytes required to encode v as a varint */ int sqlite3Fts3VarintLen(sqlite3_uint64 v){ int i = 0; do{ i++; v >>= 7; }while( v!=0 ); |
︙ | ︙ | |||
407 408 409 410 411 412 413 | } z[iOut] = '\0'; } } /* ** Read a single varint from the doclist at *pp and advance *pp to point | | | 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 | } z[iOut] = '\0'; } } /* ** Read a single varint from the doclist at *pp and advance *pp to point ** to the first byte past the end of the varint. Add the value of the varint ** to *pVal. */ static void fts3GetDeltaVarint(char **pp, sqlite3_int64 *pVal){ sqlite3_int64 iVal; *pp += sqlite3Fts3GetVarint(*pp, &iVal); *pVal += iVal; } |
︙ | ︙ | |||
543 544 545 546 547 548 549 550 551 552 553 554 555 556 | return rc; } /* ** Create the backing store tables (%_content, %_segments and %_segdir) ** required by the FTS3 table passed as the only argument. This is done ** as part of the vtab xCreate() method. */ static int fts3CreateTables(Fts3Table *p){ int rc = SQLITE_OK; /* Return code */ int i; /* Iterator variable */ char *zContentCols; /* Columns of %_content table */ sqlite3 *db = p->db; /* The database connection */ | > > > > | 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 | return rc; } /* ** Create the backing store tables (%_content, %_segments and %_segdir) ** required by the FTS3 table passed as the only argument. This is done ** as part of the vtab xCreate() method. ** ** If the p->bHasDocsize boolean is true (indicating that this is an ** FTS4 table, not an FTS3 table) then also create the %_docsize and ** %_stat tables required by FTS4. */ static int fts3CreateTables(Fts3Table *p){ int rc = SQLITE_OK; /* Return code */ int i; /* Iterator variable */ char *zContentCols; /* Columns of %_content table */ sqlite3 *db = p->db; /* The database connection */ |
︙ | ︙ | |||
635 636 637 638 639 640 641 | /* ** This function is the implementation of both the xConnect and xCreate ** methods of the FTS3 virtual table. ** ** The argv[] array contains the following: ** | | | | | | | | | 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 | /* ** This function is the implementation of both the xConnect and xCreate ** methods of the FTS3 virtual table. ** ** The argv[] array contains the following: ** ** argv[0] -> module name ("fts3" or "fts4") ** argv[1] -> database name ** argv[2] -> table name ** argv[...] -> "column name" and other module argument fields. */ static int fts3InitVtab( int isCreate, /* True for xCreate, false for xConnect */ sqlite3 *db, /* The SQLite database connection */ void *pAux, /* Hash table containing tokenizers */ int argc, /* Number of elements in argv array */ const char * const *argv, /* xCreate/xConnect argument array */ sqlite3_vtab **ppVTab, /* Write the resulting vtab structure here */ char **pzErr /* Write any error message here */ ){ Fts3Hash *pHash = (Fts3Hash *)pAux; Fts3Table *p; /* Pointer to allocated vtab */ int rc; /* Return code */ int i; /* Iterator variable */ int nByte; /* Size of allocation used for *p */ int iCol; /* Column index */ int nString = 0; /* Bytes required to hold all column names */ int nCol = 0; /* Number of columns in the FTS table */ char *zCsr; /* Space for holding column names */ int nDb; /* Bytes required to hold database name */ int nName; /* Bytes required to hold table name */ const char *zTokenizer = 0; /* Name of tokenizer to use */ sqlite3_tokenizer *pTokenizer = 0; /* Tokenizer for this table */ nDb = (int)strlen(argv[1]) + 1; nName = (int)strlen(argv[2]) + 1; for(i=3; i<argc; i++){ |
︙ | ︙ | |||
889 890 891 892 893 894 895 896 897 898 899 900 901 902 | sqlite3Fts3ExprFree(pCsr->pExpr); sqlite3_free(pCsr->aDoclist); sqlite3_free(pCsr->aMatchinfo); sqlite3_free(pCsr); return SQLITE_OK; } static int fts3CursorSeek(sqlite3_context *pContext, Fts3Cursor *pCsr){ if( pCsr->isRequireSeek ){ pCsr->isRequireSeek = 0; sqlite3_bind_int64(pCsr->pStmt, 1, pCsr->iPrevId); if( SQLITE_ROW==sqlite3_step(pCsr->pStmt) ){ return SQLITE_OK; }else{ | > > > > > | 906 907 908 909 910 911 912 913 914 915 916 917 918 919 920 921 922 923 924 | sqlite3Fts3ExprFree(pCsr->pExpr); sqlite3_free(pCsr->aDoclist); sqlite3_free(pCsr->aMatchinfo); sqlite3_free(pCsr); return SQLITE_OK; } /* ** Position the pCsr->pStmt statement so that it is on the row ** of the %_content table that contains the last match. Return ** SQLITE_OK on success. */ static int fts3CursorSeek(sqlite3_context *pContext, Fts3Cursor *pCsr){ if( pCsr->isRequireSeek ){ pCsr->isRequireSeek = 0; sqlite3_bind_int64(pCsr->pStmt, 1, pCsr->iPrevId); if( SQLITE_ROW==sqlite3_step(pCsr->pStmt) ){ return SQLITE_OK; }else{ |
︙ | ︙ | |||
915 916 917 918 919 920 921 922 923 924 925 926 927 928 | return rc; } }else{ return SQLITE_OK; } } static int fts3NextMethod(sqlite3_vtab_cursor *pCursor){ int rc = SQLITE_OK; /* Return code */ Fts3Cursor *pCsr = (Fts3Cursor *)pCursor; if( pCsr->aDoclist==0 ){ if( SQLITE_ROW!=sqlite3_step(pCsr->pStmt) ){ pCsr->isEof = 1; | > > > > > > > > > > > | 937 938 939 940 941 942 943 944 945 946 947 948 949 950 951 952 953 954 955 956 957 958 959 960 961 | return rc; } }else{ return SQLITE_OK; } } /* ** Advance the cursor to the next row in the %_content table that ** matches the search criteria. For a MATCH search, this will be ** the next row that matches. For a full-table scan, this will be ** simply the next row in the %_content table. For a docid lookup, ** this routine simply sets the EOF flag. ** ** Return SQLITE_OK if nothing goes wrong. SQLITE_OK is returned ** even if we reach end-of-file. The fts3EofMethod() will be called ** subsequently to determine whether or not an EOF was hit. */ static int fts3NextMethod(sqlite3_vtab_cursor *pCursor){ int rc = SQLITE_OK; /* Return code */ Fts3Cursor *pCsr = (Fts3Cursor *)pCursor; if( pCsr->aDoclist==0 ){ if( SQLITE_ROW!=sqlite3_step(pCsr->pStmt) ){ pCsr->isEof = 1; |
︙ | ︙ | |||
1050 1051 1052 1053 1054 1055 1056 1057 1058 1059 1060 1061 1062 1063 1064 1065 1066 | *piPrev = iVal; } /* ** When this function is called, *ppPoslist is assumed to point to the ** start of a position-list. After it returns, *ppPoslist points to the ** first byte after the position-list. ** ** If pp is not NULL, then the contents of the position list are copied ** to *pp. *pp is set to point to the first byte past the last byte copied ** before this function returns. */ static void fts3PoslistCopy(char **pp, char **ppPoslist){ char *pEnd = *ppPoslist; char c = 0; /* The end of a position list is marked by a zero encoded as an FTS3 | > > > > > | | | | > > > > > > > > > > > > > > > > > > > > | > > | > > > | | | | | | > | | > > > > > > | | | | | | > | > > > > | | | | | | | | | > | | | | | | | | 1083 1084 1085 1086 1087 1088 1089 1090 1091 1092 1093 1094 1095 1096 1097 1098 1099 1100 1101 1102 1103 1104 1105 1106 1107 1108 1109 1110 1111 1112 1113 1114 1115 1116 1117 1118 1119 1120 1121 1122 1123 1124 1125 1126 1127 1128 1129 1130 1131 1132 1133 1134 1135 1136 1137 1138 1139 1140 1141 1142 1143 1144 1145 1146 1147 1148 1149 1150 1151 1152 1153 1154 1155 1156 1157 1158 1159 1160 1161 1162 1163 1164 1165 1166 1167 1168 1169 1170 1171 1172 1173 1174 1175 1176 1177 1178 1179 1180 1181 1182 1183 1184 1185 1186 1187 1188 1189 1190 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 1201 1202 1203 1204 1205 1206 1207 1208 1209 1210 1211 1212 1213 1214 1215 1216 1217 1218 1219 1220 1221 1222 1223 1224 1225 1226 1227 1228 1229 1230 1231 1232 1233 1234 1235 1236 1237 1238 1239 1240 1241 1242 1243 1244 1245 1246 1247 1248 1249 1250 1251 1252 1253 1254 1255 1256 1257 1258 1259 1260 1261 1262 1263 1264 1265 1266 1267 1268 1269 1270 1271 1272 1273 1274 1275 1276 1277 1278 1279 1280 1281 1282 1283 1284 1285 1286 1287 1288 1289 1290 1291 1292 1293 1294 1295 1296 1297 1298 1299 | *piPrev = iVal; } /* ** When this function is called, *ppPoslist is assumed to point to the ** start of a position-list. After it returns, *ppPoslist points to the ** first byte after the position-list. ** ** A position list is list of positions (delta encoded) and columns for ** a single document record of a doclist. So, in other words, this ** routine advances *ppPoslist so that it points to the next docid in ** the doclist, or to the first byte past the end of the doclist. ** ** If pp is not NULL, then the contents of the position list are copied ** to *pp. *pp is set to point to the first byte past the last byte copied ** before this function returns. */ static void fts3PoslistCopy(char **pp, char **ppPoslist){ char *pEnd = *ppPoslist; char c = 0; /* The end of a position list is marked by a zero encoded as an FTS3 ** varint. A single POS_END (0) byte. Except, if the 0 byte is preceded by ** a byte with the 0x80 bit set, then it is not a varint 0, but the tail ** of some other, multi-byte, value. ** ** The following while-loop moves pEnd to point to the first byte that is not ** immediately preceded by a byte with the 0x80 bit set. Then increments ** pEnd once more so that it points to the byte immediately following the ** last byte in the position-list. */ while( *pEnd | c ){ c = *pEnd++ & 0x80; testcase( c!=0 && (*pEnd)==0 ); } pEnd++; /* Advance past the POS_END terminator byte */ if( pp ){ int n = (int)(pEnd - *ppPoslist); char *p = *pp; memcpy(p, *ppPoslist, n); p += n; *pp = p; } *ppPoslist = pEnd; } /* ** When this function is called, *ppPoslist is assumed to point to the ** start of a column-list. After it returns, *ppPoslist points to the ** to the terminator (POS_COLUMN or POS_END) byte of the column-list. ** ** A column-list is list of delta-encoded positions for a single column ** within a single document within a doclist. ** ** The column-list is terminated either by a POS_COLUMN varint (1) or ** a POS_END varint (0). This routine leaves *ppPoslist pointing to ** the POS_COLUMN or POS_END that terminates the column-list. ** ** If pp is not NULL, then the contents of the column-list are copied ** to *pp. *pp is set to point to the first byte past the last byte copied ** before this function returns. The POS_COLUMN or POS_END terminator ** is not copied into *pp. */ static void fts3ColumnlistCopy(char **pp, char **ppPoslist){ char *pEnd = *ppPoslist; char c = 0; /* A column-list is terminated by either a 0x01 or 0x00 byte that is ** not part of a multi-byte varint. */ while( 0xFE & (*pEnd | c) ){ c = *pEnd++ & 0x80; testcase( c!=0 && ((*pEnd)&0xfe)==0 ); } if( pp ){ int n = (int)(pEnd - *ppPoslist); char *p = *pp; memcpy(p, *ppPoslist, n); p += n; *pp = p; } *ppPoslist = pEnd; } /* ** Value used to signify the end of an position-list. This is safe because ** it is not possible to have a document with 2^31 terms. */ #define POSITION_LIST_END 0x7fffffff /* ** This function is used to help parse position-lists. When this function is ** called, *pp may point to the start of the next varint in the position-list ** being parsed, or it may point to 1 byte past the end of the position-list ** (in which case **pp will be a terminator bytes POS_END (0) or ** (1)). ** ** If *pp points past the end of the current position-list, set *pi to ** POSITION_LIST_END and return. Otherwise, read the next varint from *pp, ** increment the current value of *pi by the value read, and set *pp to ** point to the next value before returning. ** ** Before calling this routine *pi must be initialized to the value of ** the previous position, or zero if we are reading the first position ** in the position-list. Because positions are delta-encoded, the value ** of the previous position is needed in order to compute the value of ** the next position. */ static void fts3ReadNextPos( char **pp, /* IN/OUT: Pointer into position-list buffer */ sqlite3_int64 *pi /* IN/OUT: Value read from position-list */ ){ if( (**pp)&0xFE ){ fts3GetDeltaVarint(pp, pi); *pi -= 2; }else{ *pi = POSITION_LIST_END; } } /* ** If parameter iCol is not 0, write an POS_COLUMN (1) byte followed by ** the value of iCol encoded as a varint to *pp. This will start a new ** column list. ** ** Set *pp to point to the byte just after the last byte written before ** returning (do not modify it if iCol==0). Return the total number of bytes ** written (0 if iCol==0). */ static int fts3PutColNumber(char **pp, int iCol){ int n = 0; /* Number of bytes written */ if( iCol ){ char *p = *pp; /* Output pointer */ n = 1 + sqlite3Fts3PutVarint(&p[1], iCol); *p = 0x01; *pp = &p[n]; } return n; } /* ** Compute the union of two position lists. The output written ** into *pp contains all positions of both *pp1 and *pp2 in sorted ** order and with any duplicates removed. All pointers are ** updated appropriately. The caller is responsible for insuring ** that there is enough space in *pp to hold the complete output. */ static void fts3PoslistMerge( char **pp, /* Output buffer */ char **pp1, /* Left input list */ char **pp2 /* Right input list */ ){ char *p = *pp; char *p1 = *pp1; char *p2 = *pp2; while( *p1 || *p2 ){ int iCol1; /* The current column index in pp1 */ int iCol2; /* The current column index in pp2 */ if( *p1==POS_COLUMN ) sqlite3Fts3GetVarint32(&p1[1], &iCol1); else if( *p1==POS_END ) iCol1 = POSITION_LIST_END; else iCol1 = 0; if( *p2==POS_COLUMN ) sqlite3Fts3GetVarint32(&p2[1], &iCol2); else if( *p2==POS_END ) iCol2 = POSITION_LIST_END; else iCol2 = 0; if( iCol1==iCol2 ){ sqlite3_int64 i1 = 0; /* Last position from pp1 */ sqlite3_int64 i2 = 0; /* Last position from pp2 */ sqlite3_int64 iPrev = 0; int n = fts3PutColNumber(&p, iCol1); p1 += n; p2 += n; /* At this point, both p1 and p2 point to the start of column-lists ** for the same column (the column with index iCol1 and iCol2). ** A column-list is a list of non-negative delta-encoded varints, each ** incremented by 2 before being stored. Each list is terminated by a ** POS_END (0) or POS_COLUMN (1). The following block merges the two lists ** and writes the results to buffer p. p is left pointing to the byte ** after the list written. No terminator (POS_END or POS_COLUMN) is ** written to the output. */ fts3GetDeltaVarint(&p1, &i1); fts3GetDeltaVarint(&p2, &i2); do { fts3PutDeltaVarint(&p, &iPrev, (i1<i2) ? i1 : i2); iPrev -= 2; if( i1==i2 ){ fts3ReadNextPos(&p1, &i1); fts3ReadNextPos(&p2, &i2); }else if( i1<i2 ){ fts3ReadNextPos(&p1, &i1); }else{ fts3ReadNextPos(&p2, &i2); } }while( i1!=POSITION_LIST_END || i2!=POSITION_LIST_END ); }else if( iCol1<iCol2 ){ p1 += fts3PutColNumber(&p, iCol1); fts3ColumnlistCopy(&p, &p1); }else{ p2 += fts3PutColNumber(&p, iCol2); fts3ColumnlistCopy(&p, &p2); } } *p++ = POS_END; *pp = p; *pp1 = p1 + 1; *pp2 = p2 + 1; } /* ** nToken==1 searches for adjacent positions. |
︙ | ︙ |