Many hyperlinks are disabled.
Use anonymous login
to enable hyperlinks.
Overview
Comment: | Update comments in fts3.c to more accurately describe the doclist format. |
---|---|
Downloads: | Tarball | ZIP archive |
Timelines: | family | ancestors | descendants | both | trunk |
Files: | files | file ages | folders |
SHA1: |
e424a0307359fee6875424c10ecad1a1 |
User & Date: | drh 2010-01-08 23:01:33.000 |
Context
2010-01-09
| ||
07:33 | Fix handling of an OOM error in the fts3 offsets() function. Fix a couple of snippet related test cases in e_fts3.test. (check-in: 14dc46a74a user: dan tags: trunk) | |
2010-01-08
| ||
23:01 | Update comments in fts3.c to more accurately describe the doclist format. (check-in: e424a03073 user: drh tags: trunk) | |
04:50 | Added option to restore_jrnl.tcl utility to hex dump journal pages. (check-in: 08c545f030 user: shaneh tags: trunk) | |
Changes
Changes to ext/fts3/fts3.c.
︙ | ︙ | |||
19 20 21 22 23 24 25 | ** * The FTS3 module is being built as an extension ** (in which case SQLITE_CORE is not defined), or ** ** * The FTS3 module is being built into the core of ** SQLite (in which case SQLITE_ENABLE_FTS3 is defined). */ | < < < | 19 20 21 22 23 24 25 26 27 28 29 30 31 32 | ** * The FTS3 module is being built as an extension ** (in which case SQLITE_CORE is not defined), or ** ** * The FTS3 module is being built into the core of ** SQLite (in which case SQLITE_ENABLE_FTS3 is defined). */ /* The full-text index is stored in a series of b+tree (-like) ** structures called segments which map terms to doclists. The ** structures are like b+trees in layout, but are constructed from the ** bottom up in optimal fashion and are not updatable. Since trees ** are built from the bottom up, things will be described from the ** bottom up. ** |
︙ | ︙ | |||
44 45 46 47 48 49 50 | ** B = 1xxxxxxx 7 bits of data and one flag bit ** ** 7 bits - A ** 14 bits - BA ** 21 bits - BBA ** and so on. ** | | > > > > > > > > > | > > > > > < < < < < | | | > > > > > > > > > > > > > > | 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 | ** B = 1xxxxxxx 7 bits of data and one flag bit ** ** 7 bits - A ** 14 bits - BA ** 21 bits - BBA ** and so on. ** ** This is similar in concept to how sqlite encodes "varints" but ** the encoding is not the same. SQLite varints are big-endian ** are are limited to 9 bytes in length whereas FTS3 varints are ** little-endian and can be upt to 10 bytes in length (in theory). ** ** Example encodings: ** ** 1: 0x01 ** 127: 0x7f ** 128: 0x81 0x00 ** ** **** Document lists **** ** A doclist (document list) holds a docid-sorted list of hits for a ** given term. Doclists hold docids, and can optionally associate ** token positions and offsets with docids. A position is the index ** of a word within the document. The first word of the document has ** a position of 0. ** ** FTS3 used to optionally store character offsets using a compile-time ** option. But that functionality is no longer supported. ** ** A DL_POSITIONS_OFFSETS doclist is stored like this: ** ** array { ** varint docid; ** array { (position list for column 0) ** varint position; (delta from previous position plus POS_BASE) ** } ** array { ** varint POS_COLUMN; (marks start of position list for new column) ** varint column; (index of new column) ** array { ** varint position; (delta from previous position plus POS_BASE) ** } ** } ** varint POS_END; (marks end of positions for this document. ** } ** ** Here, array { X } means zero or more occurrences of X, adjacent in ** memory. A "position" is an index of a token in the token stream ** generated by the tokenizer. Note that POS_END and POS_COLUMN occur ** in the same logical place as the position element, and act as sentinals ** ending a position list array. POS_END is 0. POS_COLUMN is 1. ** The positions numbers are not stored literally but rather as two more ** the difference from the prior position, or the just the position plus ** 2 for the first position. Example: ** ** label: A B C D E F G H I J K ** value: 123 5 9 1 1 14 35 0 234 72 0 ** ** The 123 value is the first docid. For column zero in this document ** there are two matches at positions 3 and 10 (5-2 and 9-2+3). The 1 ** at D signals the start of a new column; the 1 at E indicates that the ** new column is column number 1. There are two positions at 12 and 45 ** (14-2 and 35-2+12). The 0 at H indicate the end-of-document. The ** 234 at I is the next docid. It has one position 72 (72-2) and then ** terminates with the 0 at K. ** ** A DL_POSITIONS doclist omits the startOffset and endOffset ** information. A DL_DOCIDS doclist omits both the position and ** offset information, becoming an array of varint-encoded docids. ** ** On-disk data is stored as type DL_DEFAULT, so we don't serialize ** the type. Due to how deletion is implemented in the segmentation |
︙ | ︙ | |||
384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 | z[iOut++] = z[iIn++]; } } z[iOut] = '\0'; } } static void fts3GetDeltaVarint(char **pp, sqlite3_int64 *pVal){ sqlite3_int64 iVal; *pp += sqlite3Fts3GetVarint(*pp, &iVal); *pVal += iVal; } static void fts3GetDeltaVarint2(char **pp, char *pEnd, sqlite3_int64 *pVal){ if( *pp>=pEnd ){ *pp = 0; }else{ fts3GetDeltaVarint(pp, pVal); } } | > > > > > > > > > > > | 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 | z[iOut++] = z[iIn++]; } } z[iOut] = '\0'; } } /* ** Read a single varint from the doclist at *pp and advance *pp to point ** to the next element of the varlist. Add the value of the varint ** to *pVal. */ static void fts3GetDeltaVarint(char **pp, sqlite3_int64 *pVal){ sqlite3_int64 iVal; *pp += sqlite3Fts3GetVarint(*pp, &iVal); *pVal += iVal; } /* ** As long as *pp has not reached its end (pEnd), then do the same ** as fts3GetDeltaVarint(): read a single varint and add it to *pVal. ** But if we have reached the end of the varint, just set *pp=0 and ** leave *pVal unchanged. */ static void fts3GetDeltaVarint2(char **pp, char *pEnd, sqlite3_int64 *pVal){ if( *pp>=pEnd ){ *pp = 0; }else{ fts3GetDeltaVarint(pp, pVal); } } |
︙ | ︙ | |||
777 778 779 780 781 782 783 | *ppCsr = pCsr = (sqlite3_vtab_cursor *)sqlite3_malloc(sizeof(Fts3Cursor)); if( !pCsr ){ return SQLITE_NOMEM; } memset(pCsr, 0, sizeof(Fts3Cursor)); return SQLITE_OK; } | < < < < < < | 808 809 810 811 812 813 814 815 816 817 818 819 820 821 | *ppCsr = pCsr = (sqlite3_vtab_cursor *)sqlite3_malloc(sizeof(Fts3Cursor)); if( !pCsr ){ return SQLITE_NOMEM; } memset(pCsr, 0, sizeof(Fts3Cursor)); return SQLITE_OK; } /* ** Close the cursor. For additional information see the documentation ** on the xClose method of the virtual table interface. */ static int fulltextClose(sqlite3_vtab_cursor *pCursor){ Fts3Cursor *pCsr = (Fts3Cursor *)pCursor; |
︙ | ︙ |