SQLite: Check-in [381564e91b]

Overview

Comment:	Add wiki documentation files for the spellfix1 virtual table.
SHA1:	381564e91bbf619f99a48b0b7a94ac586cb9ee79
User & Date:	drh 2013-04-25 17:07:26.477

Context

2013-04-25
17:27		Fix the tool/build-shell.sh script to remove references to files that are now loadable extensions. (check-in: aabeea98f5 user: drh tags: trunk)
17:07		Add wiki documentation files for the spellfix1 virtual table. (check-in: 381564e91b user: drh tags: trunk)
16:52		Merge the std-ext branch into trunk. This merge adds several new extensions to the ext/misc folder, including transitive_closure, ieee754, and amatch, and it converts some older src/test_*.c file into extensions in the ext/misc folder. (check-in: bbe607c7d1 user: drh tags: trunk)

Changes

Added ext/misc/editdist3.wiki.

Added ext/misc/spellfix1.wiki.

<title>The editdist3 algorithm</title>

The editdist3 algorithm is a function that computes the minimum edit
distance (a.k.a. the Levenshtein distance) between two input strings.
Features of editdist3 include:

   *   It works with Unicode (UTF-8) text.

   *   A table of insertion, deletion, and substitution costs can be
       provided by the application.

   *   Multi-character insertions, deletions, and substitutions can be
       enumerated in the cost table.

<h2>The COST table</h2>

To program the costs of editdist3, create a table such as the following:

<blockquote><pre>
CREATE TABLE editcost(
  iLang INT,   -- The language ID
  cFrom TEXT,  -- Convert text from this
  cTo   TEXT,  -- Convert text into this
  iCost INT    -- The cost of doing the conversion
);
</pre></blockquote>

The cost table can be named anything you want; it does not have to be
called "editcost". The table can also contain additional columns.
However, the table must contain the four columns shown above, with
exactly the names shown.

The iLang column is a non-negative integer that identifies a set of costs
appropriate for a particular language. The editdist3 function will only use
a single iLang value for any given edit-distance computation. The default
value is 0. It is recommended that applications that only need to use a
single language always use iLang==0 for all entries.

The iCost column is the numeric cost of transforming cFrom into cTo.
This value should be a non-negative integer, and should probably be less
than 100. The default single-character insertion and deletion costs are 100
and the default single-character to single-character substitution cost is 150.
A cost of 10000 or more is considered "infinite" and causes the rule to be
ignored.

The cFrom and cTo columns show edit transformation strings. Either or both
columns may contain more than one character. Or either column (but not both)
may hold an empty string. When cFrom is empty, iCost is the cost of inserting
cTo. When cTo is empty, iCost is the cost of deleting cFrom.

In the spellfix1 algorithm, cFrom is the text as the user entered it and
cTo is the correctly spelled text as it exists in the database. The goal
of the editdist3 algorithm is to determine how close the user-entered text is
to the dictionary text.

There are three special-case entries in the cost table:

<table border=1>
<tr><th>cFrom</th><th>cTo</th><th>Meaning</th></tr>
<tr><td>''</td><td>'?'</td><td>The default insertion cost</td></tr>
<tr><td>'?'</td><td>''</td><td>The default deletion cost</td></tr>
<tr><td>'?'</td><td>'?'</td><td>The default substitution cost</td></tr>
</table>

If any of the special-case entries shown above are omitted, then the
value of 100 is used for insertion and deletion and 150 is used for
substitution. To disable the default insertion, deletion, and/or substitution,
set their respective cost to 10000 or more.

Other entries in the cost table specify transforms for particular characters.
The cost of specific transforms should be less than the default costs, or else
the default costs will take precedence and the specific transforms will never
be used.

Some example cost table entries:

<blockquote><pre>
INSERT INTO editcost(iLang, cFrom, cTo, iCost) VALUES(0, 'a', 'ä', 5);
</pre></blockquote>

The rule above says that the letter "a" in user input can be matched against
the letter "ä" in the dictionary with a penalty of 5.

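
The special-case default entries, and rules with an empty cFrom or cTo, are
ordinary rows and can be written the same way. The following is a minimal
sketch, assuming the cost table is named editcost as in the earlier example.
The letters 'h' and 'e' and the cost of 40 are illustrative assumptions, not
recommended settings.

<blockquote><pre>
-- Assumes the editcost table defined earlier. It is created here only so
-- that the sketch is self-contained.
CREATE TABLE IF NOT EXISTS editcost(
  iLang INT, cFrom TEXT, cTo TEXT, iCost INT
);

-- State the three defaults explicitly, using the documented default values.
INSERT INTO editcost(iLang, cFrom, cTo, iCost) VALUES(0, '', '?', 100);
INSERT INTO editcost(iLang, cFrom, cTo, iCost) VALUES(0, '?', '', 100);
INSERT INTO editcost(iLang, cFrom, cTo, iCost) VALUES(0, '?', '?', 150);

-- Empty cFrom: the cost of inserting an "h" (illustrative value).
INSERT INTO editcost(iLang, cFrom, cTo, iCost) VALUES(0, '', 'h', 40);

-- Empty cTo: the cost of deleting an "e" (illustrative value).
INSERT INTO editcost(iLang, cFrom, cTo, iCost) VALUES(0, 'e', '', 40);
</pre></blockquote>

Both specific rules cost less than the default of 100, so they are not
shadowed by the default insertion and deletion costs.
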
<blockquote><pre>
INSERT INTO editcost(iLang, cFrom, cTo, iCost) VALUES(0, 'ss', 'ß', 8);
</pre></blockquote>

The number of characters in cFrom and cTo does not need to be the same. The
rule above says that "ss" on user input will match "ß" with a penalty of 8.

<h2>Experimenting with the editdist3() function</h2>

The [./spellfix1.wiki | spellfix1 virtual table] uses editdist3 if the
"edit_cost_table=TABLE" option is specified as an argument when the spellfix1
virtual table is created. But editdist3 can also be tested directly using the
built-in "editdist3()" SQL function. The editdist3() SQL function has 3 forms:

  1. editdist3('TABLENAME');
  2. editdist3('string1', 'string2');
  3. editdist3('string1', 'string2', langid);

The first form loads the edit distance coefficients from a table called
'TABLENAME'. Any prior coefficients are discarded. So when experimenting
with weights and the weight table changes, simply rerun the single-argument
form of editdist3() to reload the revised coefficients. Note that the
edit distance weights used by the editdist3() SQL function are independent
from the weights used by the spellfix1 virtual table.

The second and third forms return the computed edit distance between strings
'string1' and 'string2'. In the second form, a language id of 0 is used.
The language id is specified in the third form.
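
A session that exercises all three forms might look like the following. This
is a sketch only. It assumes that the extension providing editdist3() is
already loaded into the database connection and that the cost table is named
editcost. The sample words, the use of language id 1, and the revised cost of
10 are made up for illustration.

<blockquote><pre>
CREATE TABLE IF NOT EXISTS editcost(
  iLang INT, cFrom TEXT, cTo TEXT, iCost INT
);
INSERT INTO editcost(iLang, cFrom, cTo, iCost) VALUES(0, 'a', 'ä', 5);
INSERT INTO editcost(iLang, cFrom, cTo, iCost) VALUES(1, 'ss', 'ß', 8);

-- Form 1: load the coefficients from the editcost table.
SELECT editdist3('editcost');

-- Form 2: edit distance using language id 0.
SELECT editdist3('Bar', 'Bär');

-- Form 3: edit distance using an explicit language id (1 here).
SELECT editdist3('Strasse', 'Straße', 1);

-- After changing a weight, rerun form 1 so that the revised value is used.
UPDATE editcost SET iCost = 10
 WHERE iLang = 0 AND cFrom = 'a' AND cTo = 'ä';
SELECT editdist3('editcost');
SELECT editdist3('Bar', 'Bär');
</pre></blockquote>

Because the weights loaded this way are independent of any spellfix1 virtual
table, this kind of session only affects direct calls to editdist3().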