The editdist3 algorithm

ADDED   ext/misc/editdist3.wiki
Index: ext/misc/editdist3.wiki
==================================================================
--- /dev/null
+++ ext/misc/editdist3.wiki
@@ -0,0 +1,114 @@
+<title>The editdist3 algorithm</title>
+
+The editdist3 algorithm is a function that computes the minimum edit distance
+(a.k.a. the Levenshtein distance) between two input strings.  Features of
+editdist3 include:
+
+   *   It works with Unicode (UTF-8) text.
+
+   *   A table of insertion, deletion, and substitution costs can be 
+       provided by the application.
+
+   *   Multi-character insertions, deletions, and substitutions can be
+       enumerated in the cost table.
+
+<h2>The COST table</h2>
+
+To program the costs of editdist3, create a table such as the following:
+
+<blockquote><pre>
+CREATE TABLE editcost(
+  iLang INT,   -- The language ID
+  cFrom TEXT,  -- Convert text from this
+  cTo   TEXT,  -- Convert text into this
+  iCost INT    -- The cost of doing the conversion
+);
+</pre></blockquote>
+
+The cost table can be named anything you want - it does not have to be called
+"editcost".  And the table can contain additional columns.  However, the
+table must contain the four columns shown above, with exactly the names shown.
+
+The iLang column is a non-negative integer that identifies a set of costs
+appropriate for a particular language.  The editdist3 function will only use
+a single iLang value for any given edit-distance computation.  The default
+value is 0.  It is recommended that applications that only need to use a
+single language always use iLang==0 for all entries.
+
+The iCost column is the numeric cost of transforming cFrom into cTo.  This
+value should be a non-negative integer, and should probably be less than 100.
+The default single-character insertion and deletion costs are 100 and the
+default single-character to single-character substitution cost is 150.  A
+cost of 10000 or more is considered "infinite" and causes the rule to be
+ignored.
+
+The cFrom and cTo columns show edit transformation strings.  Either or both
+columns may contain more than one character.  Or either column (but not both)
+may hold an empty string.  When cFrom is empty, that is the cost of inserting
+cTo.  When cTo is empty, that is the cost of deleting cFrom.
+
+In the spellfix1 algorithm, cFrom is the text as the user entered it and
+cTo is the correctly spelled text as it exists in the database.  The goal
+of the editdist3 algorithm is to determine how close the user-entered text is
+to the dictionary text.
+
+There are three special-case entries in the cost table:
+
+<table border=1>
+<tr><th>cFrom</th><th>cTo</th><th>Meaning</th></tr>
+<tr><td>''</td><td>'?'</td><td>The default insertion cost</td></tr>
+<tr><td>'?'</td><td>''</td><td>The default deletion cost</td></tr>
+<tr><td>'?'</td><td>'?'</td><td>The default substitution cost</td></tr>
+</table>
+
+If any of the special-case entries shown above are omitted, then the
+value of 100 is used for insertion and deletion and 150 is used for
+substitution.  To disable the default insertion, deletion, and/or substitution,
+set their respective costs to 10000 or more.
+
+Other entries in the cost table specify transforms for particular characters.
+The cost of specific transforms should be less than the default costs, or else
+the default costs will take precedence and the specific transforms will never 
+be used.
+
+Some example cost table entries:
+
+<blockquote><pre>
+INSERT INTO editcost(iLang, cFrom, cTo, iCost)
+VALUES(0, 'a', 'ä', 5);
+</pre></blockquote>
+
+The rule above says that the letter "a" in user input can be matched against
+the letter "ä" in the dictionary with a penalty of 5.
+
+<blockquote><pre>
+INSERT INTO editcost(iLang, cFrom, cTo, iCost)
+VALUES(0, 'ss', 'ß', 8);
+</pre></blockquote>
+
+The number of characters in cFrom and cTo does not need to be the same.  The
+rule above says that "ss" on user input will match "ß" with a penalty of 8.
+
+<h2>Experimenting with the editdist3() function</h2>
+
+The [./spellfix1.wiki | spellfix1 virtual table]
+uses editdist3 if the "edit_cost_table=TABLE" option
+is specified as an argument when the spellfix1 virtual table is created.  
+But editdist3 can also be tested directly using the built-in "editdist3()"
+SQL function.  The editdist3() SQL function has 3 forms:
+
+  1.  editdist3('TABLENAME');
+  2.  editdist3('string1', 'string2');
+  3.  editdist3('string1', 'string2', langid);
+
+The first form loads the edit distance coefficients from a table called
+'TABLENAME'.  Any prior coefficients are discarded.  So when experimenting
+with weights and the weight table changes, simply rerun the single-argument
+form of editdist3() to reload revised coefficients.  Note that the 
+edit distance
+weights used by the editdist3() SQL function are independent from the
+weights used by the spellfix1 virtual table.
+
+The second and third forms return the computed edit distance between strings
+'string1' and 'string2'.  In the second form, a language id of 0 is used.
+The language id is specified in the third form.
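As a rough illustration of how the cost table described in editdist3.wiki shapes the result, the Python sketch below computes a weighted edit distance that honors multi-character rules. It is not the editdist3 implementation. The function name `weighted_edit_distance` and its parameters are hypothetical; the default costs (100, 100, 150) and the 10000 "infinite" cutoff are taken from the text, and the special '?' default-cost rows are modeled as keyword arguments instead of being read from a table.

```python
def weighted_edit_distance(src, dst, rules=(),
                           ins_cost=100, del_cost=100, sub_cost=150):
    """Cheapest cost of turning src into dst (hypothetical helper).

    rules is an iterable of (cFrom, cTo, iCost) tuples, mirroring the
    cost-table columns.  Costs of 10000 or more are treated as disabled.
    """
    INFINITE = 10000
    unreachable = float("inf")
    n, m = len(src), len(dst)
    # best[i][j] = cheapest known cost of converting src[:i] into dst[:j]
    best = [[unreachable] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0

    def relax(i, j, cost):
        if cost < best[i][j]:
            best[i][j] = cost

    for i in range(n + 1):
        for j in range(m + 1):
            cur = best[i][j]
            if cur == unreachable:
                continue
            if i < n and j < m:
                if src[i] == dst[j]:
                    relax(i + 1, j + 1, cur)             # exact match is free
                if sub_cost < INFINITE:
                    relax(i + 1, j + 1, cur + sub_cost)  # default substitution
            if i < n and del_cost < INFINITE:
                relax(i + 1, j, cur + del_cost)          # default deletion
            if j < m and ins_cost < INFINITE:
                relax(i, j + 1, cur + ins_cost)          # default insertion
            for c_from, c_to, cost in rules:
                # Skip "infinite" rules and the disallowed empty-to-empty rule.
                if cost >= INFINITE or (not c_from and not c_to):
                    continue
                if src.startswith(c_from, i) and dst.startswith(c_to, j):
                    relax(i + len(c_from), j + len(c_to), cur + cost)
    return best[n][m]


# The two example rules from the text: (cFrom, cTo, iCost)
example_rules = [("a", "ä", 5), ("ss", "ß", 8)]
print(weighted_edit_distance("strasse", "straße", example_rules))  # 8
print(weighted_edit_distance("kat", "cat"))                        # 150
```

With the "ss" to "ß" rule loaded, the only charged step is the 8-point rule, whereas an input with no applicable rule falls back to the default substitution cost.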

ADDED   ext/misc/spellfix1.wiki
Index: ext/misc/spellfix1.wiki
==================================================================
--- /dev/null
+++ ext/misc/spellfix1.wiki
@@ -0,0 +1,464 @@
+<title>The Spellfix1 Virtual Table</title>
+
+The spellfix1 virtual table is used to search
+a large vocabulary for close matches.  For example, spellfix1
+can be used to suggest corrections to misspelled words.  Or,
+it could be used with FTS4 to do full-text search using potentially
+misspelled words.
+
+Create an instance of the spellfix1 virtual table like this:
+
+<blockquote><pre>
+CREATE VIRTUAL TABLE demo USING spellfix1;
+</pre></blockquote>
+
+The "spellfix1" term is the name of this module and must be entered as
+shown.  The "demo" term is the
+name of the virtual table you will be creating and can be altered
+to suit the needs of your application.  The virtual table is initially
+empty.  In order for the virtual table to be useful, you will need to
+populate it with your vocabulary.  Suppose you
+have a list of words in a table named "big_vocabulary".  Then do this:
+
+<blockquote><pre>
+INSERT INTO demo(word) SELECT word FROM big_vocabulary;
+</pre></blockquote>
+
+If you intend to use this virtual table in cooperation with an FTS4
+table (for spelling correction of search terms) then you might extract
+the vocabulary using an fts4aux table:
+
+<blockquote><pre>
+INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';
+</pre></blockquote>
+
+You can also provide the virtual table with a "rank" for each word.
+The "rank" is an estimate of how common the word is.  Larger numbers
+mean the word is more common.  If you omit the rank when populating
+the table, then a rank of 1 is assumed.  But if you have rank 
+information, you can supply it and the virtual table will show a
+slight preference for selecting more commonly used terms.  To
+populate the rank from an fts4aux table "search_aux" do something
+like this:
+
+<blockquote><pre>
+INSERT INTO demo(word,rank)
+   SELECT term, documents FROM search_aux WHERE col='*';
+</pre></blockquote>
+
+To query the virtual table, include a MATCH operator in the WHERE
+clause.  For example:
+
+<blockquote><pre>
+SELECT word FROM demo WHERE word MATCH 'kennasaw';
+</pre></blockquote>
+
+Using a dataset of American place names (derived from
+[http://geonames.usgs.gov/domestic/download_data.htm]) the query above
+returns 20 results beginning with:
+
+<blockquote><pre>
+kennesaw
+kenosha
+kenesaw
+kenaga
+keanak
+</pre></blockquote>
+
+If you append the character '*' to the end of the pattern, then
+a prefix search is performed.  For example:
+
+<blockquote><pre>
+SELECT word FROM demo WHERE word MATCH 'kennes*';
+</pre></blockquote>
+
+Yields 20 results beginning with:
+
+<blockquote><pre>
+kennesaw
+kennestone
+kenneson
+kenneys
+keanes
+keenes
+</pre></blockquote>
+
+<h2>Search Refinements</h2>
+
+By default, the spellfix1 table returns no more than 20 results.
+(It might return fewer than 20 if there were fewer good matches.)
+You can change the upper bound on the number of returned rows by
+adding a "top=N" term to the WHERE clause of your query, where N
+is the new maximum.  For example, to see the 5 best matches:
+
+<blockquote><pre>
+SELECT word FROM demo WHERE word MATCH 'kennes*' AND top=5;
+</pre></blockquote>
+
+Each entry in the spellfix1 virtual table is associated with a
+particular language, identified by the integer "langid" column.
+The default langid is 0 and if no other actions are taken, the
+entire vocabulary is a part of the 0 language.  But if your application
+needs to operate in multiple languages, then you can specify different
+vocabulary items for each language by specifying the langid field
+when populating the table.  For example:
+
+<blockquote><pre>
+INSERT INTO demo(word,langid) SELECT word, 0 FROM en_vocabulary;
+INSERT INTO demo(word,langid) SELECT word, 1 FROM de_vocabulary;
+INSERT INTO demo(word,langid) SELECT word, 2 FROM fr_vocabulary;
+INSERT INTO demo(word,langid) SELECT word, 3 FROM ru_vocabulary;
+INSERT INTO demo(word,langid) SELECT word, 4 FROM cn_vocabulary;
+</pre></blockquote>
+
+After the virtual table has been populated with items from multiple
+languages, specify the language of interest using a "langid=N" term
+in the WHERE clause of the query:
+
+<blockquote><pre>
+SELECT word FROM demo WHERE word MATCH 'hildes*' AND langid=1;
+</pre></blockquote>
+
+Note that if you do not include the "langid=N" term in the WHERE clause,
+the search will be against language 0 (English in the example above).
+All spellfix1 searches are against a single language id.  There is no
+way to search all languages at once.
+ 
+
+<h2>Virtual Table Details</h2>
+
+The virtual table actually has a unique rowid with seven columns plus five
+extra hidden columns.  The columns are as follows:
+
+<blockquote><dl>
+<dt><b>rowid</b><dd>
+A unique integer number associated with each
+vocabulary item in the table.  This can be used
+as a foreign key on other tables in the database.
+
+<dt><b>word</b><dd>
+The text of the word that matches the pattern.
+Both word and pattern can contain Unicode characters
+and can be mixed case.
+
+<dt><b>rank</b><dd>
+This is the rank of the word, as specified in the
+original INSERT statement.
+
+
+<dt><b>distance</b><dd>
+This is an edit distance or Levenshtein distance going
+from the pattern to the word.
+
+<dt><b>langid</b><dd>
+This is the language-id of the word.  All queries are
+against a single language-id, which defaults to 0.
+For any given query this value is the same on all rows.
+
+<dt><b>score</b><dd>
+The score is a combination of rank and distance.  The
+idea is that a lower score is better.  The virtual table
+attempts to find words with the lowest score and 
+by default (unless overridden by ORDER BY) returns
+results in order of increasing score.
+
+<dt><b>matchlen</b><dd>
+In a prefix search, the matchlen is the number of characters in
+the string that match against the prefix.  For a non-prefix search,
+this is the same as length(word).
+
+<dt><b>phonehash</b><dd>
+This column shows the phonetic hash prefix that was used to restrict
+the search.  For any given query, this column should be the same for
+every row.  This information is available for diagnostic purposes and
+is not normally considered useful in real applications.
+
+<dt><b>top</b><dd>
+(HIDDEN)  For any query, this value is the same on all
+rows.  It is an integer which is the maximum number of
+rows that will be output.  The actual number of rows
+output might be less than this number, but it will never
+be greater.  The default value for top is 20, but that
+can be changed for each query by including a term of
+the form "top=N" in the WHERE clause of the query.
+
+<dt><b>scope</b><dd>
+(HIDDEN)  For any query, this value is the same on all
+rows.  The scope is a measure of how widely the virtual
+table looks for matching words.  Smaller values of
+scope cause a broader search.  The scope is normally
+chosen automatically and is capped at 4.  Applications
+can change the scope by including a term of the form
+"scope=N" in the WHERE clause of the query.  Increasing
+the scope will make the query run faster, but will reduce
+the possible corrections.
+
+<dt><b>srchcnt</b><dd>
+(HIDDEN)  For any query, this value is the same on all
+rows.  This value is an integer which is the number of
+words examined using the edit-distance algorithm to
+find the top matches that are ultimately displayed.  This
+value is for diagnostic use only.
+
+<dt><b>soundslike</b><dd>
+(HIDDEN)  When inserting vocabulary entries, this field
+can be set to a spelling that matches what the word
+sounds like.  See the DEALING WITH UNUSUAL AND DIFFICULT
+SPELLINGS section below for details.
+
+<dt><b>command</b><dd>
+(HIDDEN)  The value of the "command" column is always NULL.  However,
+applications can insert special strings into the "command" column in order
+to provoke certain behaviors in the spellfix1 virtual table.
+For example, inserting the string 'reset' into the "command" column
+will cause the virtual table to reread its edit distance weights
+(if there are any).
+</dl></blockquote>
+
+<h2>Algorithm</h2>
+
+The spellfix1 virtual table creates a single
+shadow table named "%_vocab" (where the % is replaced by the name of
+the virtual table; Ex: "demo_vocab" for the "demo" virtual table).  
+The shadow table contains the following columns:
+
+<blockquote><dl>
+<dt><b>id</b><dd>
+The unique id (INTEGER PRIMARY KEY)
+
+<dt><b>rank</b><dd>
+The rank of word.
+
+<dt><b>langid</b><dd>
+The language id for this entry.
+
+<dt><b>word</b><dd>
+The original UTF8 text of the vocabulary word
+
+<dt><b>k1</b><dd>
+The word transliterated into lower-case ASCII.  
+There is a standard table of mappings from non-ASCII
+characters into ASCII.  Examples: "æ" -> "ae",
+"þ" -> "th", "ß" -> "ss", "á" -> "a", ...  The
+accessory function spellfix1_translit(X) will do
+the non-ASCII to ASCII mapping.  The built-in lower(X)
+function will convert to lower-case.  Thus:
+k1 = lower(spellfix1_translit(word)).
+
+<dt><b>k2</b><dd>
+This field holds a phonetic code derived from k1.  Letters
+that have similar sounds are mapped into the same symbol.
+For example, all vowels and vowel clusters become the
+single symbol "A".  And the letters "p", "b", "f", and
+"v" all become "B".  All nasal sounds are represented
+as "N".  And so forth.  The mapping is based on
+ideas found in Soundex, Metaphone, and other
+long-standing phonetic matching systems.  This key can
+be generated by the function spellfix1_phonehash(X).  
+Hence: k2 = spellfix1_phonehash(k1)
+</dl></blockquote>
+
+There is also a function for computing the Wagner edit distance or the
+Levenshtein distance between a pattern and a word.  This function
+is exposed as spellfix1_editdist(X,Y).  The edit distance function
+returns the "cost" of converting X into Y.  Some transformations
+cost more than others.  Changing one vowel into a different vowel,
+for example, is relatively cheap, as is doubling a consonant, or
+omitting the second character of a double consonant.  Other transformations
+are more expensive.  The idea is that the edit distance function returns
+a low cost for words that are similar and a higher cost for words
+that are further apart.  In this implementation, the maximum cost
+of any single-character edit (delete, insert, or substitute) is 100,
+with lower costs for some edits (such as transforming vowels).
+
+The "score" for a comparison is the edit distance between the pattern
+and the word, adjusted down by the base-2 logarithm of the word rank.
+For example, a match with distance 100 but rank 1000 would have a
+score of 122 (= 100 - log2(1000) + 32) whereas a match with distance
+100 with a rank of 1 would have a score of 131 (100 - log2(1) + 32).
+(NB:  The constant 32 is added to each score to keep it from going
+negative in case the edit distance is zero.)  In this way, frequently
+used words get a slightly lower cost which tends to move them toward
+the top of the list of alternative spellings.
+
+A straightforward implementation of a spelling corrector would be
+to compare the search term against every word in the vocabulary
+and select the 20 with the lowest scores.  However, there will 
+typically be hundreds of thousands or millions of words in the
+vocabulary, and so this approach is not fast enough.
+
+Suppose the term that is being spell-corrected is X.  To limit
+the search space, X is converted to a k2-like key using the
+equivalent of:
+
+<blockquote><pre>
+   key = spellfix1_phonehash(lower(spellfix1_translit(X)))
+</pre></blockquote>
+
+This key is then limited to "scope" characters.  The default scope
+value is 4, but an alternative scope can be specified using the
+"scope=N" term in the WHERE clause.  After the key has been truncated,
+the edit distance is run against every term in the vocabulary that
+has a k2 value that begins with the abbreviated key.
+
+For example, suppose the input word is "Paskagula".  The phonetic 
+key is "BACACALA" which is then truncated to 4 characters "BACA".
+The edit distance is then run on the 4980 entries (out of
+272,597 entries total) of the vocabulary whose k2 values begin with
+BACA, yielding "Pascagoula" as the best match.
+
+Only terms of the vocabulary with a matching langid are searched.
+Hence, the same table can contain entries from multiple languages
+and only the requested language will be used.  The default langid
+is 0.
+
+<h2>Configurable Edit Distance</h2>
+
+The built-in Wagner edit-distance function with fixed weights can be
+replaced by the [./editdist3.wiki | editdist3()] edit-distance function
+with application-defined weights and support for unicode, by specifying
+the "edit_cost_table=<i>TABLENAME</i>" parameter to the spellfix1 module
+when the virtual table is created.
+For example:
+
+<blockquote><pre>
+CREATE VIRTUAL TABLE demo2 USING spellfix1(edit_cost_table=APPCOST);
+</pre></blockquote>
+
+In the example above, the APPCOST table would be interrogated to find
+the edit distance coefficients.  It is the presence of the "edit_cost_table="
+parameter to the spellfix1 module name that causes editdist3() to be used
+in place of the built-in edit distance function.
+
+The edit distance coefficients are normally read from the APPCOST table
+once and thereafter stored in memory.  Hence, run-time changes to the
+APPCOST table will not normally affect the edit distance results.
+However, inserting the special string 'reset' into the "command" column of the
+virtual table causes the edit distance coefficients to be reread from the
+APPCOST table.  Hence, applications should run a SQL statement similar
+to the following when changes to the APPCOST table occur:
+
+<blockquote><pre>
+INSERT INTO demo2(command) VALUES('reset');
+</pre></blockquote>
+
+The tables used for edit distance costs can be changed using a command
+like the following:
+
+<blockquote><pre>
+INSERT INTO demo2(command) VALUES('edit_cost_table=APPCOST2');
+</pre></blockquote>
+
+In the example above, any prior edit distance costs would be discarded and
+all future queries would use the costs found in the APPCOST2 table.  If the
+name of the table specified by the "edit_cost_table" command is "NULL", then
+the built-in Wagner edit-distance function will be used instead of the
+editdist3() function in all future queries.
+
+<h2>Dealing With Unusual And Difficult Spellings</h2>
+
+The algorithm above works quite well for most cases, but there are
+exceptions.  These exceptions can be dealt with by making additional
+entries in the virtual table using the "soundslike" column.
+
+For example, many words of Greek origin begin with letters "ps" where
+the "p" is silent.  Ex:  psalm, pseudonym, psoriasis, psyche.  In
+another example, many Scottish surnames can be spelled with an
+initial "Mac" or "Mc".  Thus, "MacKay" and "McKay" are both pronounced
+the same.
+
+Accommodation can be made for words that are not spelled as they
+sound by making additional entries into the virtual table for the
+same word, but adding an alternative spelling in the "soundslike"
+column.  For example, the canonical entry for "psalm" would be this:
+
+<blockquote><pre>
+  INSERT INTO demo(word) VALUES('psalm');
+</pre></blockquote>
+
+To enhance the ability to correct the spelling of "salm" into
+"psalm", make an additional entry like this:
+
+<blockquote><pre>
+  INSERT INTO demo(word,soundslike) VALUES('psalm','salm');
+</pre></blockquote>
+
+It is ok to make multiple entries for the same word as long as
+each entry has a different soundslike value.  Note that if no
+soundslike value is specified, the soundslike defaults to the word
+itself.
+
+Listed below are some cases where it might make sense to add additional
+soundslike entries.  The specific entries will depend on the application
+and the target language.
+
+  *   Silent "p" in words beginning with "ps":  psalm, psyche
+
+  *   Silent "p" in words beginning with "pn":  pneumonia, pneumatic
+
+  *   Silent "p" in words beginning with "pt":  pterodactyl, ptolemaic
+
+  *   Silent "d" in words beginning with "dj":  djinn, Djakarta
+
+  *   Silent "k" in words beginning with "kn":  knight, Knuthson
+
+  *   Silent "g" in words beginning with "gn":  gnarly, gnome, gnat
+
+  *   "Mac" versus "Mc" beginning Scottish surnames
+
+  *   "Tch" sounds in Slavic words:  Tchaikovsky vs. Chaykovsky
+
+  *   The letter "j" pronounced like "h" in Spanish:  LaJolla
+
+  *   Words beginning with "wr" versus "r":  write vs. rite
+
+  *   Miscellaneous problem words such as "debt", "tsetse",
+      "Nguyen", "Van Nuys".
+
+<h2>Auxiliary Functions</h2>
+
+The source code module that implements the spellfix1 virtual table also
+implements several SQL functions that might be useful to applications
+that employ spellfix1 or for testing or diagnostic work while developing
+applications that use spellfix1.  The following auxiliary functions are
+available:
+
+<blockquote><dl>
+<dt><b>editdist3(P,W)<br>editdist3(P,W,L)<br>editdist3(T)</b><dd>
+These routines provide direct access to the version of the Wagner
+edit-distance function that allows for application-defined weights
+on edit operations.  The first two forms of this function compare
+pattern P against word W and return the edit distance.  In the first
+function, the langid is assumed to be 0 and in the second, the
+langid is given by the L parameter.  The third form of this function
+reloads edit distance coefficients from the table named by T.
+
+<dt><b>spellfix1_editdist(P,W)</b><dd>
+This routine provides access to the built-in Wagner edit-distance
+function that uses default, fixed costs.  The value returned is
+the edit distance needed to transform P into W.
+
+<dt><b>spellfix1_phonehash(X)</b><dd>
+This routine constructs a phonetic hash of the pure ASCII input word X
+and returns that hash.  This routine is used internally by spellfix1 in
+order to transform the K1 column of the shadow table into the K2
+column.
+
+<dt><b>spellfix1_scriptcode(X)</b><dd>
+Given an input string X, this routine attempts to determine the dominant
+script of that input and returns the ISO 15924 numeric code for that
+script.  The current implementation understands the following scripts:
+<ul>
+<li> 215 - Latin
+<li> 220 - Cyrillic
+<li> 200 - Greek
+</ul>
+Additional script codes might be added in future releases.
+
+<dt><b>spellfix1_translit(X)</b><dd>
+This routine transliterates Unicode text into pure ASCII, returning
+the pure ASCII representation of the input text X.  This is the function
+that is used internally to transform vocabulary words into the K1
+column of the shadow table.
+
+</dl></blockquote>
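The score arithmetic from the Algorithm section of spellfix1.wiki can be checked with a small Python sketch. The helper name `spellfix_score` is hypothetical. The sketch also rests on an assumption: it takes "log2" to mean the number of significant bits in the rank, because that integer approximation reproduces both figures quoted in the text (122 and 131), whereas an exact log2(1) = 0 would give 132 for the second case.

```python
def spellfix_score(distance: int, rank: int) -> int:
    """Combine edit distance and rank into a score (lower is better).

    Assumption: the rank adjustment is the bit length of rank, an
    integer stand-in for log2 chosen to match the numbers in the text.
    """
    log2_rank = max(rank, 0).bit_length()
    # The constant 32 keeps the score from going negative.
    return distance - log2_rank + 32


print(spellfix_score(100, 1000))  # 122
print(spellfix_score(100, 1))     # 131
```

Because the adjustment grows only with the logarithm of the rank, a word has to be far more common before it outranks a candidate with a smaller edit distance.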