/ Check-in [381564e9]
Login
SQLite training in Houston TX on 2019-11-05 (details)
Part of the 2019 Tcl Conference

Many hyperlinks are disabled.
Use anonymous login to enable hyperlinks.

Overview
Comment:Add wiki documentation files for the spellfix1 virtual table.
Downloads: Tarball | ZIP archive | SQL archive
Timelines: family | ancestors | descendants | both | trunk
Files: files | file ages | folders
SHA1: 381564e91bbf619f99a48b0b7a94ac586cb9ee79
User & Date: drh 2013-04-25 17:07:26
Context
2013-04-25
17:27
Fix the tool/build-shell.sh script to remove references to files that are now loadable extensions. check-in: aabeea98 user: drh tags: trunk
17:07
Add wiki documentation files for the spellfix1 virtual table. check-in: 381564e9 user: drh tags: trunk
16:52
Merge the std-ext branch into trunk. This merge adds several new extensions to the ext/misc folder, including transitive_closure, ieee754, and amatch, and it converts some older src/test_*.c file into extensions in the ext/misc folder. check-in: bbe607c7 user: drh tags: trunk
Changes
Hide Diffs Side-by-Side Diffs Ignore Whitespace Patch

Added ext/misc/editdist3.wiki.

            1  +<title>The editdist3 algorithm</title>
            2  +
            3  +The editdist3 algorithm is a function that computes the minimum edit distance
            4  +(a.k.a. the Levenshtein distance) between two input strings.  Features of
            5  +editdist3 include:
            6  +
            7  +   *   It works with unicode (UTF8) text.
            8  +
            9  +   *   A table of insertion, deletion, and substitution costs can be 
           10  +       provided by the application.
           11  +
           12  +   *   Multi-character insertsions, deletions, and substitutions can be
           13  +       enumerated in the cost table.
           14  +
           15  +<h2>The COST table</h2>
           16  +
           17  +To program the costs of editdist3, create a table such as the following:
           18  +
           19  +<blockquote><pre>
           20  +CREATE TABLE editcost(
           21  +  iLang INT,   -- The language ID
           22  +  cFrom TEXT,  -- Convert text from this
           23  +  cTo   TEXT,  -- Convert text into this
           24  +  iCost INT    -- The cost of doing the conversionnn
           25  +);
           26  +</pre></blockquote>
           27  +
           28  +The cost table can be named anything you want - it does not have to be called
           29  +"editcost".  And the table can contain additional columns.  However, it the
           30  +table must contain the four columns show above, with exactly the names shown.
           31  +
           32  +The iLang column is a non-negative integer that identifies a set of costs
           33  +appropriate for a particular language.  The editdist3 function will only use
           34  +a single iLang value for any given edit-distance computation.  The default
           35  +value is 0.  It is recommended that applications that only need to use a
           36  +single langauge always use iLang==0 for all entries.
           37  +
           38  +The iCost column is the numeric cost of transforming cFrom into cTo.  This
           39  +value should be a non-negative integer, and should probably be less than 100.
           40  +The default single-character insertion and deletion costs are 100 and the
           41  +default single-character to single-character substitution cost is 150.  A
           42  +cost of 10000 or more is considered "infinite" and causes the rule to be
           43  +ignored.
           44  +
           45  +The cFrom and cTo columns show edit transformation strings.  Either or both
           46  +columns may contain more than one character.  Or either column (but not both)
           47  +may hold an empty string.  When cFrom is empty, that is the cost of inserting
           48  +cTo.  When cTo is empty, that is the cost of deleting cFrom.
           49  +
           50  +In the spellfix1 algorithm, cFrom is the text as the user entered it and
           51  +cTo is the correctly spelled text as it exists in the database.  The goal
           52  +of the editdist3 algorithm is to determine how close the user-entered text is
           53  +to the dictionary text.
           54  +
           55  +There are three special-case entries in the cost table:
           56  +
           57  +<table border=1>
           58  +<tr><th>cFrom</th><th>cTo</th><th>Meaning</th></tr>
           59  +<tr><td>''</td><td>'?'</td><td>The default insertion cost</td></tr>
           60  +<tr><td>'?'</td><td>''</td><td>The default deletion cost</td></tr>
           61  +<tr><td>'?'</td><td>'?'</td><td>The default substitution cost</td></tr>
           62  +</table>
           63  +
           64  +If any of the special-case entries shows above are omitted, then the
           65  +value of 100 is used for insertion and deletion and 150 is used for
           66  +substitution.  To disable the default insertion, deletion, and/or substitution
           67  +set their respective cost to 10000 or more.
           68  +
           69  +Other entries in the cost table specific transforms for particular characters.
           70  +The cost of specific transforms should be less than the default costs, or else
           71  +the default costs will take precedence and the specific transforms will never 
           72  +be used.
           73  +
           74  +Some example, cost table entries:
           75  +
           76  +<blockquote><pre>
           77  +INSERT INTO editcost(iLang, cFrom, cTo, iCost)
           78  +VALUES(0, 'a', 'ä', 5);
           79  +</pre></blockquote>
           80  +
           81  +The rule above says that the letter "a" in user input can be matched against
           82  +the letter "ä" in the dictionary with a penalty of 5.
           83  +
           84  +<blockquote><pre>
           85  +INSERT INTO editcost(iLang, cFrom, cTo, iCost)
           86  +VALUES(0, 'ss', 'ß', 8);
           87  +</pre></blockquote>
           88  +
           89  +The number of characters in cFrom and cTo do not need to be the same.  The
           90  +rule above says that "ss" on user input will match "ß" with a penalty of 8.
           91  +
           92  +<h2>Experimenting with the editcost3() function</h2>
           93  +
           94  +The [./spellfix1.wiki | spellfix1 virtual table]
           95  +uses editdist3 if the "edit_cost_table=TABLE" option
           96  +is specified as an argument when the spellfix1 virtual table is created.  
           97  +But editdist3 can also be tested directly using the built-in "editdist3()"
           98  +SQL function.  The editdist3() SQL function has 3 forms:
           99  +
          100  +  1.  editdist3('TABLENAME');
          101  +  2.  editdist3('string1', 'string2');
          102  +  3.  editdist3('string1', 'string2', langid);
          103  +
          104  +The first form loads the edit distance coefficients from a table called
          105  +'TABLENAME'.  Any prior coefficients are discarded.  So when experimenting
          106  +with weights and the weight table changes, simply rerun the single-argument
          107  +form of editdist3() to reload revised coefficients.  Note that the 
          108  +edit distance
          109  +weights used by the editdist3() SQL function are independent from the
          110  +weights used by the spellfix1 virtual table.
          111  +
          112  +The second and third forms return the computed edit distance between strings
          113  +'string1' and "string2'.  In the second form, an language id of 0 is used.
          114  +The language id is specified in the third form.

Added ext/misc/spellfix1.wiki.

            1  +<title>The Spellfix1 Virtual Table</title>
            2  +
            3  +This spellfix1 virtual table is used to search
            4  +a large vocabulary for close matches.  For example, spellfix1
            5  +can be used to suggest corrections to misspelled words.  Or,
            6  +it could be used with FTS4 to do full-text search using potentially
            7  +misspelled words.
            8  +
            9  +Create an instance of the spellfix1 virtual table like this:
           10  +
           11  +<blockquote><pre>
           12  +CREATE VIRTUAL TABLE demo USING spellfix1;
           13  +</pre></blockquote>
           14  +
           15  +The "spellfix1" term is the name of this module and must be entered as
           16  +shown.  The "demo" term is the
           17  +name of the virtual table you will be creating and can be altered
           18  +to suit the needs of your application.  The virtual table is initially
           19  +empty.  In order for the virtual table to be useful, you will need to
           20  +populate it with your vocabulary.  Suppose you
           21  +have a list of words in a table named "big_vocabulary".  Then do this:
           22  +
           23  +<blockquote><pre>
           24  +INSERT INTO demo(word) SELECT word FROM big_vocabulary;
           25  +</pre></blockquote>
           26  +
           27  +If you intend to use this virtual table in cooperation with an FTS4
           28  +table (for spelling correctly of search terms) then you might extract
           29  +the vocabulary using an fts3aux table:
           30  +
           31  +<blockquote><pre>
           32  +INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';
           33  +</pre></blockquote>
           34  +
           35  +You can also provide the virtual table with a "rank" for each word.
           36  +The "rank" is an estimate of how common the word is.  Larger numbers
           37  +mean the word is more common.  If you omit the rank when populating
           38  +the table, then a rank of 1 is assumed.  But if you have rank 
           39  +information, you can supply it and the virtual table will show a
           40  +slight preference for selecting more commonly used terms.  To
           41  +populate the rank from an fts4aux table "search_aux" do something
           42  +like this:
           43  +
           44  +<blockquote><pre>
           45  +INSERT INTO demo(word,rank)
           46  +   SELECT term, documents FROM search_aux WHERE col='*';
           47  +</pre></blockquote>
           48  +
           49  +To query the virtual table, include a MATCH operator in the WHERE
           50  +clause.  For example:
           51  +
           52  +<blockquote><pre>
           53  +SELECT word FROM demo WHERE word MATCH 'kennasaw';
           54  +</pre></blockquote>
           55  +
           56  +Using a dataset of American place names (derived from
           57  +[http://geonames.usgs.gov/domestic/download_data.htm]) the query above
           58  +returns 20 results beginning with:
           59  +
           60  +<blockquote><pre>
           61  +kennesaw
           62  +kenosha
           63  +kenesaw
           64  +kenaga
           65  +keanak
           66  +</pre></blockquote>
           67  +
           68  +If you append the character '*' to the end of the pattern, then
           69  +a prefix search is performed.  For example:
           70  +
           71  +<blockquote><pre>
           72  +SELECT word FROM demo WHERE word MATCH 'kennes*';
           73  +</pre></blockquote>
           74  +
           75  +Yields 20 results beginning with:
           76  +
           77  +<blockquote><pre>
           78  +kennesaw
           79  +kennestone
           80  +kenneson
           81  +kenneys
           82  +keanes
           83  +keenes
           84  +</pre></blockquote>
           85  +
           86  +<h2>Search Refinements</h2>
           87  +
           88  +By default, the spellfix1 table returns no more than 20 results.
           89  +(It might return less than 20 if there were fewer good matches.)
           90  +You can change the upper bound on the number of returned rows by
           91  +adding a "top=N" term to the WHERE clause of your query, where N
           92  +is the new maximum.  For example, to see the 5 best matches:
           93  +
           94  +<blockquote><pre>
           95  +SELECT word FROM demo WHERE word MATCH 'kennes*' AND top=5;
           96  +</pre></blockquote>
           97  +
           98  +Each entry in the spellfix1 virtual table is associated with a
           99  +a particular language, identified by the integer "langid" column.
          100  +The default langid is 0 and if no other actions are taken, the
          101  +entire vocabulary is a part of the 0 language.  But if your application
          102  +needs to operate in multiple languages, then you can specify different
          103  +vocabulary items for each language by specifying the langid field
          104  +when populating the table.  For example:
          105  +
          106  +<blockquote><pre>
          107  +INSERT INTO demo(word,langid) SELECT word, 0 FROM en_vocabulary;
          108  +INSERT INTO demo(word,langid) SELECT word, 1 FROM de_vocabulary;
          109  +INSERT INTO demo(word,langid) SELECT word, 2 FROM fr_vocabulary;
          110  +INSERT INTO demo(word,langid) SELECT word, 3 FROM ru_vocabulary;
          111  +INSERT INTO demo(word,langid) SELECT word, 4 FROM cn_vocabulary;
          112  +</pre></blockquote>
          113  +
          114  +After the virtual table has been populated with items from multiple
          115  +languages, specify the language of interest using a "langid=N" term
          116  +in the WHERE clause of the query:
          117  +
          118  +<blockquote><pre>
          119  +SELECT word FROM demo WHERE word MATCH 'hildes*' AND langid=1;
          120  +</pre></blockquote>
          121  +
          122  +Note that if you do not include the "langid=N" term in the WHERE clause,
          123  +the search will be against language 0 (English in the example above.)
          124  +All spellfix1 searches are against a single language id.  There is no
          125  +way to search all languages at once.
          126  + 
          127  +
          128  +<h2>Virtual Table Details</h2>
          129  +
          130  +The virtual table actually has a unique rowid with seven columns plus five
          131  +extra hidden columns.  The columns are as follows:
          132  +
          133  +<blockquote><dl>
          134  +<dt><b>rowid</b><dd>
          135  +A unique integer number associated with each
          136  +vocabulary item in the table.  This can be used
          137  +as a foreign key on other tables in the database.
          138  +
          139  +<dt><b>word</b><dd>
          140  +The text of the word that matches the pattern.
          141  +Both word and pattern can contains unicode characters
          142  +and can be mixed case.
          143  +
          144  +<dt><b>rank</b><dd>
          145  +This is the rank of the word, as specified in the
          146  +original INSERT statement.
          147  +
          148  +
          149  +<dt><b>distance</b><dd>
          150  +This is an edit distance or Levensthein distance going
          151  +from the pattern to the word.
          152  +
          153  +<dt><b>langid</b><dd>
          154  +This is the language-id of the word.  All queries are
          155  +against a single language-id, which defaults to 0.
          156  +For any given query this value is the same on all rows.
          157  +
          158  +<dt><b>score</b><dd>
          159  +The score is a combination of rank and distance.  The
          160  +idea is that a lower score is better.  The virtual table
          161  +attempts to find words with the lowest score and 
          162  +by default (unless overridden by ORDER BY) returns
          163  +results in order of increasing score.
          164  +
          165  +<dt><b>matchlen</b><dd>
          166  +In a prefix search, the matchlen is the number of characters in
          167  +the string that match against the prefix.  For a non-prefix search,
          168  +this is the same as length(word).
          169  +
          170  +<dt><b>phonehash</b><dd>
          171  +This column shows the phonetic hash prefix that was used to restrict
          172  +the search.  For any given query, this column should be the same for
          173  +every row.  This information is available for diagnostic purposes and
          174  +is not normally considered useful in real applications.
          175  +
          176  +<dt><b>top</b><dd>
          177  +(HIDDEN)  For any query, this value is the same on all
          178  +rows.  It is an integer which is the maximum number of
          179  +rows that will be output.  The actually number of rows
          180  +output might be less than this number, but it will never
          181  +be greater.  The default value for top is 20, but that
          182  +can be changed for each query by including a term of
          183  +the form "top=N" in the WHERE clause of the query.
          184  +
          185  +<dt><b>scope</b><dd>
          186  +(HIDDEN)  For any query, this value is the same on all
          187  +rows.  The scope is a measure of how widely the virtual
          188  +table looks for matching words.  Smaller values of
          189  +scope cause a broader search.  The scope is normally
          190  +choosen automatically and is capped at 4.  Applications
          191  +can change the scope by including a term of the form
          192  +"scope=N" in the WHERE clause of the query.  Increasing
          193  +the scope will make the query run faster, but will reduce
          194  +the possible corrections.
          195  +
          196  +<dt><b>srchcnt</b><dd>
          197  +(HIDDEN)  For any query, this value is the same on all
          198  +rows.  This value is an integer which is the number of
          199  +of words examined using the edit-distance algorithm to
          200  +find the top matches that are ultimately displayed.  This
          201  +value is for diagnostic use only.
          202  +
          203  +<dt><b>soundslike</b><dd>
          204  +(HIDDEN)  When inserting vocabulary entries, this field
          205  +can be set to an spelling that matches what the word
          206  +sounds like.  See the DEALING WITH UNUSUAL AND DIFFICULT
          207  +SPELLINGS section below for details.
          208  +
          209  +<dt><b>command</b><dd>
          210  +(HIDDEN)  The value of the "command" column is always NULL.  However,
          211  +applications can insert special strings into the "command" column in order
          212  +to provoke certain behaviors in the spellfix1 virtual table.
          213  +For example, inserting the string 'reset' into the "command" column
          214  +will cause the virtual table will reread its edit distance weights
          215  +(if there are any).
          216  +</dl></blockquote>
          217  +
          218  +<h2>Algorithm</h2>
          219  +
          220  +The spellfix1 virtual table creates a single
          221  +shadow table named "%_vocab" (where the % is replaced by the name of
          222  +the virtual table; Ex: "demo_vocab" for the "demo" virtual table).  
          223  +the shadow table contains the following columns:
          224  +
          225  +<blockquote><dl>
          226  +<dt><b>id</b><dd>
          227  +The unique id (INTEGER PRIMARY KEY)
          228  +
          229  +<dt><b>rank</b><dd>
          230  +The rank of word.
          231  +
          232  +<dt><b>langid</b><dd>
          233  +The language id for this entry.
          234  +
          235  +<dt><b>word</b><dd>
          236  +The original UTF8 text of the vocabulary word
          237  +
          238  +<dt><b>k1</b><dd>
          239  +The word transliterated into lower-case ASCII.  
          240  +There is a standard table of mappings from non-ASCII
          241  +characters into ASCII.  Examples: "æ" -> "ae",
          242  +"þ" -> "th", "ß" -> "ss", "á" -> "a", ...  The
          243  +accessory function spellfix1_translit(X) will do
          244  +the non-ASCII to ASCII mapping.  The built-in lower(X)
          245  +function will convert to lower-case.  Thus:
          246  +k1 = lower(spellfix1_translit(word)).
          247  +
          248  +<dt><b>k2</b><dd>
          249  +This field holds a phonetic code derived from k1.  Letters
          250  +that have similar sounds are mapped into the same symbol.
          251  +For example, all vowels and vowel clusters become the
          252  +single symbol "A".  And the letters "p", "b", "f", and
          253  +"v" all become "B".  All nasal sounds are represented
          254  +as "N".  And so forth.  The mapping is base on
          255  +ideas found in Soundex, Metaphone, and other
          256  +long-standing phonetic matching systems.  This key can
          257  +be generated by the function spellfix1_phonehash(X).  
          258  +Hence: k2 = spellfix1_phonehash(k1)
          259  +</dl></blockquote>
          260  +
          261  +There is also a function for computing the Wagner edit distance or the
          262  +Levenshtein distance between a pattern and a word.  This function
          263  +is exposed as spellfix1_editdist(X,Y).  The edit distance function
          264  +returns the "cost" of converting X into Y.  Some transformations
          265  +cost more than others.  Changing one vowel into a different vowel,
          266  +for example is relatively cheap, as is doubling a constant, or
          267  +omitting the second character of a double-constant.  Other transformations
          268  +or more expensive.  The idea is that the edit distance function returns
          269  +a low cost of words that are similar and a higher cost for words
          270  +that are futher apart.  In this implementation, the maximum cost
          271  +of any single-character edit (delete, insert, or substitute) is 100,
          272  +with lower costs for some edits (such as transforming vowels).
          273  +
          274  +The "score" for a comparison is the edit distance between the pattern
          275  +and the word, adjusted down by the base-2 logorithm of the word rank.
          276  +For example, a match with distance 100 but rank 1000 would have a
          277  +score of 122 (= 100 - log2(1000) + 32) where as a match with distance
          278  +100 with a rank of 1 would have a score of 131 (100 - log2(1) + 32).
          279  +(NB:  The constant 32 is added to each score to keep it from going
          280  +negative in case the edit distance is zero.)  In this way, frequently
          281  +used words get a slightly lower cost which tends to move them toward
          282  +the top of the list of alternative spellings.
          283  +
          284  +A straightforward implementation of a spelling corrector would be
          285  +to compare the search term against every word in the vocabulary
          286  +and select the 20 with the lowest scores.  However, there will 
          287  +typically be hundreds of thousands or millions of words in the
          288  +vocabulary, and so this approach is not fast enough.
          289  +
          290  +Suppose the term that is being spell-corrected is X.  To limit
          291  +the search space, X is converted to a k2-like key using the
          292  +equivalent of:
          293  +
          294  +<blockquote><pre>
          295  +   key = spellfix1_phonehash(lower(spellfix1_translit(X)))
          296  +</pre></blockquote>
          297  +
          298  +This key is then limited to "scope" characters.  The default scope
          299  +value is 4, but an alternative scope can be specified using the
          300  +"scope=N" term in the WHERE clause.  After the key has been truncated,
          301  +the edit distance is run against every term in the vocabulary that
          302  +has a k2 value that begins with the abbreviated key.
          303  +
          304  +For example, suppose the input word is "Paskagula".  The phonetic 
          305  +key is "BACACALA" which is then truncated to 4 characters "BACA".
          306  +The edit distance is then run on the 4980 entries (out of
          307  +272,597 entries total) of the vocabulary whose k2 values begin with
          308  +BACA, yielding "Pascagoula" as the best match.
          309  +
          310  +Only terms of the vocabulary with a matching langid are searched.
          311  +Hence, the same table can contain entries from multiple languages
          312  +and only the requested language will be used.  The default langid
          313  +is 0.
          314  +
          315  +<h2>Configurable Edit Distance</h2>
          316  +
          317  +The built-in Wagner edit-distance function with fixed weights can be
          318  +replaced by the [./editdist3.wiki | editdist3()] edit-distance function
          319  +with application-defined weights and support for unicode, by specifying
          320  +the "edit_cost_table=<i>TABLENAME</i>" parameter to the spellfix1 module
          321  +when the virtual table is created.
          322  +For example:
          323  +
          324  +<blockquote><pre>
          325  +CREATE VIRTUAL TABLE demo2 USING spellfix1(edit_cost_table=APPCOST);
          326  +</pre></blockquote>
          327  +
          328  +In the example above, the APPCOST table would be interrogated to find
          329  +the edit distance coefficients.  It is the presence of the "edit_cost_table="
          330  +parameter to the spellfix1 module name that causes editdist3() to be used
          331  +in place of the built-in edit distance function.
          332  +
          333  +The edit distance coefficients are normally read from the APPCOST table
          334  +once and there after stored in memory.  Hence, run-time changes to the
          335  +APPCOST table will not normally effect the edit distance results.
          336  +However, inserting the special string 'reset' into the "command" column of the
          337  +virtual table causes the edit distance coefficients to be reread the
          338  +APPCOST table.  Hence, applications should run a SQL statement similar
          339  +to the following when changes to the APPCOST table occur:
          340  +
          341  +<blockquote>
          342  +INSERT INTO demo2(command) VALUES('reset');
          343  +</blockquote>
          344  +
          345  +The tables used for edit distance costs can be changed using a command
          346  +like the following:
          347  +
          348  +<blockquote>
          349  +INSERT INTO demo2(command) VALUES('edit_cost_table=APPCOST2');
          350  +</blockquote>
          351  +
          352  +In the example above, any prior edit distance costs would be discarded and
          353  +all future queries would use the costs found in the APPCOST2 table.  If the
          354  +name of the table specified by the "edit_cost_table" command is "NULL", then
          355  +theh built-in Wagner edit-distance function will be used instead of the
          356  +editdist3() function in all future queries.
          357  +
          358  +<h2>Dealing With Unusual And Difficult Spellings</h2>
          359  +
          360  +The algorithm above works quite well for most cases, but there are
          361  +exceptions.  These exceptions can be dealt with by making additional
          362  +entries in the virtual table using the "soundslike" column.
          363  +
          364  +For example, many words of Greek origin begin with letters "ps" where
          365  +the "p" is silent.  Ex:  psalm, pseudonym, psoriasis, psyche.  In
          366  +another example, many Scottish surnames can be spelled with an
          367  +initial "Mac" or "Mc".  Thus, "MacKay" and "McKay" are both pronounced
          368  +the same.
          369  +
          370  +Accommodation can be made for words that are not spelled as they
          371  +sound by making additional entries into the virtual table for the
          372  +same word, but adding an alternative spelling in the "soundslike"
          373  +column.  For example, the canonical entry for "psalm" would be this:
          374  +
          375  +<blockquote><pre>
          376  +  INSERT INTO demo(word) VALUES('psalm');
          377  +</pre></blockquote>
          378  +
          379  +To enhance the ability to correct the spelling of "salm" into
          380  +"psalm", make an addition entry like this:
          381  +
          382  +<blockquote><pre>
          383  +  INSERT INTO demo(word,soundslike) VALUES('psalm','salm');
          384  +</pre></blockquote>
          385  +
          386  +It is ok to make multiple entries for the same word as long as
          387  +each entry has a different soundslike value.  Note that if no
          388  +soundslike value is specified, the soundslike defaults to the word
          389  +itself.
          390  +
          391  +Listed below are some cases where it might make sense to add additional
          392  +soundslike entries.  The specific entries will depend on the application
          393  +and the target language.
          394  +
          395  +  *   Silent "p" in words beginning with "ps":  psalm, psyche
          396  +
          397  +  *   Silent "p" in words beginning with "pn":  pneumonia, pneumatic
          398  +
          399  +  *   Silent "p" in words beginning with "pt":  pterodactyl, ptolemaic
          400  +
          401  +  *   Silent "d" in words beginning with "dj":  djinn, Djikarta
          402  +
          403  +  *   Silent "k" in words beginning with "kn":  knight, Knuthson
          404  +
          405  +  *   Silent "g" in words beginning with "gn":  gnarly, gnome, gnat
          406  +
          407  +  *   "Mac" versus "Mc" beginning Scottish surnames
          408  +
          409  +  *   "Tch" sounds in Slavic words:  Tchaikovsky vs. Chaykovsky
          410  +
          411  +  *   The letter "j" pronounced like "h" in Spanish:  LaJolla
          412  +
          413  +  *   Words beginning with "wr" versus "r":  write vs. rite
          414  +
          415  +  *   Miscellanous problem words such as "debt", "tsetse",
          416  +      "Nguyen", "Van Nuyes".
          417  +
          418  +<h2>Auxiliary Functions</h2>
          419  +
          420  +The source code module that implements the spellfix1 virtual table also
          421  +implements several SQL functions that might be useful to applications
          422  +that employ spellfix1 or for testing or diagnostic work while developing
          423  +applications that use spellfix1.  The following auxiliary functions are
          424  +available:
          425  +
          426  +<blockquote><dl>
          427  +<dt><b>editdist3(P,W)<br>editdist2(P,W,L)<br>editdist3(T)</b><dd>
          428  +These routines provide direct access to the version of the Wagner
          429  +edit-distance function that allows for application-defined weights
          430  +on edit operations.  The first two forms of this function compare
          431  +pattern P against word W and return the edit distance.  In the first
          432  +function, the langid is assumed to be 0 and in the second, the
          433  +langid is given by the L parameter.  The third form of this function
          434  +reloads edit distance coefficience from the table named by T.
          435  +
          436  +<dt><b>spellfix1_editdist(P,W)</b><dd>
          437  +This routine provides access to the built-in Wagner edit-distance
          438  +function that uses default, fixed costs.  The value returned is
          439  +the edit distance needed to transform W into P.
          440  +
          441  +<dt><b>spellfix1_phonehash(X)</b><dd>
          442  +This routine constructs a phonetic hash of the pure ascii input word X
          443  +and returns that hash.  This routine is used internally by spellfix1 in
          444  +order to transform the K1 column of the shadow table into the K2
          445  +column.
          446  +
          447  +<dt><b>spellfix1_scriptcode(X)</b><dd>
          448  +Given an input string X, this routine attempts to determin the dominant
          449  +script of that input and returns the ISO-15924 numeric code for that
          450  +script.  The current implementation understands the following scripts:
          451  +<ul>
          452  +<li> 215 - Latin
          453  +<li> 220 - Cyrillic
          454  +<li> 200 - Greek
          455  +</ul>
          456  +Additional language codes might be added in future releases.
          457  +
          458  +<dt><b>spellfix1_translit(X)</b><dd>
          459  +This routine transliterates unicode text into pure ascii, returning
          460  +the pure ascii representation of the input text X.  This is the function
          461  +that is used internally to transform vocabulary words into the K1
          462  +column of the shadow table.
          463  +
          464  +</dl></blockquote>