Bilingual SQLite search tricks
(1) By jose isaias cabrera (jicman) on 2023-05-11 21:52:38 [source]
Greetings.
I have a website where folks search with Spanish words, but they don't include the accents. For example, when users search for the word 'canción', they type 'cancion'. This hides hits that should be found. So, I am doing this hack,
SELECT * FROM Songs WHERE st = 1
AND (
  replace(replace(replace(replace(replace(title,'á','a'),'é','e'),'í','i'),'ó','o'),'ú','u') LIKE '%cancion%' OR
  replace(replace(replace(replace(replace(body,'á','a'),'é','e'),'í','i'),'ó','o'),'ú','u') LIKE '%cancion%' OR
  replace(replace(replace(replace(replace(songType,'á','a'),'é','e'),'í','i'),'ó','o'),'ú','u') LIKE '%cancion%' OR
  replace(replace(replace(replace(replace(AgendaType,'á','a'),'é','e'),'í','i'),'ó','o'),'ú','u') LIKE '%cancion%' OR
  replace(replace(replace(replace(replace(key,'á','a'),'é','e'),'í','i'),'ó','o'),'ú','u') LIKE '%cancion%'
)
;
This works, but I was wondering if there is a better way. Thanks.
(2) By Larry Brasfield (larrybr) on 2023-05-11 22:55:14 in reply to 1 [link] [source]
Here is another approach that might be faster:
create table if not exists Songs(rowid integer primary key, title text);
insert or replace into Songs(rowid,title) values
(1, 'mi canción favorita me anima'), (2, 'my favorite cancion')
;
select title from Songs where title regexp 'canci[óo]n';
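Note that the REGEXP operator has no built-in implementation in SQLite; it calls a user-defined regexp() function if one has been registered (the CLI ships one as an extension). A minimal sketch of Larry's approach using Python's bundled sqlite3 module, with the same hypothetical Songs table:

```python
import re
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite's REGEXP operator calls a user-defined regexp(pattern, value)
# function; it is unavailable until one is registered.
con.create_function("regexp", 2,
                    lambda pat, val: re.search(pat, val) is not None)
con.execute("CREATE TABLE Songs(rowid INTEGER PRIMARY KEY, title TEXT)")
con.executemany("INSERT INTO Songs(rowid, title) VALUES (?, ?)",
                [(1, "mi canción favorita me anima"),
                 (2, "my favorite cancion")])
rows = [r[0] for r in con.execute(
    "SELECT title FROM Songs WHERE title REGEXP 'canci[óo]n'")]
print(rows)  # both spellings match
```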
(3) By Gunter Hick (gunter_hick) on 2023-05-12 05:35:36 in reply to 1 [link] [source]
Back in the 1980s, when character recognition was still performed by people paid per address entered, folks would attempt to enter giveaways with variations of their name spelling and address, hoping to get picked more easily. This was not desired, because the company collecting the addresses wanted just one address per actual human. To counteract this, we created a system that would reduce the spelling of names via language-dependent pronunciation rules down to a "matchcode". Take the very common German name "Meyer", which has variations of "ey" ("ei", "ai", "ay") and may also drop the second e, as in "Mayr". All of these matched down to the generic "Meir", so entering any variation of the "Mayer" surname would match all of the stored variations. Perhaps you could create an "accentless Spanish" FTS5 index.
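As it happens, FTS5's default unicode61 tokenizer can do the accent folding itself via its remove_diacritics option, so an "accentless" index needs no custom matchcode rules. A minimal sketch (assuming an SQLite build with FTS5, as Python's bundled sqlite3 normally has):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# remove_diacritics folds 'ó' to 'o' at index time, so the plain query
# term 'cancion' matches the accented title. (Mode 2 needs SQLite 3.27+.)
con.execute("""
    CREATE VIRTUAL TABLE songs_fts USING fts5(
        title, tokenize = "unicode61 remove_diacritics 2")
""")
con.execute("INSERT INTO songs_fts(title) VALUES ('mi canción favorita')")
hits = [r[0] for r in con.execute(
    "SELECT title FROM songs_fts WHERE songs_fts MATCH 'cancion'")]
print(hits)
```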
(4) By anonymous on 2023-05-12 06:13:09 in reply to 1 [link] [source]
I see this in two parts: a) how to de-accent, and b) where to do that work. For a), I think regex is fine (see https://stackoverflow.com/questions/18123501/replacing-accented-characters-with-plain-ascii-ones). For b), you can do the work on retrieval (as your select does) or on storage. If your db is essentially write once, read many (or slanted enough in that direction), then consider generated stored columns holding the regex or replace output, and index using those columns (see https://www.sqlite.org/gencol.html#:~:text=Generated%20columns%20can%20participate%20in,indirectly%20refer%20back%20to%20itself.). Obviously, the search would reference the de-accented generated stored columns, and the results would show the original columns. The trade-offs: your db will necessarily be larger, searches faster, and writes slower than if you did the work in the search/select. So benchmark performance, as always, because if theory and practice were the same, they would be synonyms. :)
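A minimal sketch of the generated-stored-column idea, reusing the OP's nested replace() as the generating expression (column names hypothetical; generated columns need SQLite 3.31+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# A STORED generated column materializes the de-accented text once, at
# write time, and can be indexed; searches then hit the stored copy.
con.execute("""
    CREATE TABLE Songs(
        title TEXT,
        title_ascii TEXT GENERATED ALWAYS AS (
            replace(replace(replace(replace(replace(
                title,'á','a'),'é','e'),'í','i'),'ó','o'),'ú','u')
        ) STORED
    )
""")
con.execute("CREATE INDEX songs_ascii ON Songs(title_ascii)")
con.execute("INSERT INTO Songs(title) VALUES ('canción nueva')")
row = con.execute(
    "SELECT title FROM Songs WHERE title_ascii LIKE '%cancion%'").fetchone()
print(row[0])
```

Note that a leading-wildcard LIKE still scans the table; the index only pays off for anchored patterns such as 'cancion%'.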
(5) By punkish on 2023-05-12 07:02:45 in reply to 1 [link] [source]
In my experience, Amazon has one of the most impressive language-independent search capabilities, and I have often wondered how they do it. Because of personal background, I routinely need to search for items on Amazon.es, Amazon.in and, of course, Amazon.com. I can enter search terms just about however I wish -- in Spanish ("curcuma"), in English ("turmeric", because my Spanish is not good and I only know the English term), or in Hinglish ("haldi", which is the Devanagari ह्लदी transliterated into the Latin script) -- and I find the items.
That is just a simple example, but really, Amazon is way more reliable than trying to first look up the term in a dictionary, especially when it comes to phrases. For example, my dictionary has entries for "bike" and "stem" but nothing for "bike stem" which is known as "potencia" in Spanish. Amazon finds it no matter how I enter it, accents or no accents.
The Hinglish search capabilities are really impressive and accurate. I can only imagine the massive, multi-language search database they must have constructed, not to mention the implementation that takes in any kind of input and spits out mostly correct results.
(6) By Phil G (phil_g) on 2023-05-12 09:15:53 in reply to 1 [link] [source]
This works…

…for now, perhaps, but what happens when you add a song title to your database including ñ or ç? And although you may only be expecting Spanish, what happens if a song with a "foreign" title is added, including other accents that you may not be expecting?

While LIKE is case-insensitive (by default), REPLACE is not, so your method is not currently catching anything with accented capital letters.

You also need to apply the same transformations to your user input, in case users do sometimes include accents.

Nested REPLACE quickly becomes a nightmare as you try to cover more and more characters.
but I was wondering if there is a better way.
SQLite (without the help of extensions) is not really the best choice for this sort of thing. Your favourite programming language probably has much better tools to deal with unicode and reducing accented characters to ascii "equivalents"1, and I'd recommend using these tools (either independently or via an SQLite extension) rather than attempting the conversion in pure SQL.
A few different ways to handle it come to mind:
- Write (or find?) an extension to add a custom function to SQLite that will reduce accented characters to plain ascii.
- Write (or find?) an extension to replace the built-in LIKE function with one that ignores both case and accents when matching.
  - Can the ICU extension do this? I've never used it so I'm not sure, I just know it exists.
- Add extra columns to your table to hold the ascii version of each field and transform the data before adding it to the database. Compare/search using the ascii version, but display the original in the results.
  - You could manage the transformation in your favourite programming language before it hits the database, or via a custom function/extension on the way in.
  - These could perhaps be GENERATED … STORED columns using a custom function/extension to transform the text.
Note that handling this during SELECT means processing the transformation for every relevant column in every row of the database every time, which is going to be inefficient (but depending on the size of the database and user expectations, may not actually be a problem). On the other hand, transforming just once and storing the result will be more efficient but will nearly double the storage size for text columns. Which approach is better depends on whether speed or database size is more important to you.
- ^ Noting that in some languages, (some) accented characters are considered to be entirely separate letters in their own right, not equivalent to the base ascii character
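The second option above (replacing the built-in LIKE) can be prototyped without a C extension: SQLite implements the LIKE operator by calling the two-argument like(Y,X) SQL function, and registering a function of that name overrides it. A minimal sketch in Python's sqlite3 module (the deaccent helper is illustrative, not a library API):

```python
import re
import sqlite3
import unicodedata

def deaccent(s):
    # NFD decomposition splits 'ó' into 'o' + combining acute; dropping
    # category Mn (non-spacing marks) leaves the bare letters.
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def fold_like(pattern, value):
    # SQLite calls like(pattern, value) for the LIKE operator; translate
    # the LIKE wildcards into a regex over de-accented, lowercased text.
    parts = [".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
             for ch in deaccent(pattern).lower()]
    return re.fullmatch("".join(parts), deaccent(value).lower()) is not None

con = sqlite3.connect(":memory:")
con.create_function("like", 2, fold_like)  # shadows the built-in like()
row = con.execute("SELECT 'canción' LIKE '%CANCION%'").fetchone()
print(row[0])  # 1: accent- and case-insensitive match
```

One caveat: overriding like() disables the LIKE optimization, so every LIKE becomes a full scan (which the '%...%' pattern forces anyway).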
(7) By Rowan Worth (sqweek) on 2023-05-12 09:29:47 in reply to 6 [link] [source]
Can the ICU extension do this? I've never used it so I'm not sure, I just know it exists.
Good question. A quick read reveals nothing about diacritics or accents in the high level documentation. Similarly it's not clear if ICU regexps can handle accent-agnostic matching.
Your favourite programming language probably has much better tools to deal with unicode and reducing accented characters to ascii "equivalents", and I'd recommend using these tools (either independently or via an SQLite extension) rather than attempting the conversion in pure SQL.
Second this, especially as the use of LIKE '%foo%' in the current approach effectively requires a table scan for every search anyway (ain't no simple index that helps with a wildcard prefix).
It might also be worth reading up on unicode normalization forms to provide some consistency when inserting records into the database.
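To illustrate why normalization matters here: the same accented character can arrive in composed or decomposed form, which compare unequal byte-for-byte. A small sketch with Python's unicodedata:

```python
import unicodedata

# 'ó' may arrive composed (U+00F3) or decomposed ('o' + U+0301); they
# render identically but differ byte-wise, so = and LIKE silently miss.
composed = "canci\u00f3n"
decomposed = "cancio\u0301n"
print(composed == decomposed)  # False
# Normalizing everything to one form (e.g. NFC) before INSERT keeps the
# stored representation consistent.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```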
(8) By jchd (jchd18) on 2023-05-12 09:51:10 in reply to 1 [link] [source]
Hi Jose, I don't remember which OS you're using, but if it's Windows, then you could use my unifuzz extension. It offers unaccented search, fuzzy search and much more, with full source included. Only the multilingual collation is Windows-dependent; unaccenting/folding relies on internal tables based on Unicode properties of the BMP. I've little time just now; search the forum for unifuzz.zip or something like that. If you don't find it, mail me jcd at antichoc.net and I'll post it to you.
(9) By Donal Fellows (dkfellows) on 2023-05-12 10:17:20 in reply to 1 [link] [source]
I was wondering if there is a better way.
There is, but it is at least partially outside the scope of SQLite.
You need to create a second column with the accent-less version of the titles (and equivalent for other text things that they search on). The de-accenting can be done by normalizing the text to NFD (Unicode normalization form, decomposed) and stripping the pure accent characters (especially of class Mn, "Non-spacing marks") from that. That's the part which SQLite doesn't provide (and isn't going to; the libraries for doing it are a bit too large given that many use cases for SQLite don't need the functionality). Once you've got that function, you can then prepare the text on which you search; there's a bunch of options for that. In theory, you could expose the normalize-strip function as a User Defined Function to SQLite, but you probably don't need to do that; I suspect it's not a good idea in performance terms to be running LIKE against things that are computed there and then (no way to implement it without a full table scan).

You'll probably want to also do the normalize-strip trick on the search terms entered by users. You can't count on them omitting accents, even if it is common for them to do so.
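The normalize-strip function described above is a few lines in any language with Unicode support; a sketch using Python's unicodedata, applied to both stored text and the user's search term:

```python
import unicodedata

def strip_accents(text):
    # NFD splits 'ó' into 'o' + combining acute accent; dropping
    # category Mn ("Mark, nonspacing") characters removes the accents.
    return "".join(ch for ch in unicodedata.normalize("NFD", text)
                   if unicodedata.category(ch) != "Mn")

print(strip_accents("canción"))  # cancion
print(strip_accents("mañana"))   # manana (note: folds ñ too -- see the
                                 # earlier caveat about separate letters)
```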
(10) By jchd (jchd18) on 2023-05-12 11:54:41 in reply to 8 [link] [source]
Here's a download link. Included .dll is 32-bit, ready for use.
https://u.pcloud.link/publink/show?code=XZtjitVZuJD0RhF1eDpAtSf38l3tmSbnFeS7
(11) By Keith Medcalf (kmedcalf) on 2023-05-12 13:12:52 in reply to 1 [link] [source]
You are, of course, posing the most difficult problem -- an unbounded instring brute force search. Presumably it is your "replace(replace(replace(..." crud you want to fix.
Unfortunately, the default ICU implementation does not appear to fold accents, only case. You can, however, code a transliterator to replace your "replace(replace(replace(..." monstrosity.
If you were to forgo the brute force instring search and use a "proper index" to search "from the beginning of the field", then you would simply need to use a proper collation, eg: en-u-ks-level1 or es-u-ks-level1 (und-u-ks-level1 would probably even be correct) when you define the column. Thereafter the column is case and accent preserving, but case and accent ignoring.
You could also use an ICU transliterator to do the case and mark folding. The following transliteration would do the trick:
"Latin;NFKD;[:Nonspacing Mark:]Remove;Lower;NFKC;ascii"
That is, convert all the characters to latin, decompose, remove non-spacing marks (accents), convert to lowercase, recompose, convert to ascii.
An example ICU function might look as follows:
static void icuTransliterate(sqlite3_context *ctx, int argc, sqlite3_value **argv)
{
  UParseError pe;
  UTransliterator *t = 0;
  UErrorCode status = U_ZERO_ERROR;
  const UChar *xform_id;
  UChar *buffer;
  int n, limit, bs;
  int slot = -1229149472; /* (unicode('I')*16777216 + unicode('C')*65536 + unicode('U')*256 + unicode(' ')) */
  const char *data;

  if (argc == 2) slot = 1;
  t = sqlite3_get_auxdata(ctx, slot);
  if (!t) {
    if (argc == 1) {
      /* u"..." is a char16_t literal, matching ICU's default UChar */
      xform_id = (const UChar *)u"Latin;NFKD;[:Nonspacing Mark:]Remove;Lower;NFKC;ascii";
    } else {
      xform_id = (const UChar *)sqlite3_value_text16(argv[1]);
    }
    t = utrans_openU(xform_id, -1, UTRANS_FORWARD, NULL, -1, &pe, &status);
    if (U_FAILURE(status)) {
      icuFunctionError(ctx, "utrans_openU", status);
      return;
    }
    sqlite3_set_auxdata(ctx, slot, t, icuTransliterateDelete);
    t = sqlite3_get_auxdata(ctx, slot);
    if (!t) {
      sqlite3_result_error_nomem(ctx);
      return;
    }
  }
  data = sqlite3_value_text16(argv[0]);
  n = sqlite3_value_bytes16(argv[0]);
  limit = n/2;
  bs = sizeof(UChar) * n;  /* twice the input size: headroom for growth */
  buffer = sqlite3_malloc(bs);
  if (!buffer) {
    sqlite3_result_error_nomem(ctx);
    return;
  }
  memcpy(buffer, data, n);
  n /= 2;                  /* byte count -> UChar count */
  utrans_transUChars(t, buffer, &n, n*2, 0, &limit, &status);
  if (U_FAILURE(status)) {
    sqlite3_free(buffer);
    icuFunctionError(ctx, "utrans_transUChars", status);
    return;
  }
  sqlite3_result_text16(ctx, buffer, limit*2, SQLITE_TRANSIENT);
  sqlite3_free(buffer);
}
{"fold", 1, SQLITE_ANY|SQLITEICU_EXTRAFLAGS, 0, icuTransliterate},
{"icu_trans",2,SQLITE_ANY|SQLITEICU_EXTRAFLAGS, 0, icuTransliterate},
You could then use:
SELECT *
FROM Songs
WHERE st = 1
AND (
fold(title) LIKE '%cancion%' OR
fold(body) LIKE '%cancion%' OR
fold(songType) LIKE '%cancion%' OR
fold(AgendaType) LIKE '%cancion%' OR
fold(key) LIKE '%cancion%'
)
;
(12.1) By jose isaias cabrera (jicman) on 2023-05-12 13:26:17 edited from 12.0 in reply to 1 [link] [source]
Thank you all for all those wonderful comments and suggestions. I am going to read and see what the best solution for me is.
BTW, I started using GLOB and REGEXP, but, because the input is 99% lowercase, this happens:
sqlite> select a from t where glob('*dios*',a);
VM-steps: 15
Run Time: real 0.000 user 0.000000 sys 0.000000
sqlite> select a from t where glob('*Dios*',a);
┌────────┐
│ a │
├────────┤
│ 'Dios' │
└────────┘
VM-steps: 17
Run Time: real 0.000 user 0.000000 sys 0.000000
OR
sqlite> select a from t where regexp(a,'*dios*');
VM-steps: 15
Run Time: real 0.000 user 0.000000 sys 0.000000
sqlite> select a from t where regexp(a,'*Dios*');
┌────────┐
│ a │
├────────┤
│ 'Dios' │
└────────┘
VM-steps: 17
Run Time: real 0.015 user 0.000000 sys 0.000000
So, I would have to lowercase a, and that would be a lot of text processing, unless there is a GLOB or REGEXP option to search case-insensitively. I couldn't find one in the docs. So, LIKE is the best option for me right now.
sqlite> select a from t where a like '%dios%';
┌────────┐
│ a │
├────────┤
│ 'Dios' │
└────────┘
VM-steps: 17
Run Time: real 0.000 user 0.000000 sys 0.000000
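For what it's worth, if the regexp() function is one you register yourself, case-insensitivity can be built into it instead of lowercasing the column. A minimal sketch with Python's sqlite3 (the same idea applies to any host language's regexp UDF):

```python
import re
import sqlite3

con = sqlite3.connect(":memory:")
# Registering regexp() with IGNORECASE gives a case-insensitive REGEXP
# operator without transforming the stored text.
con.create_function(
    "regexp", 2,
    lambda pat, val: re.search(pat, val, re.IGNORECASE) is not None)
con.execute("CREATE TABLE t(a TEXT)")
con.execute("INSERT INTO t VALUES ('Dios')")
row = con.execute("SELECT a FROM t WHERE a REGEXP 'dios'").fetchone()
print(row[0])  # Dios
```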
Thanks.
(13.1) By J-L Hainaut (JLHainaut) on 2023-05-14 09:05:49 edited from 13.0 in reply to 1 [link] [source]
Some RDBMSs (Oracle, SQL Server, PostgreSQL, maybe others) offer the built-in function "translate", which returns "the string from the first argument after the characters specified in the second argument are translated into the characters specified in the third argument" (copied from Google).
Fairly easy to code as a UDF if not available.
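A sketch of such a UDF in Python's sqlite3, simplified to the equal-length case (the real translate() in those RDBMSs also deletes characters when the third argument is shorter):

```python
import sqlite3

def translate(s, frm, to):
    # Oracle/PostgreSQL-style translate(): each character of `frm` maps
    # to the character at the same position in `to`.
    return s.translate(str.maketrans(frm, to))

con = sqlite3.connect(":memory:")
con.create_function("translate", 3, translate)
row = con.execute(
    "SELECT translate('canción', 'áéíóú', 'aeiou')").fetchone()
print(row[0])  # cancion
```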
(14.1) By midijohnny on 2023-05-13 22:41:50 edited from 14.0 in reply to 1 [link] [source]
Example CTE for you to experiment with.
This is a bit cack-handed, I'm afraid. Pretty sure it doesn't need the 'i' (index) column on the map to work, but this was the only way I could figure it out. Also the logic is a bit odd: the non-recursive part of the CTE (_repl) runs over the string with the first item to replace, then the recursive part counts down from MAX to 2.
Runs fast enough on this small dataset; I don't suppose this will scale that well if things got more complex.
WITH
map(i,s,r) AS (
VALUES
(1,'á','a'),
(2,'é','e'),
(3,'í','i'),
(4,'ó','o'),
(5,'ú','u')),
search(term) AS (
VALUES
('canción'),
('áéíóú')),
_repl(i,_temp) AS (
  SELECT (SELECT MAX(i) FROM map), REPLACE(term,s,r) FROM search JOIN map WHERE i=1
  UNION ALL
  SELECT i-1, REPLACE(_temp,s,r) FROM _repl JOIN map USING(i)
  LIMIT 100 -- safety net
)
SELECT * FROM _repl WHERE i=1;
To use in application - you could replace the 'search' CTE with a parameter like:
[...]
search(term) AS ( VALUES(:search) ),
[...]
And the 'map' table could be a physical table of course, I just added to CTE to provide complete example.
Example output:
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .mode markdown
sqlite> .read eg.sql
| i | _temp   |
|---|---------|
| 1 | cancion |
| 1 | aeiou   |
- EDIT: changed COUNT(*) to MAX(i), fewer assumptions.