SQLite User Forum

Is it too late to ask for EXACT in fts5 (vs NEAR)
Login

Is it too late to ask for EXACT in fts5 (vs NEAR)

(1.1) By Max (Maxulite) on 2024-11-13 09:22:10 edited from 1.0 [source]

Hi

Occasionally when searching in my sources with fts5, I need to find two words exactly some distance apart (but not following each other) and in this case NEAR gives too many "false positives" so I have to list them one by one and evaluate. As I understand currently it's not possible to narrow the search to this kind of constraint (CMIIW)

If technically it's not a big challenge, I wanted to suggest the EXACT group as a way to achieve this. But this (or any other new keyword) might introduce some regression if in the wild there are already many cases when EXACT was used as a general token.

Maybe there are other suggestions for syntax change. Sure, If I'm the only one who suffer, no big deal. Also maybe others might have some cases where fine distance tuning might be useful

Thanks

(2) By Roger Binns (rogerbinns) on 2024-11-13 15:12:56 in reply to 1.1 [link] [source]

You can do this by writing your own auxiliary function using the extension API.

You could use that function to return the word distance:

SELECT .... FROM mytable('hello world') WHERE word_distance(mytable) = 7 ORDER by rank;

Alternately your function could be used for ranking where you boost the scores of matched rows that meet your distance requirements. That will still allow the rows without the exact match to show lower in your results.

SELECT .... FROM mytable('hello world') ORDER BY word_distance(mytable, 7);

Changing the FTS5 query meaning would break existing code.

If you are using Python, then this will be easy. Contact me off list.

(5) By Max (Maxulite) on 2024-11-14 10:21:38 in reply to 2 [link] [source]

Thanks for your suggestions. Probably I will implement something like this (if NOT NEAR imperfections make me do this). The ideal is to make this feature working through the syntax. I already had some luck to "hack" syntax by introducing custom prefixes (using fts5 column filter syntax when the prefix name doesn't correspond to an existing column). An example is 'matchlike' prefix allowing a LIKE-like syntax to be used in fts5 queries. So in my bases I can query [ matchlike:%ing ]. Technically it catches "column not found errors" and replaces the erroneous part with an encoded "word" to be used further by providing synonyms to fts5. Maybe I might recognize a special prefix that alters the SELECT query to contain a distance function or something like this

(3) By anonymous on 2024-11-13 16:50:36 in reply to 1.1 [link] [source]

Did you try samething along the lines of

NEAR(word1 word2,10) NOT NEAR(word1 word2,9)
as equivalent of a nonexisting EXACT(word1 word2,9) ? Only tested briefly, seems to work for me.

(4) By Max (Maxulite) on 2024-11-14 09:58:23 in reply to 3 [link] [source]

Nice suggestion, thanks

There's at least one obstacle though. Looking at how it is supposed to work, it may omit some results. Taking your example, the query excludes all docs with the words distance 9 or less, so I will not see the docs where the desirable results exist (distance=10) along the not desirable ones (distance 9 or less). The example from the sqlite.org search is the query [ NEAR(row contains, 1) NOT NEAR(row contains, 0) ]. The results don't contain the page https://sqlite.org/fts5.html. I expected it to be listed since there is "row that contains" phrase there, but since the page also contains "row contains", it is omitted (for the query [ NEAR(row contains, 1) ] the page is listed) But in real life scenarios lacking some of results might be ok

It's interesting that "NOT NEAR" is virtually not discussed on the web, (see google's few results of [ "fts5" "NOT NEAR" ] ). But LLMs still manage to suggest sometimes NOT NEAR when asked about my issue. The evidence that it is not "borrowed" from a discussion but synthesized is for example perplexity.ai, that is notorious in making references as often as possible. It lacks ones if this suggestion is made