SQLite Forum

FTS5: calculating column weight dynamically?
Login

FTS5: calculating column weight dynamically?

(1) By Stephan Beal (stephan) on 2020-05-26 06:37:14 [link] [source]

What i'm about to ask isn't an attempt to solve a specific problem, but i'm curious whether it's at all possible (if it is, i may apply it some rainy day, otherwise i'm just curious whether it's possible at all)...

My website uses, unsurprisingly, sqlite's FTS5 for its site search feature. One of the interesting aspects of FTS5 is the ability to give more or less weight to any given content column when searching by using the bm25 function, as demonstrated in this snippet from my site's search script:

https://fossil.wanderinghorse.net/r/www-wh/file?udc=1&ln=28-33&ci=tip&name=cgi-bin%2Fsearch.s2

In short, what that does is give more weight to search terms which are found in the URL path than it does to those same terms found in page-level content.

Now i'm wondering if this can be taken to The Next Level and modify the weight based on what level of the site's directory hierarchy the search term is found. The idea is that terms found on deeply-nested pages should weigh more than those same terms found on higher-up pages, under the presumption that they are more specific matches (because that pattern applies, generally speaking, to the whole site).

For example, searching for "painting" might find a hit on...

  • 1st level: a mention in the content of the /news page
  • 2nd level: a URI path element (/gaming/painting) and several mentions in page content.
  • 3rd level: as for (2), at /gaming/painting/airbrush, plus more page-level mentions on the airbrush page.

Ideally, assuming that the content-level mentions of (2) and (3) are roughly equivalent, they would be weighed such that their relative weights put (3) as the heaviest, (2) as the next one, and (1) in distant last place.

Is such a dynamic interpretation of FTS weights at least hypothetically possible?

Something along the lines of:

weight for each hit = hit in page content (1 * directory level) + hit in uri path element (3 * directory level)

:-?

For completeness's sake, though these probably aren't necessary for the discussion:

(2) By Dan Kennedy (dan) on 2020-05-29 10:59:57 in reply to 1 [link] [source]

Missed this when it was first posted a few days ago.

I don't see any reason that is not possible, if I understand correctly. You might need to write new FTS5 custom function. The built-in bm25() function is just an example of an FTS5 custom function, which is really just an SQLite scalar user function that has access to this API:

https://sqlite.org/fts5.html#custom_auxiliary_functions

That API can read any columns of the row the custom function is operating on (calculating a match score for), so if it can determine the nesting level based on that, it could return an adjusted match score as desired. Or, if the nesting level information is found elsewhere in the database you could pass it in to each invocation of the custom function via one of its trailing arguments.

One limitation is that if you want to install the custom function so that fts5 queries can use (ORDER BY rank) instead of "ORDER BY customfunction()", which is more efficient, all trailing arguments must be constants. So the nesting level would have to be read from the current row in that case.

Another is that it is more efficient to avoid accessing table data from FTS5 custom functions (as this requires a table lookup). Encoding the nesting level using some bits from the 64-bit fts5 table rowid value is one way around this.

Dan.

(3) By Stephan Beal (stephan) on 2020-05-29 11:13:22 in reply to 2 [source]

The built-in bm25() function is just an example of an FTS5 custom function, which is really just an SQLite scalar user function that has access to this API:

Thank you, Dan. i had forgotten that it's possible to customize that via custom functions - my thought train was along the lines of long/convoluted CASE blocks. This is yet another rabbit hole for me to crawl into :-D.