FTS5 Handling of multi-token phrases

(1) By gfc628 on 2020-11-30 15:07:43

SQLite 3.28.0 or 3.32.2

In the doc, I read:

A phrase matches a document if the document contains at least one sub-sequence of tokens that matches the sequence of tokens that make up the phrase.

I understand this as a means to gently degrade the source query into subqueries using subsets of the source tokens. E.g. if no result is found with:

MATCH "one two three"

then try:

MATCH "one two"

MATCH "one three"

MATCH "two three"

then, if no success, try:

MATCH "one"

MATCH "two"

MATCH "three"

But it does not actually behave like this. If the implicit AND between the tokens in the phrase "one two" doesn't succeed, no attempt is made with the individual tokens ("one" OR "two"), i.e. no OR is generated on the individual tokens.

Is this behaviour expected? Does it match the documentation?

Thanks

(2.1) By TripeHound on 2020-11-30 15:58:54 edited from 2.0 in reply to 1

I believe you're reading things that aren't there.

"if the document contains at least one sub-sequence of tokens"

To me, this means: split each document into tokens (e.g. "aaa bbb ccc ddd eee"), look at all of its sub-sequences (e.g. "aaa bbb ccc", "bbb ccc", "bbb ccc ddd", etc.) and see if any of those match

"the sequence of tokens that make up the phrase."

Here, there's no sub-sequencing mentioned; certainly no "degrading into subsets". If the phrase is "xxx yyy zzz" then only that exact sequence (somewhere in the source document) will match. A document containing just "xxx yyy aaa" won't match; neither will "bbb yyy zzz".
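
To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the underlying SQLite library was compiled with FTS5; the table name docs and the sample rows are made up for illustration.

```python
import sqlite3

# Assumption: this Python's SQLite build includes the FTS5 extension.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")  # hypothetical table
conn.executemany(
    "INSERT INTO docs(rowid, body) VALUES (?, ?)",
    [
        (1, "aaa xxx yyy zzz bbb"),  # contains the full phrase
        (2, "xxx yyy aaa"),          # only the first two tokens of the phrase
        (3, "bbb yyy zzz"),          # only the last two tokens of the phrase
        (4, "xxx aaa yyy zzz"),      # all three tokens, but not adjacent
    ],
)

# The FTS5 query is the phrase "xxx yyy zzz", double quotes included.
phrase_hits = [
    rowid
    for (rowid,) in conn.execute(
        "SELECT rowid FROM docs WHERE docs MATCH ? ORDER BY rowid",
        ('"xxx yyy zzz"',),
    )
]
print(phrase_hits)  # only row 1 contains the three tokens consecutively
```

Rows 2, 3 and 4 each share tokens with the phrase, but none of them contains the whole sequence, so none of them is returned.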

(4) By gfc628 on 2020-11-30 18:16:03 in reply to 2.1

Alright then. Blame my poor understanding of English and/or of query parsing!

Actually, I had thought FTS used to work the way I described and that maybe something had changed. Sorry about that.

I guess I'll come up with some ad hoc solution to progressively degrade a multi-token phrase, as I would want to match a document with "xxx yyy aaa" if the query is "xxx yyy zzz" (possibly with a lower relevance score).

Thanks for clarifying.

(3) By Dan Kennedy (dan) on 2020-11-30 16:01:23 in reply to 1

A phrase matches a document if the document contains at least one sub-sequence of tokens that matches the sequence of tokens that make up the phrase.

This is meant to say that the phrase "one two three" matches documents that contain an instance of the token "one", immediately followed by an instance of the token "two", immediately followed by an instance of the token "three".

Would it have made sense if "sub-" were removed from that sentence in the docs?

Phrases are different from sequences of tokens (really, sequences of single-token phrases) connected by implicit AND operators. This is a phrase query:

  ... FROM fts5tbl WHERE fts5tbl MATCH '"one two three"'

These are both queries for [one AND two AND three]:

  ... FROM fts5tbl WHERE fts5tbl MATCH 'one two three'
  ... FROM fts5tbl WHERE fts5tbl MATCH "one two three"

The only difference is that the second form is more exciting, as sometimes strings turn into column identifiers:

https://sqlite.org/quirks.html#double_quoted_string_literals_are_accepted
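
The difference between the phrase query and the implicit-AND query can be checked with a small sketch like the one below. It uses Python's sqlite3 module and assumes an FTS5-enabled build; the column name body, the helper rowids() and the sample rows are invented for the example. Passing the FTS5 query as a bound parameter also avoids the double-quoted string literal question entirely.

```python
import sqlite3

# Assumption: this Python's SQLite build includes the FTS5 extension.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE fts5tbl USING fts5(body)")
conn.executemany(
    "INSERT INTO fts5tbl(rowid, body) VALUES (?, ?)",
    [
        (1, "one two three"),                # tokens adjacent and in order
        (2, "three filler two filler one"),  # all three tokens, scattered
        (3, "one two"),                      # the token 'three' is missing
    ],
)

def rowids(fts_query):
    """Return the rowids matching an FTS5 query string (hypothetical helper)."""
    sql = "SELECT rowid FROM fts5tbl WHERE fts5tbl MATCH ? ORDER BY rowid"
    return [rowid for (rowid,) in conn.execute(sql, (fts_query,))]

phrase_hits = rowids('"one two three"')  # one phrase: tokens must be consecutive
and_hits = rowids("one two three")       # three single-token phrases, implicit AND

print(phrase_hits)  # row 1 only
print(and_hits)     # rows 1 and 2
```

Row 3 matches neither query because it lacks one of the tokens, and row 2 matches only the implicit-AND form.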

Dan.

(5) By gfc628 on 2020-11-30 18:22:33 in reply to 3

Thanks for this. I guess I read the documentation a little too fast.

I'll try to find out how best to degrade a complex source query in a controlled way, to minimize silence. (This issue is not new!)
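
One possible shape for such a controlled degradation, purely as a sketch: try the exact phrase first, then all tokens in any position (AND), then any token (OR), and stop at the first level that returns rows. The helpers quote(), degraded_queries() and search(), the table docs and the three-level scheme are all assumptions made for illustration, not something FTS5 provides. It needs an FTS5-enabled SQLite build.

```python
import sqlite3

def quote(text):
    # FTS5 string syntax: wrap in double quotes, doubling any embedded quote.
    return '"' + text.replace('"', '""') + '"'

def degraded_queries(tokens):
    """Yield FTS5 query strings from strictest to loosest."""
    yield quote(" ".join(tokens))                 # level 0: exact phrase
    yield " AND ".join(quote(t) for t in tokens)  # level 1: all tokens, any position
    yield " OR ".join(quote(t) for t in tokens)   # level 2: at least one token

def search(conn, tokens):
    """Return (level, rowids) for the first level that matches anything."""
    sql = "SELECT rowid FROM docs WHERE docs MATCH ? ORDER BY rank"
    for level, query in enumerate(degraded_queries(tokens)):
        rows = [rowid for (rowid,) in conn.execute(sql, (query,))]
        if rows:
            return level, rows
    return None, []

# Assumption: this Python's SQLite build includes the FTS5 extension.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")  # hypothetical table
conn.executemany(
    "INSERT INTO docs(rowid, body) VALUES (?, ?)",
    [(1, "xxx yyy aaa"), (2, "bbb ccc ddd")],
)

level, rows = search(conn, ["xxx", "yyy", "zzz"])
print(level, rows)  # falls through to the OR level and finds row 1
```

The returned level could be used by the caller to lower the relevance shown for results that came from a looser query.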

Thanks.

(6) By Dan Kennedy (dan) on 2020-12-01 11:10:34 in reply to 5

This might be a bit closer to what you need:

    SELECT * FROM fts5tbl('one OR two OR three') ORDER BY rank

https://www.sqlite.org/fts5.html#sorting_by_auxiliary_function_results
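
As a rough illustration of the effect, the sketch below runs that query against a tiny made-up corpus through Python's sqlite3 module (assuming an FTS5-enabled build). The column name body and all sample rows are invented, and every row has the same length so that only the matched terms influence the default bm25-based rank.

```python
import sqlite3

# Assumption: this Python's SQLite build includes the FTS5 extension.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE fts5tbl USING fts5(body)")
conn.executemany(
    "INSERT INTO fts5tbl(rowid, body) VALUES (?, ?)",
    [
        (1, "one two three"),  # matches all three terms
        (2, "one two aaa"),    # matches two terms
        (3, "one bbb ccc"),    # matches one term
        (4, "ddd eee fff"),    # rows 4-8 match nothing; they keep the
        (5, "ggg hhh iii"),    # query terms rare in the corpus
        (6, "jjj kkk lll"),
        (7, "mmm nnn ooo"),
        (8, "ppp qqq rrr"),
    ],
)

# Table-valued-function syntax, as in the query above; best matches sort first.
ranked = [
    rowid
    for (rowid,) in conn.execute(
        "SELECT rowid FROM fts5tbl('one OR two OR three') ORDER BY rank"
    )
]
print(ranked)  # rows matching more of the terms come first in this corpus
```

In a real corpus the ordering also depends on document length and on how common each term is, so a document matching fewer but rarer terms can outrank one matching more common terms.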

Dan.