FTS: particular order of NEAR queries possible? (after and not before)

(1) By Vadim Goncharov (nuclight) on 2024-12-22 16:40:20 [link] [source]

I want to utilize either FTS3/FTS4 or FTS5 for querying XPath-like expressions on trees of objects which can have some class, e.g. '//.[isa("Telegram::Message")]/peer_id/channel_id[value==1383907943]/../../../../../..' - of course, not exactly that, but as close to as ever possible with SQL. To this, I think about table structure similar to below (not all tables shown, object and key/value pairs omitted). First, let's define auxiliary class information table, with base/child class relationships:

id (INT)	parent (INT)	class_name (TEXT)
0	NULL	''
1	NULL	'Telegram::Message'
2	1	'Telegram::ServiceMessage'

Then, Object to leaf table like:

object_id	ipath_id	leaf_id
INT	INT	INT

And, of course, the most interesting part...

Internal paths table USING fts5:

ipath_id (INT)	name_path (TEXT)	class_path (TEXT)
0	''	''
1	'3'	'0'
2	'a b'	'cd ef'

This is a table with two text columns indexed by full text search. Many attempts to save space are made. Only unique paths are placed here, having just a compact integer ID in referencing leaves. Root and leaf are never part of path, so for common case "object is just a hash with no subobjects" the path will be empty.

Then, IDs constituting path from Attribute_name table are base32-encoded, most significant bit will show, is it array index or hash key. And this "words" are placed into name_path column.

For classes, things are more complicated, given that we have to support inheritance. Zero class is no class (for just plain hashrefs and arrayrefs). All others are constructed as words from Unicode characters, where each character, from left to right, has number (ord()) from class table, in order of inheritance chain.

For example, if class number encoded to 'd' character is child of class whose number is encoded to 'c' character, and current element in path is blessed to 'd' class, then word becomes 'cd'. If it is 'e' class, whose parent is 'd' - then 'cde', and so on.

This allows to make FTS queries like "c*" - so FTS engine will match base class and all it's descendants for us. If we need only 'd' class, we'll put "cd*" in query, and so on.

And finally the questions:

Is it possible in any FTS variants to use NEAR in particular order, e.g. "this" phrase must occur only after "that" phrase, not before? Because if matching, roughly, "A/B" NEAR "C/D" returns match for both "A/B/C/D" and "C/D/A/B", that would be very different objects.

Or it is better to combine name & class in one text string for more properly doing the task?

Or this type of queries should be done by entirely different way?

(2) By Max (Maxulite) on 2024-12-25 21:05:06 in reply to 1 [source]

I assume that for your queries you don't need to use fts5 syntax queries and use fts5vocab module companion. In this case you can get the list of rowids using general sql queries against this virtual table.

The best fit is the instance variant

CREATE VIRTUAL TABLE FtsIndexInstance USING fts5vocab(FtsIndex, instance);

since it returns full information including offsets.

Theoretically this module can be used to return the same results as available with the full text syntax and even more, but using it effectively can be tricky because you're working with general rows of data from a virtual table and sql queries that together when unoptimized might multiply the execution time significantly.

You asked about forcing the order or terms for NEAR queries, but I will give an example with an exact distance between the terms. Actually I even asked about it, but in the context of extending fts5 syntax. But you will get the idea.

The following queries assumes searching for a distinct docs containing 'foo' 'bar' with exact distance (5) between them The naive approach is to use a join

select distinct fi2doc from
(
select * from 
  (select * from FtsIndexInstance fi1 where fi1.col='text' and fi1.term='foo') fi1
join 
  (select *, doc as fi2doc from FtsIndexInstance fi2 where fi2.col='text' and fi2.term='bar') fi2
on 
  fi1.doc=fi2.doc and fi2.offset-fi1.offset=5
)

and it will work fast if your terms (foo bar) are rare enough, but if they are frequent and the index is big the duration will increase significantly because (you can see it in the query plan), SQLite will use sequential enumeration of the rows.

But the time can be decreased if one uses left join instead of join. In this case SQLite starts to use a bloom filter. In my case for a big index with frequent words the previous one took many minutes, while the one below only 30 seconds

select distinct fi2doc from
(
select * from 
  (select * from FtsIndexInstance fi1 where fi1.col='text' and fi1.term='foo') fi1
left join 
  (select *, doc as fi2doc from FtsIndexInstance fi2 where fi2.col='text' and fi2.term='bar') fi2
on 
  fi1.doc=fi2.doc and fi2.offset-fi1.offset=5
) where fi2doc is not null

But the time is still not ideal if the terms are frequent. The significant improvement is possible if we use intersect

select distinct doc from
(
select * from
(
  select doc, offset +5 from FtsIndexInstance where col='text' and term='foo'
) intersect
  select doc, offset from FtsIndexInstance where col='text' and term='bar'
)

This one for my comparatively big index and very frequent terms gave the result in less than a second.

Also you may find your own tricky solutions if these approaches is not enough