SQLite User Forum

sqlite-wasm: Register custom FTS3/4 tokenizers?

(1) By Johannes Baiter (jbaiter) on 2023-07-27 18:21:44

I'm using the WASM build of SQLite in an application and would like to provide a custom tokenizer implementation for an FTS4 virtual table. Is it possible to create a sqlite3_tokenizer_module struct from JavaScript and then bind a pointer to it in the SELECT fts3_tokenizer(?1, ?2) query? Or do I have to implement the custom tokenizer in C and build a custom WASM build of SQLite?

(2) By Stephan Beal (stephan) on 2023-07-27 19:31:25 in reply to 1

I'm using the WASM build of SQLite in an application and would like to provide a custom tokenizer implementation for an FTS4 virtual table.

The canonical wasm build does not include FTS3/4 support; it has only FTS5. It was collectively decided early on that for this brand-new platform, supporting older or deprecated features was an unnecessary maintenance burden, under the [apparently incorrect] assumption that anyone working with wasm would be working with "the latest stuff."
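For readers who want to check which FTS modules a particular SQLite build actually includes, here is a minimal sketch. It uses Python's standard sqlite3 module only because it is convenient to run; the same CREATE VIRTUAL TABLE probe could be issued from any binding, including the JS one. The helper name has_module and the table name probe are made up for illustration, and the result reflects whichever SQLite library your Python links against, not the wasm build.

```python
import sqlite3


def has_module(name: str) -> bool:
    """Return True if a virtual table using module `name` can be created."""
    conn = sqlite3.connect(":memory:")
    try:
        # Creating a throwaway virtual table fails if the module is missing.
        conn.execute(f"CREATE VIRTUAL TABLE probe USING {name}(body)")
        return True
    except sqlite3.OperationalError:
        # Typically "no such module: <name>" when it is not compiled in.
        return False
    finally:
        conn.close()


# Probe the three FTS generations discussed in this thread.
available = {name: has_module(name) for name in ("fts3", "fts4", "fts5")}
print(available)
```

A probe like this only shows that the module loads; it says nothing about which tokenizers or auxiliary functions are registered.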

Or do I have to implement the custom tokenizer in C and build a custom WASM build of SQLite?

That would be required for FTS3/4 and would currently be required for FTS5. The JS API does not currently support custom tokenizers but there appears to be no fundamental reason it can't. Consider that added to the TODO list. FTS is still largely voodoo to me, so i cannot say with certainty whether this will be in the 3.43 release, but i'll strive to get it done before then.

(3) By Johannes Baiter (jbaiter) on 2023-07-28 15:01:54 in reply to 2

The canonical wasm build does not include FTS3/4 support; it has only FTS5. It was collectively decided early on that for this brand-new platform, supporting older or deprecated features was an unnecessary maintenance burden, under the [apparently incorrect] assumption that anyone working with wasm would be working with "the latest stuff."

Good to know! I was using the unofficial @sqlite.org/wasm version, which seems to include FTS3/4 (at least it works).

FTS5 brings me to another thing: My use case requires retrieving the offsets of the matching terms from the FTS index. FTS3/4 have the auxiliary offsets function that does just that. FTS5 has all the data structures required to retrieve them as well, but does not come with an existing auxiliary function to make this possible out of the box from userspace. There is, however, an API to register custom auxiliary functions from C; it would be great if this could be exposed via a JavaScript API as well.
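To make the FTS3/4 behaviour concrete, here is a small sketch of the offsets() function, again run through Python's standard sqlite3 module for convenience and assuming the linked SQLite library was built with FTS4. The table name docs and the sample text are invented for the example. offsets() returns a space-separated string of integers in groups of four: column number, query term number, byte offset of the match, and size of the match in bytes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumes FTS4 is compiled into the linked SQLite library.
conn.execute("CREATE VIRTUAL TABLE docs USING fts4(body)")
conn.execute("INSERT INTO docs(body) VALUES ('hello world')")

# offsets() yields "col term byte_offset byte_size" for every match.
row = conn.execute(
    "SELECT offsets(docs) FROM docs WHERE docs MATCH 'world'"
).fetchone()

# Split the flat integer list into one 4-tuple per matching term.
numbers = [int(n) for n in row[0].split()]
matches = [tuple(numbers[i:i + 4]) for i in range(0, len(numbers), 4)]

for column, term, byte_offset, byte_size in matches:
    print(f"column={column} term={term} offset={byte_offset} size={byte_size}")

conn.close()
```

Note that the offsets are in bytes, not characters, so they need care with non-ASCII text.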

SQLite is probably the most capable client-side search engine available at the moment, would be great to have as much feature-parity as possible :-)

Consider that added to the TODO list. FTS is still largely voodoo to me, so i cannot say with certainty whether this will be in the 3.43 release, but i'll strive to get it done before then.

Thank you so much, very much appreciated 🙏

(4.2) By Stephan Beal (stephan) on 2023-08-03 19:13:54 edited from 4.1 in reply to 3

There is, however, an API to register custom auxiliary functions from C; it would be great if this could be exposed via a JavaScript API as well.

Follow-up: work has started on integrating FTS5 customization into the wasm build. Auxiliary functions are essentially working but (A) are only marginally tested and (B) there's currently a memory leak involving the xDestroy() callbacks. In short: the xDestroy() callbacks for any FTS5 auxiliary function will be called but the xDestroy() WASM/JS bindings of those functions will (somewhat ironically) leak after the db is closed. How to solve that is currently unclear. (edit: fixed! (edit: maybe!))

The obligatory caveats:

  1. It's far from complete. It can register new auxiliary functions but nothing else yet (no custom tokenizers). Baby steps.

  2. Only features which are exposed via sqlite3.h (as opposed to via out-of-amalgamation files) will be included. We have a strict policy of "if it's not included in sqlite3.c then it does not go into the wasm build."

  3. There's still no guarantee that it will reach the trunk before the 3.43 release but it's still my goal to do so.

There's still lots to do here before it's suitable for release.

Good to know! I was using the unofficial @sqlite.org/wasm version, which seems to include FTS3/4 (at least it works).

FWIW, we do not expect our particular build of sqlite3.wasm to be "the" solution. It's offered as "a" solution and we hope that community members will post any variants which are better suited to their needs. At the project level we don't put any significant weight on "official" vs "unofficial" and welcome third-party solutions (so long as the support requests for them go to their own forums ;)).

(5) By Stephan Beal (stephan) on 2023-08-10 03:49:10 in reply to 2

The JS API does not currently support custom tokenizers but there appears to be no fundamental reason it can't. Consider that added to the TODO list.

Follow-up: there's good news and lesser good news, and i'll start with the latter...

This won't be done before the 3.43 release. We have conflicting priorities which will keep me away from this for the near-term future. It turns out that binding FTS5 customization in any meaningful way requires binding a whole lot more than just the fts5_api API. Unless the whole Fts5ExtensionApi is also exposed, the fts5_api pieces aren't of any real use to clients. Binding the complete custom auxiliary function API is definitely on the TODO list (and has already started), but custom tokenizers are not currently on that list.

The better news, however, is that one of those conflicting priorities includes working on that same feature for Java and that effort is giving me a much better idea of what needs to happen for the eventual JS implementation. Once it's known to work, it will provide an excellent template to create the JS variant from.

My apologies for the delay in getting this feature to you, but it hasn't been forgotten and is a definite TODO, as opposed to a "potential TODO" or a "might someday eventually consider making it a TODO."

(6) By Johannes Baiter (jbaiter) on 2023-08-13 22:16:27 in reply to 5

My apologies for the delay in getting this feature to you, but it hasn't been forgotten and is a definite TODO, as opposed to a "potential TODO" or a "might someday eventually consider making it a TODO."

No worries at all, thank you for getting on this so quickly! 🙏