SQLite Forum

Use custom string function in FTS5 match clause
Login

Use custom string function in FTS5 match clause

(1) By Fenjin Wang (wangfenjin) on 2021-02-02 01:29:01 [link] [source]

Hi, I build a FTS5 tokenizer to support Chinese and PinYin, and also provide a custom function to generate the match string from user's input.

It all works well except the input contains a single quote, at this case it can't be matched. But if I first select my_custom_function(input), then use the returned value in match clause, it works. Here are the code pieces:

sqlite> select simple_query('''');
"''"
sqlite> select '    ', simple_highlight(t1, 0, '[', ']') from t1 where x match simple_query('''');
sqlite> select '    ', simple_highlight(t1, 0, '[', ']') from t1 where x match '"''"';
    |@English &special _characters."[']bacon-&and[']-eggs%

Please help me on this. Thanks in advance!

(2.1) By Larry Brasfield (LarryBrasfield) on 2021-02-02 02:32:02 edited from 2.0 in reply to 1 [link] [source]

What I gather from your post is that the middle query of the above 3 fails to yield the results you hope to see. You have presented a lot of code to read, which indicates that you have not narrowed this problem much, if at all.

One problem that does appear from a superficial study is this: Your function, simple_query(), when given the SQL literal '''' as its argument, appears to return either the SQL empty string literal token (two successive single-quotes), or that token bounded by double-quotes. [a] Neither one is likely to be a suitable right operand for the match operator.

[a. Showing the .mode meta-command output would show whether the double-quotes are being emitted by the sqlite3 shell or by your simple_query() function. ]

I think maybe your problem is that you expect that a pair of concatenated single-quotes, as a string value, should be interpreted as a length 1 string value containing a single-quote character. That is not so. The conversion of a SQL literal such as '''' to that length 1 string value happens when the SQL is parsed (or lexically analyzed, actually), and not as its parsed and compiled form is executed. Of course, the return from your custom function known as simple_query() is never passed through the SQL lexer or compiler; instead it is made, (by the compiler's output code), to participate in evaluation of the binary match expression with whatever string value it has, without any quoting, quote-escaping, or any such finagling.

(3) By Fenjin Wang (wangfenjin) on 2021-02-02 03:25:45 in reply to 2.1 [source]

Thanks Larry, you point me to the right direction!!!

After turn on quote mode, I see actually sqlite will try to escape single quote again, which means if my output is "''", it will make it to "''''"

sqlite> .mode quote
sqlite> select simple_query('''');
'"''''"'

All I need is return "'", and let sqlite escape for me, Thats awesome!

(4.1) By Larry Brasfield (LarryBrasfield) on 2021-02-02 03:53:26 edited from 4.0 in reply to 3 [link] [source]

Once your SQL is compiled and executing, there will be no "sqlite escape" action. The escaping is something SQL authors do and SQL lexers undo.

Yes, all you need (apparently, as your 3rd query success indicates) is for simple_query() to return a length-3 string, consisting of the character sequence: double-quote, single-quote, double-quote. (I write that out because, in this discussion, conventional quoting tends to confound the issue.)

That length-3 string value will not be subjected to any more SQL language spelling or representation transformations, such as "escaping" or its inverse. The value will be passed, without alteration, as an operand to whatever function you have caused to implement the "match" operator.

I realize you may have already understood all this. I expound on it because (in my opinion and experience) it is very important to distinguish between how things are represented and what they are at a deeper level. Such is the distinction between a string literal and a string value. One is a way of writing a string value so that a compiler (or lexer) may correctly interpret it; the other is a sequence of characters treated as an object in some hardware or a virtual machine.