Truncated Words bug with fts5vocab

(1) By J. Rayner-Hilles (occultus73) on 2021-10-21 16:59:46 [link] [source]

Hi there; new to the forum.

Can anyone recreate this bug I have found with FTS5?

Steps:

1. Load FTS5 Extension, if not included in the build.

2. Create a normal table with at least two columns called 'name' and 'keywords':

CREATE TABLE content (name TEXT, keywords TEXT);

3. Create a fts table based on that table:

CREATE VIRTUAL TABLE content_fts
USING fts5(name, keywords, content='content', tokenize='porter ascii');

4. Create a fts5vocab table based on that in turn:

CREATE VIRTUAL TABLE content_vocab
USING fts5vocab('content_fts','row');

5. Add a trigger to insert data into the fts table automatically:

CREATE TRIGGER content_ai AFTER INSERT ON content
BEGIN
	INSERT INTO content_fts (name, keywords) 
	VALUES (new.name, new.keywords); 
END;

6. Insert data containing the words 'Sample' and 'Wallpaper' into the table:

INSERT INTO content VALUES ('Sample Gif', 'content,sample,gif');
INSERT INTO content VALUES ('Sample Wallpaper', 'content,sample,wallpaper');

7. Query the content_vocab table,to see 'sample' and 'wallpaper' are truncated:

SELECT * FROM content_vocab

content	2	2
gif	1	2
sampl	2	4
wallpap	1	2

(2) By Richard Hipp (drh) on 2021-10-21 17:28:28 in reply to 1 [link] [source]

The Porter Stemmer algorithm does that. It is a feature, not a bug. The idea of a "stemmer" is to map words that share a common root into the same base form, so that they appear to the search algorithm as the same word.

The Porter Stemmer algorithm (named for its inventor, Martin Porter), only works for English. But it has been widely used for over 4 decades and works fairly well within its domain.

If you don't want to use the Porter Stemmer, leave off the "porter" keyword when you declare the FTS5 table.

(3) By J. Rayner-Hilles (occultus73) on 2021-10-22 10:27:12 in reply to 2 [link] [source]

As it happens, for my project I'm not actually using Porter's stemmer, but rather the custom Snowball stemmer in a modified build of FTS5. It seems that this too has the same feature in this instance.

The problem for me here is I was using the fts5vocab table to top up the Spellfix1 vocabulary table, as per the documentation - although technically there, fts4aux is specified, but I understand fts5vocab is a replacement for that.

As it is, Spellfix1 is "correcting" misspellings to these roots as you say, which isn't really what one expects of a spellfixer. That said, seems as I spellfix search text in order to improve querying against the fts tables, and the fts tables will necessarily understand these roots, then perhaps spell correcting to a root is relatively harmless.

(4) By Simon Slavin (slavin) on 2021-10-22 14:07:47 in reply to 3 [source]

The stemmer deliberately changes the word you supply into something else. Stemmers you might see in action right now might produce something that looks like the original word, but the stem is an internal representation, for use by the computer only. When doing your programming you should imagine that the stemmer takes the word 'something' and turns it into '174JGS72'.

You need to keep both strings of text: the original, as supplied by your program/user, and whatever your stemmer produces. They are not the same thing, and if you want them both you have to store them both.