The content is a transliteration of Arabic names, so there are general conventions that are followed: [this page](https://en.wikipedia.org/wiki/Romanization_of_Arabic) shows a mapping of some of the conventions used.

I agree that it would be great to create a tokenizer for this. How about a tokenizer that follows a "whitelist" approach, where only certain character ranges are treated as valid and everything else is ignored?

> Maybe it should be stripped out before tokenization or something

Doesn't this mean we would have to duplicate the content into another column and build the fts5 virtual table off of that column if I wanted to retain the original symbols? That sounds like it would consume more space.
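To make the whitelist idea concrete, here is a minimal sketch in Python (not an actual fts5 tokenizer, which would be written in C against SQLite's tokenizer API). The character ranges below are assumptions chosen to cover ASCII plus the Latin diacritic blocks commonly used in romanized Arabic (e.g. ā, ḥ, ṣ); they would need tuning against the real data:

```python
import re

# Hypothetical whitelist: ASCII letters/digits, apostrophe, and the
# Latin-1 Supplement / Latin Extended-A / Latin Extended Additional
# blocks, which cover most diacritics used in romanized Arabic.
WHITELIST = re.compile(r"[A-Za-z0-9'\u00C0-\u017F\u1E00-\u1EFF]+")

def tokenize(text: str) -> list[str]:
    """Emit lowercase tokens built only from whitelisted characters.

    Any character outside the whitelist acts as a separator, which is
    the 'everything else is ignored' behavior described above.
    """
    return [m.group(0).lower() for m in WHITELIST.finditer(text)]
```

For example, `tokenize("ʿAbd al-Raḥmān (d. 854)")` drops the ʿayn marker and punctuation but keeps the diacritics, yielding `["abd", "al", "raḥmān", "d", "854"]`. A real fts5 tokenizer would apply the same rule inside its `xTokenize` callback instead of regex scanning.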