SQLite Forum

Joining on Hebrew words including vowel points and cantillation marks
For languages where individual characters can be modified by combining marks, such as Hebrew and Arabic, and also for languages which have special forms for initial and final characters, it is recommended that you convert text to a normalised form on input.  In other words, your files should contain only normalised forms.  Any good Unicode library should include normalisation routines.
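As a rough sketch of what normalising on input could look like, here is a small Python example using the standard unicodedata and sqlite3 modules.  The table names, column names and sample values are made up for illustration, and NFC is just an assumed choice; any one normal form works as long as it is applied consistently to everything that goes in.

```python
import sqlite3
import unicodedata


def normalise(text: str) -> str:
    """Return text in NFC. NFC is an assumption here; the point is to pick one form."""
    return unicodedata.normalize("NFC", text)


# The same pointed Hebrew letter, with its two marks typed in different orders.
typed_one = "\u05D1\u05BC\u05B7"  # bet, dagesh, patah
typed_two = "\u05D1\u05B7\u05BC"  # bet, patah, dagesh

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lexicon (word TEXT PRIMARY KEY, gloss TEXT)")
conn.execute("CREATE TABLE verses (ref TEXT, word TEXT)")

# Normalise once, at the point where the data enters the database.
conn.execute("INSERT INTO lexicon VALUES (?, ?)",
             (normalise(typed_one), "example gloss"))
conn.execute("INSERT INTO verses VALUES (?, ?)",
             ("ref-1", normalise(typed_two)))

# A plain equality join now matches, because both sides hold the same code points.
rows = conn.execute(
    "SELECT v.ref, l.gloss FROM verses v JOIN lexicon l ON l.word = v.word"
).fetchall()
print(rows)
```

Without the normalise() calls the two strings are different sequences of code points, so the join would find nothing even though the words look identical on screen.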

This makes it easier to spot faulty processing on input, since a dump of the database will 'look wrong'.  It also reduces processing, because data is input only once but may be output many times.

Japanese text also benefits from this, since it means all text will be in one consistent width form (under compatibility normalisation, full-width Latin letters and digits become their ordinary narrow forms, and half-width katakana become standard katakana), instead of whatever bizarre mixture your users might enter.  I learned this the hard way, when my carefully formatted output went haywire after users cut-and-pasted Japanese text.  I'm told that languages like Bengali and Tamil benefit too, but I don't know enough about them to say.
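For the Japanese width problem, a minimal Python sketch follows.  It assumes compatibility normalisation (NFKC) is acceptable for your data; NFKC also folds other compatibility characters such as ligatures and circled digits, so it is more aggressive than NFC.  The function name fold_width is made up for this example.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Apply NFKC, which maps full-width Latin letters and digits to ASCII
    and half-width katakana to the standard katakana."""
    return unicodedata.normalize("NFKC", text)


latin = fold_width("\uFF21\uFF22\uFF23\uFF11\uFF12\uFF13")  # full-width "ABC123"
kana = fold_width("\uFF76\uFF80\uFF76\uFF85")               # half-width katakana

print(latin)  # ABC123
print(kana)   # カタカナ
```

Applied once on input, this means column widths and string comparisons no longer depend on which input mode the user happened to be in.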