SQLite Forum

Joining on Hebrew words including vowel points and cantillation marks

(1.1) By Gary (1codedebugger) on 2021-05-20 01:47:34 edited from 1.0

Hello, I'm working with three different sources of Hebrew and would like to join the tables based on the Hebrew words. I'm pretty sure everything is in UTF-8 and all three sources claim to have used the exact same source for the Hebrew but many words don't match, yet they appear identical on the screen including vowel points and cantillation marks.

For example, for the word below, each source displays the exact same word visually, but if the words are written out character-by-character, it appears they are built in a different order. In one table, the rightmost letter is built as first the reversed-looking C symbol, then the dot inside it, and then the two stacked dots below it. In another table, the stacked dots come second and the single dot last.


Is there anything that can be done to make them match for a join? Or is it a case of bytes are bytes and, if they're not in the same order, they're different no matter how they render?

Thank you.

(2) By anonymous on 2021-05-20 03:11:45 in reply to 1.1

Normalization in Unicode is not a trivial topic. I think you want to store, or at least compare, normalized forms of the words in your tables.
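To make the problem concrete: the two mark orders described above are canonically equivalent, and normalization reorders combining marks into a single canonical order. A minimal Python sketch (assuming the letter in question is bet with a dagesh and a sheva; the exact code points in the actual data may differ):

```python
import unicodedata

# The same word with its combining marks typed in a different order:
# bet (U+05D1) + dagesh (U+05BC) + sheva (U+05B0) ...
a = "\u05D1\u05BC\u05B0"
# ... versus bet + sheva + dagesh.
b = "\u05D1\u05B0\u05BC"

print(a == b)  # → False: the raw code-point sequences differ

# NFC applies canonical ordering (sheva, combining class 10, sorts before
# dagesh, combining class 21), so both sequences normalize to the same string.
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # → True
```

Both strings render identically, but only after normalization do they compare equal byte-for-byte.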

(3) By Gary (1codedebugger) on 2021-05-20 04:10:01 in reply to 2

Thank you. I've never heard of any of this before and it certainly sounds like the issue I have with these Hebrew sources. Tcl and JavaScript appear to have normalization methods. The Tcl link looks like it hasn't been updated in a long time though. ( https://core.tcl-lang.org/tcllib/doc/trunk/embedded/md/tcllib/files/modules/stringprep/unicode.md ) Thanks again.

(4) By Gary (1codedebugger) on 2021-05-22 05:11:40 in reply to 1.1

This may not be an area in which many people are interested, but I was given an answer on stackoverflow concerning how to accomplish this in Tcl and Linux.

The answer is here if links are permitted. (https://stackoverflow.com/questions/67629309/unicode-normalization-of-hebrew-in-tcl-sqlite-or-whatever-else-will-work-in-li/67635812#67635812)

The main part is this procedure provided by Donal Fellows.

# Pipe the string through ICU's uconv to apply the requested normalization form.
proc normalize {string {form nfc}} {
    exec uconv -f utf-8 -t utf-8 -x "::$form;" << $string
}
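The same comparison can also be done inside SQLite itself by registering a normalization function for use in SQL. A sketch in Python rather than Tcl, since its standard library bundles both pieces; the table names and the NORM function name here are made up for illustration:

```python
import sqlite3
import unicodedata

con = sqlite3.connect(":memory:")
# Register a NORM() SQL function backed by Python's unicodedata module.
con.create_function("NORM", 1, lambda s: unicodedata.normalize("NFC", s))

con.executescript("""
    CREATE TABLE source_a(word TEXT);
    CREATE TABLE source_b(word TEXT);
""")
# The same Hebrew word with its combining marks in two different orders.
con.execute("INSERT INTO source_a VALUES (?)", ("\u05D1\u05BC\u05B0",))
con.execute("INSERT INTO source_b VALUES (?)", ("\u05D1\u05B0\u05BC",))

# A plain equality join finds nothing; joining on NORM() matches the words.
rows = con.execute("""
    SELECT a.word FROM source_a AS a
    JOIN source_b AS b ON NORM(a.word) = NORM(b.word)
""").fetchall()
print(len(rows))  # → 1
```

Calling NORM() in the join condition defeats indexes, so for large tables it is cheaper to normalize the columns once (e.g. `UPDATE source_a SET word = NORM(word)`) and join on plain equality.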

(5) By Simon Slavin (slavin) on 2021-05-22 12:27:52 in reply to 4

For languages where individual characters can be modified by combining marks, including Hebrew and Arabic, and also languages which have special forms for initial and final characters, it is recommended that you convert to normalised form on input. In other words, your files should contain only normalised forms. Any good Unicode library should include normalisation conversion routines.

This makes it easier to spot faulty processing on input, since a dump of the database will 'look wrong'. And it reduces processing because data can be input only once, but may be output many times.
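Following this advice with SQLite means normalizing at the moment of insertion, so the database only ever contains canonical text. A Python sketch under those assumptions (the `words` table and `insert_word` helper are hypothetical):

```python
import sqlite3
import unicodedata

def insert_word(con, word):
    # Store only the NFC form, so later joins and dumps are consistent.
    con.execute("INSERT INTO words(word) VALUES (?)",
                (unicodedata.normalize("NFC", word),))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE words(word TEXT)")
insert_word(con, "\u05D1\u05BC\u05B0")  # marks arrive in non-canonical order
stored = con.execute("SELECT word FROM words").fetchone()[0]
print(stored == "\u05D1\u05B0\u05BC")  # → True: the canonical order was stored
```

With every writer funneled through one such helper, queries never need to normalize at read time.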

Japanese text also benefits from this, since it means all text will be in a single width form, instead of whatever bizarre mixture of half-width and full-width characters your users might enter. I learned this the hard way, when my carefully formatted output went haywire after users cut-n-pasted Japanese text. I'm told that languages like Bengali and Tamil benefit too, but I don't know enough about them.
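Width folding for Japanese is a compatibility normalization (NFKC), distinct from the canonical NFC used for reordering Hebrew marks: NFKC maps the halfwidth katakana block onto the ordinary katakana letters. A minimal Python sketch:

```python
import unicodedata

half = "\uFF76"  # ｶ HALFWIDTH KATAKANA LETTER KA
full = "\u30AB"  # カ KATAKANA LETTER KA
# NFKC applies compatibility decompositions, folding the width variants together.
print(unicodedata.normalize("NFKC", half) == full)  # → True
```

Note that NFKC is lossy (the width distinction is erased), so it suits matching and searching; whether to store the folded form depends on whether that distinction ever matters to your application.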