zero width space

(1) By Gert Van Assche (GertVA) on 2021-04-25 11:35:40 [link] [source]

Hi All, I find in Indian texts 'u200d'. It appears for instance in 'अमिताभ बच्u200dचन'. This u200d seems to be a zero width space. I cannot replace it by a space, and I cannot replace it by nothing because then it will no longer be a substring found in another string. So I need to replace it by an actual character, even if this won't show up on the screen.

Does anyone know how to go about this? Is there a function in SQLite to convert a unicode entity to an utf-8 character?

thanks

gert

(2.3) By Warren Young (wyoung) on 2021-04-25 13:33:21 edited from 2.2 in reply to 1 [source]

This u200d seems to be a zero width space.

It's a zero-width joiner, actually. A zero-width space is something else.

convert a unicode entity to an utf-8 character?

All UTF-8 characters are Unicode entities.

Are you perhaps saying you have a literal "u200d" string (5 ASCII characters) in the input and need to turn it into an actual Unicode character, encoded as UTF-8? If so, then presuming the terminal locale is UTF-8:

$ sqlite3 :memory: "select replace('अमिताभ बच्u200dचन', 'u200d', char(0x200D))"
अमिताभ बच्‍चन
$ !! | od -t c
0000000    अ  **  **   म  **  **   ि  **  **   त  **  **   ा  **  **   भ
0000020   **  **       ब  **  **   च  **  **   ्  **  ** 342 200 215   च
0000040   **  **   न  **  **  \n                                        
0000046

That "342 200 215" bit is the octal representation of the character 0x200D encoded in UTF-8.

In other words, you might be looking for a combination of replace() and char().

EDIT: Shortened example code for clarity. Also, bang-bang!

(4) By Gert Van Assche (GertVA) on 2021-04-25 19:37:31 in reply to 2.3 [link] [source]

Thank you so much for your help. This is indeed what I need to do.

(3) By Richard Damon (RichardDamon) on 2021-04-25 12:58:30 in reply to 1 [link] [source]

Uincode Codepoint U+200D is defined as a Zero-Width Joiner. It should take no space on the screen and tells the glyph system that the characters on either side of it rather than being 'independent' as they normally would be, are really together. (This seems mostly used with Emoji).

I don't know Indian coding to know what it is supposed to mean in that context, and maybe the issue is that your data source is doing something wrong and there shouldn't be a joiner there.

The question becomes, why do you think you need to change it?

If it is an encoding error from your source, then you just need to either keep the error everywhere or remove it everywhere,

SQLite itself won't care about it.

(5) By Gert Van Assche (GertVA) on 2021-04-25 19:39:11 in reply to 3 [link] [source]

Thanks Richard. You're right.

I'm comparing 2 strings, and one was processed correctly, the other one wasn't. So got a lot of non-matching hits that were due to this. Now it's solved.

(6) By Richard Damon (RichardDamon) on 2021-04-25 23:28:19 in reply to 5 [link] [source]

One other thing to watch out for if you are comparing strings is you may want to convert them to a 'canonical form, as some letters (with 'accent' marks) can be represented with different sequences of code points, either composed as a single code point or separate code points for the base character and the accent.