.mode box does not seem to like UTF-8

(1) By anonymous on 2020-10-01 19:44:05

Running the following test

Create Table Tracks (id,text);
Insert Into Tracks Values( 1,'夜明~朝ごはんの歌 Yoake ~ Asa-Gohan No Uta (Sunrise - The Breakfast Song)');
Insert Into Tracks Values( 2,'朝の通学路 Asa No Tsūgakuji (Off To School In The Morning)');
Insert Into Tracks Values( 3,'馬鹿騒ぎ Bakasawagi (A Big Commotion)');
Insert Into Tracks Values( 4,'追憶 Tsuioku (Reminiscence)');
.mode box
.once unibox.txt
Select id,
       Length(text) As Characters,
       Length(Cast(text As BLOB)) As Bytes,
       text 
  From Tracks Order By id;

I get this output:

┌────┬────────────┬───────┬───────────────────────────────────────────────────────────────────┐
│ id │ Characters │ Bytes │                               text                                │
├────┼────────────┼───────┼───────────────────────────────────────────────────────────────────┤
│ 1  │ 65         │ 81    │ 夜明~朝ごはんの歌 Yoake ~ Asa-Gohan No Uta (Sunrise - The Breakfast Song) │
│ 2  │ 53         │ 64    │ 朝の通学路 Asa No Tsūgakuji (Off To School In The Morning)             │
│ 3  │ 33         │ 41    │ 馬鹿騒ぎ Bakasawagi (A Big Commotion)                                 │
│ 4  │ 25         │ 29    │ 追憶 Tsuioku (Reminiscence)                                         │
└────┴────────────┴───────┴───────────────────────────────────────────────────────────────────┘

It appears .mode box gets confused about the length of the text column.

(2) By Keith Medcalf (kmedcalf) on 2020-10-01 20:47:17 in reply to 1

Your Unicode glyphs are wider than ASCII monospace glyphs, so they "hang over" the edge of the seat, pushing the folks at the end of the row right out the airplane window.

(3) By Larry Brasfield (LarryBrasfield) on 2020-10-01 21:34:38 in reply to 1

If you truly need prettified results, you can use HTML table output and view it in a browser. Or use the open-in-spreadsheet option with the .once command. (I'm not saying that's particularly pretty, but the borders will stay straight.)

(4) By anonymous on 2020-10-01 22:38:12 in reply to 2

Ok, that is a very graphic and clear explanation.
Now I see this is not a confusion between lengths in bytes or characters.
A solution would be a monospace font with all glyphs the same width; even if a font like that exists, finding it will take some effort.
Luckily I have alternatives, but this box mode seemed like a nice option.

(5) By anonymous on 2020-10-01 22:41:02 in reply to 3

Yes, HTML and PDF are solutions I am using now, but box mode would have been a nice quick alternative.
And opening in a spreadsheet ... well that is another can of worms.

(6) By anonymous on 2020-10-02 14:57:24 in reply to 4

> A solution would be a monospace font with all glyphs the same width

The problem is, some code points are defined to result in double-wide glyphs, even in monospace fonts. They have to be like that in order to be rendered correctly. The alternative would be to allocate fullwidth space even for halfwidth ("normal" for you and me) characters, which would also look bad.
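
As a rough illustration of the point above, here is a minimal Python sketch that estimates the number of terminal columns a string occupies. It assumes that the East Asian Width property from the standard `unicodedata` module is a good enough proxy for what a terminal does; the function name `display_width` is made up for this example, and control characters are not treated specially.

```python
import unicodedata

def display_width(text):
    """Estimate how many terminal columns `text` occupies (assumption-based)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn over the previous character
        # East Asian Width W (wide) and F (fullwidth) are counted as two columns
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width

title = "追憶 Tsuioku (Reminiscence)"   # row 4 of the test table
print(len(title))                  # characters: 25
print(len(title.encode("utf-8")))  # bytes: 29
print(display_width(title))        # estimated columns: 27
```

The first two numbers match the Characters and Bytes columns of row 4 in the original output. The third is neither of them, which is consistent with the box border being off by two cells on that row.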

In theory, the SQLite shell could have a look-up table to calculate box widths correctly, but it would take up too much space.
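
To get a feel for how large such a look-up table might be, the following Python sketch run-length encodes the wide code points into (first, last) ranges and looks them up with a binary search. It is only an experiment under assumptions: the data comes from whatever Unicode version the local Python bundles, and the names `build_wide_ranges` and `is_wide` are hypothetical. It says nothing about how the SQLite shell would or should store such a table.

```python
import bisect
import sys
import unicodedata

def build_wide_ranges():
    """Collect runs of code points whose East Asian Width is W or F."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        wide = unicodedata.east_asian_width(chr(cp)) in ("W", "F")
        if wide and start is None:
            start = cp                      # a new run of wide code points begins
        elif not wide and start is not None:
            ranges.append((start, cp - 1))  # the run ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges

WIDE_RANGES = build_wide_ranges()
RANGE_STARTS = [first for first, _ in WIDE_RANGES]

def is_wide(cp):
    """Binary search the range table for code point `cp`."""
    i = bisect.bisect_right(RANGE_STARTS, cp) - 1
    return i >= 0 and cp <= WIDE_RANGES[i][1]

print(len(WIDE_RANGES), "ranges, Unicode", unicodedata.unidata_version)
```

Running it prints the number of ranges for the local Unicode data, which gives a concrete figure to weigh against the "too much space" concern.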

(7) By anonymous on 2020-10-02 18:12:11 in reply to 6

This is one of the things that makes Unicode bad for terminal emulators. (My opinion is that no single character set is suitable for all purposes.)

Nevertheless, there are other ways to do it:

If the locale does not specify UTF-8, replace all non-ASCII characters (and all control characters, too) with a substitute (perhaps position 0x61 in the DEC special graphics set) when running in interactive mode.
If the locale does specify UTF-8, use the escape codes to save and restore the cursor position, and clear to the right of the cursor at the end of the line, in order to ensure it lines up even if the text may get cut off (unless you specify the width of each column explicitly).
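
One possible reading of the second suggestion, sketched in Python under assumptions: each cell saves the cursor, writes its text, restores the cursor, steps forward by the column width the caller chose, and erases whatever spilled past that point before the border is drawn. The escape sequences used (DECSC `ESC 7`, DECRC `ESC 8`, CUF `CSI n C`, EL `CSI K`) are standard VT100/ECMA-48 controls; the helper `render_row` and the fixed widths are hypothetical, and the sketch does not handle text that wraps at the terminal's right margin.

```python
SAVE_CURSOR = "\x1b7"     # DECSC: save cursor position
RESTORE_CURSOR = "\x1b8"  # DECRC: restore cursor position
ERASE_RIGHT = "\x1b[K"    # EL: erase from the cursor to the end of the line

def cursor_forward(n):
    """CUF: move the cursor n columns to the right."""
    return f"\x1b[{n}C"

def render_row(cells, widths):
    """Build one table row whose borders land on fixed columns,
    no matter how wide the terminal actually draws each cell's text."""
    out = ["│"]
    for text, width in zip(cells, widths):
        out.append(" ")
        out.append(SAVE_CURSOR + text + RESTORE_CURSOR)   # back to the cell start
        out.append(cursor_forward(width) + ERASE_RIGHT)   # cut off any overflow
        out.append(" │")
    return "".join(out)

print(render_row(["4", "追憶 Tsuioku (Reminiscence)"], [2, 30]))
```

With this approach the program never needs to know glyph widths, at the cost of truncating text that is wider than the allotted column.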

(8.1) By Keith Medcalf (kmedcalf) on 2020-10-02 19:19:11 edited from 8.0 in reply to 7

Well, no. You will experience the same "issue" if you use any proportionally spaced font. When your so-called terminal renders non-ASCII characters it renders them using a proportional font.

If you render all characters using a proportional font you will experience the same result. The "root cause" is that the boxing modes are designed for fixed width fonts and not for proportional fonts. If you wish to use proportional fonts then you need to use something designed to use proportional fonts.

This is why hooey-gui clickety-pokey's were invented.

(9) By RandomCoder on 2020-10-02 21:41:26 in reply to 7

This list of characters that will be double-width is known. For instance, some Python based projects use wcwidth to know when a character will take two spaces in the output.

I may take a stab at patching the shell to use this information for my own use at least. I'm not sure if the SQLite team wants to add such a feature to the shell.
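
A sketch of what width-aware padding could look like, written in Python rather than as a patch to the shell's C code. It assumes the standard `unicodedata` East Asian Width property as a stand-in for the wcwidth data mentioned above; `display_width` and `render_box` are hypothetical names, and headers are left-aligned here for brevity, unlike the centered headers in the shell's output.

```python
import unicodedata

def display_width(text):
    """Estimated terminal columns: 2 for wide/fullwidth, 0 for combining marks."""
    return sum(
        0 if unicodedata.combining(ch)
        else 2 if unicodedata.east_asian_width(ch) in ("W", "F")
        else 1
        for ch in text
    )

def render_box(headers, rows):
    """Draw a box table, padding each cell by display width, not len()."""
    table = [[str(cell) for cell in row] for row in [headers] + rows]
    widths = [max(display_width(row[i]) for row in table)
              for i in range(len(headers))]

    def rule(left, mid, right):
        return left + mid.join("─" * (w + 2) for w in widths) + right

    def line(row):
        cells = (" " + cell + " " * (w - display_width(cell)) + " "
                 for cell, w in zip(row, widths))
        return "│" + "│".join(cells) + "│"

    out = [rule("┌", "┬", "┐"), line(table[0]), rule("├", "┼", "┤")]
    out += [line(row) for row in table[1:]]
    out.append(rule("└", "┴", "┘"))
    return "\n".join(out)

print(render_box(["id", "text"],
                 [[3, "馬鹿騒ぎ Bakasawagi (A Big Commotion)"],
                  [4, "追憶 Tsuioku (Reminiscence)"]]))
```

Whether the borders actually line up on screen still depends on the terminal and font drawing each wide character in exactly two cells.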

(10) By anonymous on 2020-10-04 19:38:57 in reply to 9

> This list of characters that will be double-width is known.

While that is true, don't they sometimes add new characters into Unicode? (And you might want characters which aren't in Unicode. To represent them as Unicode, you would need to use private codepoints, and unless ranges are defined as narrow/wide, this won't work so well.) Also, some characters have an ambiguous width, and the assigned widths of some characters have occasionally changed.
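
For reference, the Unicode East Asian Width property does have an explicit "Ambiguous" class (`A`), and a width function has to pick a policy for it. The Python sketch below makes that choice a parameter; `char_width` and `ambiguous_wide` are hypothetical names, and the answers depend on the Unicode version bundled with the local Python (`unicodedata.unidata_version`).

```python
import unicodedata

def char_width(ch, ambiguous_wide=False):
    """Columns for one character; the caller decides how class A is treated."""
    if unicodedata.combining(ch):
        return 0
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A" and ambiguous_wide:
        return 2  # e.g. the policy of a legacy CJK terminal configuration
    return 1

print(unicodedata.unidata_version)
for ch in "±─追A":
    print(repr(ch), unicodedata.east_asian_width(ch),
          char_width(ch), char_width(ch, ambiguous_wide=True))
```

Note that the box-drawing characters themselves fall into the ambiguous class, so the borders are subject to the same policy question as the data.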

(My own solution is to not use Unicode, but some people will use Unicode anyways.)

(11.1) By Warren Young (wyoung) on 2020-10-05 09:31:10 edited from 11.0 in reply to 10

> don't they sometimes add new characters into Unicode?

Yes, which is why libraries like iconv and ICU get updated every time a new version of Unicode is published. If you try to feed Unicode 12 data to a machine running Unicode 8 libraries, you can expect failures. I don’t see how that’s any different from any other common data format versioning case, though.

> you might want characters which aren't in Unicode

Such as...?

> some characters have an ambiguous width,

I don’t know about that. I think it’s more the case that this varies by font, and a terminal-based program cannot reliably know what font the terminal is using. It can hard-code guesses and heuristics, but it can never be certain. This is the real reason you will never find a general solution to this problem.

If you have to solve it, you need to move to a medium where you can programmatically ask questions like, “How long is this string when rendered in this specific font?” (e.g. HTML rendered in a browser, as recommended previously in this thread.)

> My own solution is to not use Unicode

Yay provincialism!

> some people will use Unicode anyways

You’re using it right now. You just aren’t using every character Unicode defines.

(12) By anonymous on 2024-04-11 03:18:09 in reply to 6

It won't take up too much space. On GNU C platforms, you can just use wcwidth(3) and its friend wcswidth(3) from glibc. Portable implementations are also available, such as termux/wcwidth, which is under 30 KiB. And that's the size of the source code; it is much smaller when compiled.

Besides wcwidth, you also need a function to decode UTF-8 (or whatever multibyte encoding your locale uses) to UTF-32 or UCS-4, which can be done in just a couple of lines of code.
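
To give an idea of the size of such a decoder, here is a minimal Python sketch that turns UTF-8 bytes into a list of code points, the form a wcwidth-style function expects. It is deliberately not a fully validating decoder: it substitutes U+FFFD for stray or truncated sequences but does not reject every overlong form, surrogate, or out-of-range value. The function name `decode_utf8` is hypothetical.

```python
def decode_utf8(data):
    """Decode UTF-8 bytes into a list of code points (minimal validation)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                 # 1-byte sequence (ASCII)
            cp, extra = b, 0
        elif 0xC2 <= b <= 0xDF:      # lead byte of a 2-byte sequence
            cp, extra = b & 0x1F, 1
        elif 0xE0 <= b <= 0xEF:      # lead byte of a 3-byte sequence
            cp, extra = b & 0x0F, 2
        elif 0xF0 <= b <= 0xF4:      # lead byte of a 4-byte sequence
            cp, extra = b & 0x07, 3
        else:                        # stray continuation or invalid lead byte
            out.append(0xFFFD)
            i += 1
            continue
        tail = data[i + 1:i + 1 + extra]
        if len(tail) != extra or any((t & 0xC0) != 0x80 for t in tail):
            out.append(0xFFFD)       # truncated or malformed sequence
            i += 1
            continue
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)  # append 6 payload bits per byte
        out.append(cp)
        i += 1 + extra
    return out

print([hex(cp) for cp in decode_utf8("追憶 ū".encode("utf-8"))])
```

A C version would follow the same branches on the lead byte, so the "couple of lines" estimate is optimistic but not far off for a non-validating decoder.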