A new option for CLI: -utf8
(1.2) By Larry Brasfield (larrybr) on 2023-04-15 18:59:05 edited from 1.1 [source]
[Edited to remove historical facts not true now, add discussed examples, and reflect current development state.]
A feature has been added to the CLI in the SQLite source repository. It should appear in the next non-patch release.
The feature adds a new CLI invocation option, "‑utf8", which can have a useful effect on Windows and is silently ignored (and not advertised) on other platforms.
When the option is provided to sqlite3.exe, and the input is interactive and from the console, the console "code page" is set to CP_UTF8 (aka "65001") if possible[1], the stdin stream is forced to wide Unicode character mode, console I/O is set to a line input mode, and input operations convert to UTF‑8 for the CLI's internal use. This setup is undone on normal exits.
The effect of that console setup is that Unicode or UTF-8 clipboard content can be pasted into the console as CLI input, where it may be rendered sensibly[2] and be interpreted as the same character sequence as was held in the clipboard. Typing characters that require multi-byte UTF‑8 representation should also work. The simple rule is that if it looks (or renders) right on input, the CLI input gathering will get the character codes corresponding to the glyphs seen during entry.
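As an illustration of the "convert to UTF‑8" step only, here is a minimal sketch of turning wide (UTF-16) console input into UTF-8. It is not the code in shell.c; the function name utf16_to_utf8 and its buffer conventions are assumptions made for this example, and the console setup itself (code page, stream mode) is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from shell.c): convert n UTF-16 code units to
** NUL-terminated UTF-8 in out[0..cap-1]. Returns the number of bytes
** written, not counting the NUL, or (size_t)-1 if out is too small.
** Unpaired surrogates become U+FFFD. */
size_t utf16_to_utf8(const uint16_t *in, size_t n,
                     unsigned char *out, size_t cap){
  size_t i = 0, o = 0;
  assert( in!=NULL && out!=NULL );
  while( i<n ){
    uint32_t cp = in[i++];
    size_t len;
    if( cp>=0xD800 && cp<=0xDBFF ){          /* high surrogate */
      if( i<n && in[i]>=0xDC00 && in[i]<=0xDFFF ){
        cp = 0x10000 + ((cp-0xD800)<<10) + (uint32_t)(in[i++]-0xDC00);
      }else{
        cp = 0xFFFD;
      }
    }else if( cp>=0xDC00 && cp<=0xDFFF ){    /* stray low surrogate */
      cp = 0xFFFD;
    }
    len = cp<0x80 ? 1 : cp<0x800 ? 2 : cp<0x10000 ? 3 : 4;
    /* Keep room for the bytes of this character plus the final NUL. */
    if( cap<len+1 || o>cap-len-1 ) return (size_t)-1;
    switch( len ){
      case 1:  out[o++] = (unsigned char)cp; break;
      case 2:  out[o++] = (unsigned char)(0xC0 | (cp>>6));
               out[o++] = (unsigned char)(0x80 | (cp&0x3F)); break;
      case 3:  out[o++] = (unsigned char)(0xE0 | (cp>>12));
               out[o++] = (unsigned char)(0x80 | ((cp>>6)&0x3F));
               out[o++] = (unsigned char)(0x80 | (cp&0x3F)); break;
      default: out[o++] = (unsigned char)(0xF0 | (cp>>18));
               out[o++] = (unsigned char)(0x80 | ((cp>>12)&0x3F));
               out[o++] = (unsigned char)(0x80 | ((cp>>6)&0x3F));
               out[o++] = (unsigned char)(0x80 | (cp&0x3F)); break;
    }
  }
  if( o>=cap ) return (size_t)-1;
  out[o] = 0;
  return o;
}
```

Fed the three code units U+5E73, U+4EEE, U+540D (平仮名), this yields the nine bytes E5 B9 B3 E4 BB AE E5 90 8D, which is the same byte sequence that hex('平仮名') reports in a working ‑utf8 session.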
It should be noted that making this work requires several interactions with things beyond the simple FILE* that console applications have been able to use successfully for plain text I/O for almost as long as consoles have existed. These "things" can only be legitimately munged after determining that the stdin FILE* in fact derives from a so-called "console host" on Windows.[3]
There is more work to be done, namely integrating the ‑utf8 processing with a line-editing package or possibly adapting that processing to how some such packages modify the console and input stream settings. This is in progress.
People who wish to use this version, for either its "nice" pasting behavior with non-ASCII characters or to help expose issues on older Windows systems, are welcome and invited to build the CLI from the repo trunk and report related matters.
When the ‑utf8 option is not provided, the CLI acts just as previous versions have with respect to console interaction. This makes it backward-compatible and safe. (The only incompatibility is that, previously, a ‑utf8 option would produce an error message and exit. Now, it either does something useful on Windows or is ignored on other systems.)
Here is a screen scrape of the present code in action on Windows Terminal with Consolas font:
> sqlite3 -utf8
SQLite version 3.42.0 2023-04-11 19:38:47
Enter ".help" for usage hints.
sqlite> SELECT '$ ¢ £ ¤ ¥ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ €';
$ ¢ £ ¤ ¥ ₠ ₡ ₢ ₣ ₤ ₥ ₦ ₧ ₨ ₩ ₪ ₫ €
sqlite> select '£10.72' as price;
£10.72
sqlite> select hex('平仮名');
E5B9B3E4BBAEE5908D
sqlite> .q
[C:\Work\Projects\Sqlite\OrgDevRepo\SqliteLib\CliUtf8]
>
, where the stuff rightward of prompts was pasted into the console.
A pseudo-screen-scrape[4], taken from a cmd.exe session on conhost.exe with Lucida Console font, with a query copy/pasted from the previous example:
sqlite> select hex('□ □ □ ');
E5B9B3E4BBAEE5908D
.
Back to Windows Terminal/Consolas font with a true-screen-scrape[5] taken from previous example screen:
select hex('平仮名');
E5B9B3E4BBAEE5908D
sqlite> # Edit of above follows.
sqlite> select unhex(hex('平仮名'));
平仮名
sqlite> .q
.
What the above two scrapes show is that even when not rendered properly, the underlying console buffer content is recovered accurately via select/copy and taken as input properly by a "sqlite3 -utf8" session.
The rendering strangeness on the legacy console makes it less useful for reading, but the correct scrape/paste/input attained with the -utf8 option can still be useful. However, with Windows Terminal as the console and appropriate font selection, the additional benefit of proper rendering on the display is attained.[6]
1. On older Windows versions incapable of supporting code page CP_UTF8, the ‑utf8 option is ignored except for a notice to stderr.
2. Whether characters are rendered sensibly is a matter of font selection and use of a modern console host such as Windows Terminal or possibly others that supplant the legacy console host seen as "conhost.exe" in process lists.
3. This is not the first such trickery; the partially red sign-on message from sqlite3.exe is another.
4. This is not an unaltered copy from the screen. Instead, what were rendered as box-like characters were replaced with U+25A1, "White Square". Otherwise, when pasted into the browser's text entry box, the characters render as the Japanese characters that were pasted into sqlite3 as the query.
5. A "true-screen-scrape" is directly copied from the screen and pasted here, without alteration.
6. This begins to approach parity with systems that handle UTF-8 natively in their consoles, without requiring substantial accommodation from the hosted applications. Such accommodation greatly complicates integration of line-editing which is needed to approximate parity at the console.
(2.1) By jose isaias cabrera (jicman) on 2023-04-13 15:05:42 edited from 2.0 in reply to 1.0 [link] [source]
What this code does is to allow a new CLI invocation option, "-utf8", which can have some effect on Windows and is silently ignored (and not advertised) on other platforms.
I have Windows 10 and this is the command prompt version that displays at the top:
Microsoft Windows [Version 10.0.19044.2604]
(c) Microsoft Corporation. All rights reserved.
In my case, this only works if the code page is set to 65001. The display in the command prompt is garbage, but copy and paste works. This is a copy from the DOS screen, pasted into the post after execution:
c:\P\bin>chcp 65001
Active code page: 65001
c:\P\bin>sqlite3 -utf8
-- Loading resources from C:\Users\E608313/.sqliterc
SQLite version 3.42.0 2023-04-12 17:54:52
Enter ".help" for usage hints.
sqlite> .ver
SQLite 3.42.0 2023-04-12 17:54:52 824382393d92d9eb6df8701de7c263280150569a708759c4a539acc6d8d38821
gcc-11.3.0
sqlite> sqlite>
sqlite> sqlite>
sqlite> sqlite> select '슷' as UTF8Char;
Ŀ
UTF8Char
Ĵ
'슷'
VM-steps: 5
Run Time: real 0.032 user 0.000000 sys 0.015625
sqlite> sqlite> select '平仮名' as UTF8JP;
Ŀ
UTF8JP
Ĵ
'平仮名'
VM-steps: 5
Run Time: real 0.000 user 0.000000 sys 0.000000
sqlite> sqlite> select '·' as UTF8MDot;
Ŀ
UTF8MDot
Ĵ
'·'
VM-steps: 5
Run Time: real 0.014 user 0.000000 sys 0.000000
sqlite> sqlite> select 'áéíóúñ' AS UTF8SP;
Ŀ
UTF8SP
Ĵ
'áéíóúñ'
VM-steps: 5
Run Time: real 0.015 user 0.000000 sys 0.000000
sqlite> sqlite>
I built it with Cygwin, and these are some things I noticed:
1. your SQLite version date is not the same as mine
2. there is a double sqlite> prompt
3. extra characters appearing: "Ŀ" and "Ĵ".
4. what you see above is not what I see on the screen. The nice UTF8 characters you see above are not what I see on the screen.
This is the output when leaving the code page default...
Microsoft Windows [Version 10.0.19044.2604]
(c) Microsoft Corporation. All rights reserved.
C:\Users\E608313>cd c:\p\bin
c:\P\bin>chcp
Active code page: 437
c:\P\bin>sqlite3 -utf8
-- Loading resources from C:\Users\E608313/.sqliterc
SQLite version 3.42.0 2023-04-12 17:54:52
Enter ".help" for usage hints.
sqlite>
sqlite> sqlite> select '슷' as UTF8Char;
┌──────────┐
│ UTF8Char │
├──────────┤
│ '∞è╖' │
└──────────┘
VM-steps: 5
Run Time: real 0.031 user 0.000000 sys 0.000000
sqlite> sqlite> select '平仮名' as UTF8JP;
┌─────────────┐
│ UTF8JP │
├─────────────┤
│ 'σ╣│Σ╗«σÉì' │
└─────────────┘
VM-steps: 5
Run Time: real 0.016 user 0.000000 sys 0.000000
sqlite> sqlite> select '·' as UTF8MDot;
┌──────────┐
│ UTF8MDot │
├──────────┤
│ '┬╖' │
└──────────┘
VM-steps: 5
Run Time: real 0.000 user 0.000000 sys 0.000000
sqlite> sqlite> select 'áéíóúñ' AS UTF8SP;
┌────────────────┐
│ UTF8SP │
├────────────────┤
│ '├í├⌐├¡├│├║├▒' │
└────────────────┘
VM-steps: 5
Run Time: real 0.000 user 0.000000 sys 0.000000
sqlite> sqlite>
Just an FYI. Perhaps the mode box and qbox can be fixed for code page 65001. Just a thought...
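The garbled cells in the code page 437 session above are consistent with the UTF-8 bytes of each string being displayed one glyph per byte through CP437. A minimal sketch of that reading follows. The table is deliberately partial (only the bytes needed for '·' and 'áéíóúñ'), and the names aCp437, cp437_glyph and cp437_view are hypothetical, made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Partial CP437-to-Unicode table: only the bytes needed for this
** illustration. A complete table would cover all 256 byte values. */
static const struct { unsigned char byte; unsigned int cp; } aCp437[] = {
  { 0xA1, 0x00ED },  /* í */
  { 0xA9, 0x2310 },  /* ⌐ */
  { 0xAD, 0x00A1 },  /* ¡ */
  { 0xB1, 0x2592 },  /* ▒ */
  { 0xB3, 0x2502 },  /* │ */
  { 0xB7, 0x2556 },  /* ╖ */
  { 0xBA, 0x2551 },  /* ║ */
  { 0xC2, 0x252C },  /* ┬ */
  { 0xC3, 0x251C },  /* ├ */
};

/* Return the code point CP437 displays for byte b, or U+FFFD when b is
** not covered by the partial table above. Bytes below 0x80 are passed
** through unchanged (printable ASCII is shared with CP437). */
unsigned int cp437_glyph(unsigned char b){
  size_t i;
  if( b<0x80 ) return b;
  for(i=0; i<sizeof(aCp437)/sizeof(aCp437[0]); i++){
    if( aCp437[i].byte==b ) return aCp437[i].cp;
  }
  return 0xFFFD;
}

/* Fill out[0..n-1] with what a CP437 display shows for n raw bytes:
** one glyph per byte, with no multi-byte decoding. Returns n. */
size_t cp437_view(const unsigned char *z, size_t n, unsigned int *out){
  size_t i;
  assert( z!=NULL && out!=NULL );
  for(i=0; i<n; i++) out[i] = cp437_glyph(z[i]);
  return n;
}
```

For example, '·' (U+00B7) is the two bytes C2 B7 in UTF-8, which this maps to U+252C U+2556, i.e. '┬╖' as seen in the UTF8MDot result. Likewise 'á' (C3 A1) comes out as '├í'.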
(3) By Larry Brasfield (larrybr) on 2023-04-13 15:14:26 in reply to 2.0 [link] [source]
Thanks, Jose, for taking a look and carefully reporting back.
Responding to selected points in order raised:
I will look into how the legacy conhost.exe responds to processes having set the code page. I gather that your default system code page is not 65001. (My experiments have mostly been with it set to 65001, but much more testing remains.)
Pending investigation, I suspect that the legacy conhost.exe is never going to do rendering reliably nicely for UTF-8. My effort will be to find the contour of its misbehavior so that expectations can be set accordingly.
your SQLite version date is not the same as mine
I probably ran that example before the check-in.
there is a double sqlite> prompt
I'm not seeing that. Can you clarify?
extra characters appearing: "Ŀ" and "Ĵ"
It appears that your .sqliterc sets box mode. And that mode is causing strange box-drawing rendering at times. The cli-utf8 branch has not touched output other than to set the code page, so I will look into how box characters are rendered by conhost.exe after its session code page is changed. (This is a new issue, of the kind I was hoping to see exposed early.)
what you see above is not what I see on the screen.
Among the issues I often see (whenever looking for them) with both conhost.exe and with legacy code pages is that screen select/copy and pasting work "strangely" for non-ASCII/non-8-bit characters. (By that, I mean they conform poorly to the simple model that pertains for ASCII.)
Perhaps the mode box and qbox can be fixed for code page 65001.
Perhaps, but that will be a separate mini-project. Presently, box mode has a simple notion of what horizontal space is occupied by each "character". I think it may be taking the notion of "fixed-width font", normally used for consoles, as some kind of useful truth.
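To make the width problem concrete, here is a minimal sketch of one possible "simple notion": count one column per UTF-8 code point. This is an assumption for illustration, not a description of what shell.c does, and the function name utf8_codepoints is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string by counting every
** byte that is not a continuation byte (10xxxxxx). The input is assumed
** to be valid UTF-8; no validation is done here. */
size_t utf8_codepoints(const char *z){
  size_t n = 0;
  assert( z!=NULL );
  for(; *z; z++){
    if( ((unsigned char)*z & 0xC0)!=0x80 ) n++;
  }
  return n;
}
```

For '平仮名' (without quotes) the byte length is 9 and this count is 3, yet terminals typically draw each of these East Asian wide characters across two columns, so neither number matches the six columns actually occupied. A box sized from either count would be misaligned for such text.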
I will do what can be reasonably done to get conhost.exe sessions to act better, but I believe that will be a lost cause to a significant extent. The "Windows Terminal" console host is much better behaved.
With the conhost.exe console, it seems to get the right pasted input to the CLI even when that cannot be rendered properly. And with the Windows Terminal console, the right input is captured and rendered properly.
Thanks again for your report.
BTW, today's check-in on the cli-utf8 branch eliminates the funkiness with longer interactive input lines.
(4) By jose isaias cabrera (jicman) on 2023-04-13 15:33:21 in reply to 3 [link] [source]
BTW, today's check-in on the cli-utf8 branch eliminates the funkiness with longer interactive input lines.
Let me build that one, and re-run. Also, I may need to reboot to, perhaps, get rid of the junk reported.
(5.1) By jose isaias cabrera (jicman) on 2023-04-13 18:57:20 edited from 5.0 in reply to 4 [link] [source]
Let me build that one, and re-run. Also, I may need to reboot to, perhaps, get rid of the junk reported.
So, after a reboot, I built b4fa233d3dda54fa and here are the results:
Versions...
c:\P\bin>ver
Microsoft Windows [Version 10.0.19044.2728]
sqlite> .ver
SQLite 3.42.0 2023-04-13 14:14:27 b4fa233d3dda54fa83771844cf5156bf1275c687925340af17a7713a9400dfef
gcc-11.3.0
1. opening sqlite3 with the -utf8 option creates a double sqlite> sqlite> prompt entry after the first ENTER key. ie:
c:\P\bin>sqlite3 -utf8
-- Loading resources from C:\Users\E608313/.sqliterc
SQLite version 3.42.0 2023-04-13 14:14:27
Enter ".help" for usage hints.
sqlite>
sqlite> sqlite>
sqlite> sqlite>
sqlite> sqlite>
sqlite> sqlite>
By the way, starting the CLI without -utf8 option does not create that double prompt. I also tested this without the .sqliterc file, as well as just using some of the entries in it. Same result for each test. This is the content of my .sqliterc:
.headers on
.timer on
.mode box
.mode qbox
.stats vmstep
2. I changed the DOS prompt font from Consolas to NSimSun and now I can see all the characters correctly. The box and qbox options are still not visible. These are my testing strings:
select '슷' as UTF8Char;
select '平仮名' as UTF8JP;
select '·' as UTF8MDot;
select 'áéíóúñ' AS UTF8SP;
This is the result:
c:\P\bin>sqlite3 -utf8
-- Loading resources from C:\Users\E608313/.sqliterc
SQLite version 3.42.0 2023-04-13 14:14:27
Enter ".help" for usage hints.
sqlite>
sqlite> sqlite> select '슷' as UTF8Char;
Ŀ
UTF8Char
Ĵ
'슷'
VM-steps: 5
Run Time: real 0.000 user 0.000000 sys 0.000000
sqlite> sqlite> select '平仮名' as UTF8JP;
Ŀ
UTF8JP
Ĵ
'平仮名'
VM-steps: 5
Run Time: real 0.015 user 0.000000 sys 0.000000
sqlite> sqlite> select '·' as UTF8MDot;
Ŀ
UTF8MDot
Ĵ
'·'
VM-steps: 5
Run Time: real 0.000 user 0.000000 sys 0.000000
sqlite> sqlite> select 'áéíóúñ' AS UTF8SP;
Ŀ
UTF8SP
Ĵ
'áéíóúñ'
VM-steps: 5
Run Time: real 0.016 user 0.000000 sys 0.000000
sqlite> sqlite>
3. One major problem found was that starting the sqlite3 without the -utf8 option and pasting this string,
select 'áéíóúñ' AS UTF8SP;
Will go into an infinite loop. By the way, do not hit ENTER after starting the CLI without -utf8 option. Just paste that string right after starting it. This is what it looks like:
c:\P\bin>sqlite3
-- Loading resources from C:\Users\E608313/.sqliterc
SQLite version 3.42.0 2023-04-13 14:14:27
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> select 'áéíóúñ' AS UTF8SP;
[edited after hitting control-c to get out of the infinite loop]
This is the text after hitting control-c twice to break out of the infinite loop:
c:\P\bin>sqlite3
-- Loading resources from C:\Users\E608313/.sqliterc
SQLite version 3.42.0 2023-04-13 14:14:27
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> select 'áéíóúñ' AS UTF8SP;
' ...>
' ...> ;
' ...>
Run Time: real 0.000 user 0.000000 sys 0.000000
Parse error: unrecognized token: "'
;"
select ' ;
^--- error here
c:\P\bin>
I am going to build this on my home computer and see if it's the work computer that is causing this infinite loop. It could be one of its many "[company]'s security features".
I hope this helps.
(6) By Larry Brasfield (larrybr) on 2023-04-13 19:23:44 in reply to 5.0 [link] [source]
Just responding to one of several issues, "infinite loop".
select 'áéíóúñ' AS UTF8SP;
Will go into an infinite loop.
This is no more than another way to demonstrate an input misinterpretation issue. The input gathering loop is not endless; the loop in shell.c is awaiting a closing single-quote, and the loop in fgets is awaiting a CR. Neither of these is seen because the 2nd single-quote is lost as the preceding characters are misparsed as if they were UTF-8 when they are not. You can see this by hitting CR, whereupon the continuation prompt appears which shows that a SQL string literal is open.
This is one of the misbehaviors that -utf8 is intended to cure. It exists in quite a few previous releases.
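To illustrate "misparsed as if they were UTF-8 when they are not", here is a small sketch of a UTF-8 well-formedness check. It assumes, for the sake of the example, that the console delivered the pasted text in code page 437 (the default reported earlier in the thread), where áéíóúñ are the single bytes A0 82 A1 A2 A3 A4. The function name looks_like_utf8 is hypothetical, and the check is simplified: it does not reject overlong forms or surrogates. It does not reproduce how the CLI or the C runtime actually consumed the input.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if z[0..n-1] is structurally valid UTF-8 (each lead byte is
** followed by the right number of continuation bytes), else 0.
** Simplified: overlong encodings and surrogates are not rejected. */
int looks_like_utf8(const unsigned char *z, size_t n){
  size_t i = 0;
  assert( z!=NULL || n==0 );
  while( i<n ){
    unsigned char c = z[i];
    size_t need, k;
    if( c<0x80 )              need = 0;
    else if( (c&0xE0)==0xC0 ) need = 1;
    else if( (c&0xF0)==0xE0 ) need = 2;
    else if( (c&0xF8)==0xF0 ) need = 3;
    else return 0;            /* stray continuation or invalid lead byte */
    if( need>n-i-1 ) return 0;                 /* truncated sequence */
    for(k=1; k<=need; k++){
      if( (z[i+k]&0xC0)!=0x80 ) return 0;      /* missing continuation */
    }
    i += need+1;
  }
  return 1;
}
```

Under that assumption, the CP437 form of the quoted literal fails this check at its first non-ASCII byte, while the UTF-8 form (C3 A1 C3 A9 C3 AD C3 B3 C3 BA C3 B1) passes. Code that expects UTF-8 and meets such bytes has to guess where characters begin and end, which is how a following byte such as the closing quote can be lost.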
(7) By Larry Brasfield (larrybr) on 2023-04-13 19:31:12 in reply to 5.1 [link] [source]
opening sqlite3 with the -utf8 option creates a double sqlite> sqlite> prompt entry after the first ...
That's a puzzler. It does not happen with the MSVC compiler and C runtime. I'll have to debug this after a gcc build to see what is going on. It promises to be a strange one since the prompting is so simply interleaved with input gathering.
(9) By jose isaias cabrera (jicman) on 2023-04-13 20:01:13 in reply to 7 [link] [source]
This may be the problem. I am using cygwin to build it. It works fine for my need. But, I wanted to give this utf8 fix a shot, since I do work on the windows cli most of the time.
(8) By Larry Brasfield (larrybr) on 2023-04-13 19:37:47 in reply to 5.1 [link] [source]
I changed the DOS prompt font from Consolas to NSimSun and now I can see all the characters correctly.
What I am about to say seems obvious, but will need saying somewhere/somehow in the docs for this feature (assuming it is released.)
The console can only display "correct" glyphs for code points that are mapped to glyphs for the font being used. Many fonts fall short of providing glyphs for the whole set of 17 Unicode 2^16-character code planes. I will not be trying to alleviate that problem.
I will try to make this clear somewhere suitable. Maybe a forum FAQ.
(10) By jose isaias cabrera (jicman) on 2023-04-13 20:03:38 in reply to 8 [link] [source]
Yes, of course. I don't think you need to write anything on this. I used to work for a translation company for 12 years, so I should have known better. :-( But, after 5 years out of it, it hides in the subconscious. :-)
(11) By Larry Brasfield (larrybr) on 2023-04-13 20:23:43 in reply to 10 [link] [source]
I'm not trying to say or suggest that you should have swallowed your report of differing results with font selection, or that the cause is too obvious to mention the behavior. And I was not lecturing you as being in need of it.
That sort of issue is easily conflated with input mishandling issues. It will need to be addressed, in some manner, so that people using weird character sets but who are maybe a little naïve regarding low-level technical details can successfully use the CLI to help get character I/O issues figured out.
(12) By jose isaias cabrera (jicman) on 2023-04-13 20:39:26 in reply to 11 [link] [source]
I'm not trying to say or suggest that you should have swallowed your report of differing results with font selection, or that the cause is too obvious to mention the behavior. And I was not lecturing you as being in need of it.
I know you're not. I am doing this to myself to pay closer attention next time. :-) I've heard this saying somewhere, "...the truth shall set you free."
(13) By Larry Brasfield (larrybr) on 2023-04-14 20:16:40 in reply to 5.1 [link] [source]
opening sqlite3 with the -utf8 option creates a double sqlite> sqlite> prompt entry after the first ...
This was quite a puzzler. It did not happen with the MSVC compiler and C runtime. The problem was that fgetws() from the gcc-provided runtime library for Windows, after getting one valid input ending with an "Enter" key code, returns twice with no intervening input. This is on a stream set to translate CR/LF to plain LF.
More strangely, my standalone demo of more UTF-8-friendly console input using the same library routines works fine. So one would infer that something done by the CLI to either the console or the stdin stream is responsible. Perhaps so, but after a day of futile delving into what that may be, I decided to just replace it. That way it will work identically across different libraries and their evolutions as they accommodate the somewhat less goofy UTF-8 capability of later OS versions and Windows Terminal progress.
The code at cli-utf8 tip now should not be built to use a line-editing package. (It is bound to fail because those have to dork console mode and streams, and that is unlikely to match the dorking needed for -utf8 right now.) Soon, the build will allow either the -utf8 option or a line-editor but not both. Later, I will see about making -utf8 work with linenoise.
(14) By jose isaias cabrera (jicman) on 2023-04-14 20:49:48 in reply to 13 [link] [source]
This was quite a puzzler.
The interesting piece is that I always build two versions:
1. cygwin's
2. a standalone Windows version without the need for any cygwin DLLs.
The cygwin's version works flawlessly. The Windows version works just as well, with the list of exceptions that I wrote up. :-) However, I must say that the cygwin mintty terminal handles Unicode very well so there is no need to be changing page codes, etc.
Thanks for the hard work on this. We, Windows users, appreciate the effort.
(15) By jose isaias cabrera (jicman) on 2023-04-14 21:14:26 in reply to 13 [link] [source]
New version 73a5f542 fixes the prompt double entry:
C:\P\bin>sqlite3 -utf8
-- Loading resources from C:\Users\E608313/.sqliterc
SQLite version 3.42.0 2023-04-14 19:56:32
Enter ".help" for usage hints.
sqlite>
sqlite>
sqlite>
sqlite> .ver
SQLite 3.42.0 2023-04-14 19:56:32 73a5f54231e9f6ad8f013df3987ea48c516080f9193ed873b56f982ee75658c2
gcc-11.3.0
sqlite>
(21) By Keith Medcalf (kmedcalf) on 2023-04-17 18:50:03 in reply to 5.1 [link] [source]
I thought you were using CYGWIN, not Native Windows?
Or have you forsaken CYGWIN to use Native Windows for these tests?
(22) By jose isaias cabrera (jicman) on 2023-04-17 19:06:38 in reply to 21 [link] [source]
I thought you were using CYGWIN, not Native Windows?
I am. But, sometimes when I go to other machines where I do not have admin rights, I have to use the DOS version. I always create two versions: cygwin's (make install) and one for DOS using this command:
$ x86_64-w64-mingw32-gcc shell.c -o sqlite3.exe sqlite3.c
Or have you forsaken CYGWIN to use Native Windows for these tests?
NEVER! I have been using CYGWIN since 1998. When it was managed by Red Hat, or one of its branches.
I tested this utf8 fix because I was waiting for some reports and thought that I had the time to test some of these changes that Larry had mentioned.
(23) By Keith Medcalf (kmedcalf) on 2023-04-17 19:12:33 in reply to 22 [link] [source]
create one for DOS
That does not create an executable that runs on DOS -- it creates a Windows executable.
(16) By Keith Medcalf (kmedcalf) on 2023-04-15 20:02:04 in reply to 3 [link] [source]
Although I have not seen any of the problems of which you speak, the one that does occur, and which is most annoying, is that "from time to time" the output stops "working" and you need to re-initialize the console host (whether the ancient one or the new Windows Terminal crap) by issuing the command ".system chcp ..."; otherwise ".mode [q]box" stops having boxes and starts spewing crappola.
Resetting the codepage fixes it.
(17) By Larry Brasfield (larrybr) on 2023-04-15 23:10:37 in reply to 16 [link] [source]
Is this with today's trunk after check-in 414010d236647728 or the cli-utf8 branch after check-in 543594a7277b12d1?
I assume you invoked with the -utf8 flag. (Please say if not.) However, as I consider how your failure may be occurring, I realize that if you have linked a line-edit library, the -utf8 option will be accepted but does nothing. (I will fix this.) Did your build have line-editing included?
Please confirm or correct my interpretation of "stops 'working'" together with "stops having boxes and starts spewing crappola". Does that mean some single characters are shown where single box drawing characters should have been? Or something more, less or different?
I must say that having the code page setting apparently changing (such that "system chcp 65001" cures the problem) is very strange considering that the shell.c code sets the code page at startup, restores it at exit, and otherwise leaves it alone. This also makes me suspect a line-editing package is active. And is my assumption correct that you were resetting the code page to 65001?
If you did not link a line-editing package to the shell, I would like to know what sort of things you were doing just before box drawing became messed up. I have not seen misbehavior such as you describe, but would like to if possible.
(18) By Florian Balmer (florian.balmer) on 2023-04-16 21:05:55 in reply to 1.2 [link] [source]
Nice!
My comments as a Windows nerd (sorry for the length):
Now that you're calling ReadConsoleW() directly, the calls to SetConsoleCP() to set and restore the 8-bit console encoding are obsolete. Making sure the encoding change is not leaked and persisting on Ctrl+C exit would require a SetConsoleCtrlHandler() hook to restore the encoding, with a lock against the main thread so it won't race on and call SetConsoleCP() again before it gets terminated.
My recommendation is to also call WriteConsoleW() directly, so that the SetConsoleOutputCP() calls can also be removed, and no more worries about setting and restoring the encoding are necessary. Only the function to print query results needs to use WriteConsoleW(); other small ASCII-only things like input prompts, newlines, etc. could still go through fprintf() or fputs() -- just make sure to flush the stdio streams before calling WriteConsoleW().
This test has no effect. Most C runtime libraries implement isatty() with GetFileType() under the hood, but that's wrong. For example, FILE_TYPE_CHAR is also returned for the NUL device (Windows NT object \Device\Null), not only for console handles. To determine whether any IO handle is suitable for Read/WriteConsole() just ask GetConsoleMode():
DWORD dwConsoleMode;
if( GetConsoleMode(hAnyIOHandle,&dwConsoleMode) ){
// Yes, it's a console or pseudo-console handle ...
}
In this thread, there's some confusion about the term "conhost(.exe)-based", because the new Windows Terminal implements its own modern version of conhost.exe. So Legacy Console vs. New Windows Terminal is more precise, for example.
Also in this thread, two variants of mojibake are mixed: wrong encoding conversion vs. missing font glyph. The Legacy Console has had full Unicode support for more than 30 years, but requires a font that has the required glyphs. But even if only replacement characters are displayed, copy/pasting to/from the console is still possible, in this case. The New Windows Terminal uses Direct2D/DirectWrite-based technology, which provides better font linking (to pick missing glyphs from other fonts) than the GDI-based Legacy Console. With an encoding conversion error, the original text usually can't be recovered, or at least not in a lossless manner.
In my opinion, ReadConsoleW() can be considered a line-editing library: you can use the Arrow, Home and End keys to navigate and select text, pick history entries by F7, or perform incremental history search by F8 -- quite handy!
Last but not least, another nice feature about ReadConsoleW(): if the returned number of characters read is 0 and GetLastError() returns ERROR_OPERATION_ABORTED, the user has pressed Ctrl+C. (This feature is undocumented, but it works on any version of Windows back to Windows XP.) So it's possible to disable the rigorous Ctrl+C handler of the SQLite shell while waiting for console input, and have Ctrl+C just clear the current unfinished multi-line query and restart with a new query at a blank prompt, for example. I always hit Ctrl+C to cancel unfinished input -- and then the SQLite shell throws me out.
(19) By Larry Brasfield (larrybr) on 2023-04-16 22:46:44 in reply to 18 [link] [source]
Thanks, Florian, for your thoughts and suggestions.
I will look at avoiding SetConsoleX() calls. Anything that reduces use of Windows APIs in this evolving console realm is to the good, IMO. I'm using atexit() to arrange for the settings to be undone, which is effective for ctrl-C exits, so I'm not too worried about leaks. What worry I have is about crash exits, but those are not tolerated for long.
Regarding revision of console output: I think this is a good idea, but output works reasonably well already. It is already funneled through a replacement for fprintf(), so this is easy to do from an implementation perspective. My approach to this has been very conservative, intentionally. Without the -utf8 option, the same system calls happen as before. So we have a good fallback position should problems be discovered: Don't use the option. I would do the same thing with output, at first.
After gaining experience, on multiple Windows operating systems[1], we may remove the "legacy" code for Windows console I/O.
On "This test has no effect.": I recognize that there may be redundancy, for the reason you say. I also like the alternative validation of "consoleness" for the standard input stream that you suggest. The point is to scrupulously avoid attempting console setup or I/O with something that is not, in fact, a console.
On terminology confusion: In process listings, I see "conhost.exe" associated only with processes that are either implementing their own console or running under Windows Terminal. I do not see the term as confusing and prefer it because it keys off of what other non-console-experts can see. I agree that, among people who often think about, read about or discuss these things, other terms are more precise. Those people are not the audience I target.
On "two variants of mojibake are mixed": I hope that I have not contributed to such confusion. I have not suffered it to my knowledge.
On ReadConsoleW() as a kind of line-editor: Yes, that's nice, at least nicer than echoing editing attempt inputs in a useless fashion. Nevertheless, I intend to get a respectable line editor with history and completions integrated to play nicely when -utf8 mode is in effect.
On ctrl-C handling: Whatever is done along those lines, we will want to be able to interrupt long-running queries and return to the REPL or exit when that interrupt is repeated. I intend to tackle that separately, but intend now to not do anything with console I/O that is inconsistent with that eventual fixup.
1. I tease Richard about his "dusty old computers".
(20) By Keith Medcalf (kmedcalf) on 2023-04-17 18:43:51 in reply to 19 [link] [source]
Now that this option is included, it is dead easy to emulate what I mean by "spews crappola". One merely has to set -utf8 and type using one of the box modes. Once again, resetting the code page to 437 fixes the issue.
>sqlite3 -init nul -utf8
-- Loading resources from nul
SQLite version 3.42.0 2023-04-17 18:19:45
Enter ".help" for usage hints.
sqlite> select value from generate_series(1,5);
1
2
3
4
5
sqlite> .mode box
sqlite> select value from generate_series(1,5);
�������Ŀ
� value �
�������Ĵ
� 1 �
� 2 �
� 3 �
� 4 �
� 5 �
���������
sqlite> .system chcp 437
Active code page: 437
sqlite> select value from generate_series(1,5);
┌───────┐
│ value │
├───────┤
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
└───────┘
sqlite> .exit
Of course, I do not actually know what the underlying cause could be. I have removed all the extensions that do not belong and moved them to where they belong, though that should not cause this.
That is the "Windows Terminal" garbage. The original conhost behaves the same but differently:
>sqlite3 -init nul -utf8
-- Loading resources from nul
SQLite version 3.42.0 2023-04-17 18:19:45
Enter ".help" for usage hints.
sqlite> select value from generate_series(1,5);
1
2
3
4
5
sqlite> .mode box
sqlite> select value from generate_series(1,5);
Ŀ
value
Ĵ
1
2
3
4
5
sqlite> .system chcp
Active code page: 65001
sqlite> .system chcp 437
Active code page: 437
sqlite> select value from generate_series(1,5);
┌───────┐
│ value │
├───────┤
│ 1 │
│ 2 │
│ 3 │
│ 4 │
│ 5 │
└───────┘
sqlite>
(24) By jose isaias cabrera (jicman) on 2023-04-17 19:14:23 in reply to 20 [link] [source]
The problem, I think, is that those boxes are created by these symbols: ┌, │, ┘, ┐, └ and ─. These may not be correctly transposed/represented in the code page 65001.
(25) By jose isaias cabrera (jicman) on 2023-04-17 19:22:21 in reply to 24 [link] [source]
By the way, in case you need the codes:
┌ = U+250C
│ = U+2502
┘ = U+2518
┐ = U+2510
└ = U+2514
─ = U+2500
(27) By Keith Medcalf (kmedcalf) on 2023-04-17 19:41:21 in reply to 24 [link] [source]
So you are saying that cp 65001 is not in actual fact a "UTF-8" codepage, but some other animal entirely. This would fit with normal Microsoft behaviour. Make a Cone of Cyanide that looks all pretty and sell it as "Ice Cream" to those who do not know any better. Very shortly the whole world will believe that Ice Cream tastes like almonds and causes death in short order.
And no amount of evidence that the "Ice Cream" is in fact Cyanide will make any difference -- that is the Microsoftee way.
(28) By Tim Streater (Clothears) on 2023-04-17 20:01:45 in reply to 27 [link] [source]
Wikipedia says that 65001 is UTF-8, and that 437 is the original IBM PC character set (8 bit) which includes those mode box characters. (FWIW).
(30) By Larry Brasfield (larrybr) on 2023-04-17 20:14:10 in reply to 28 [link] [source]
I am pretty sure that code page 65001 has box drawing characters. The issue Keith has exposed goes beyond simple font gaps.
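As a small illustration of why the box-drawing characters are representable under UTF-8, here is a minimal sketch of UTF-8 encoding in C. The helper name utf8_encode is hypothetical and is not part of SQLite or the CLI. It shows that a code point such as U+250C (┌) becomes the three bytes E2 94 8C, so any output path that treats those bytes as three single-byte characters would draw something other than one box corner.

```c
#include <stddef.h>

/* Encode one Unicode code point as UTF-8.
** Writes up to 4 bytes into out[] and returns the number of bytes
** written, or 0 for a surrogate or an out-of-range value.
** utf8_encode is a hypothetical helper for illustration only. */
size_t utf8_encode(unsigned long cp, unsigned char out[4]){
  if( cp<0x80 ){                      /* 1 byte: plain ASCII */
    out[0] = (unsigned char)cp;
    return 1;
  }
  if( cp<0x800 ){                     /* 2 bytes */
    out[0] = (unsigned char)(0xC0 | (cp>>6));
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if( cp>=0xD800 && cp<=0xDFFF ) return 0;  /* surrogates are not encodable */
  if( cp<0x10000 ){                   /* 3 bytes: includes U+2500..U+257F */
    out[0] = (unsigned char)(0xE0 | (cp>>12));
    out[1] = (unsigned char)(0x80 | ((cp>>6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if( cp<=0x10FFFF ){                 /* 4 bytes */
    out[0] = (unsigned char)(0xF0 | (cp>>18));
    out[1] = (unsigned char)(0x80 | ((cp>>12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp>>6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}
```

Calling utf8_encode(0x2500, buf) for the horizontal line character (─) likewise yields three bytes, E2 94 80.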
(29.1) By Larry Brasfield (larrybr) on 2023-04-17 23:49:36 edited from 29.0 in reply to 20 [link] [source]
(Edited a footnote for accuracy, mention repro and fix.)
It appears from your post #20 that we can change "from time to time" to "consistently", and concentrate on non-intermittent issues. That's a relief.
What you report in post #20 1st session appears to be a font issue[1], while the 2nd session appears to show a misinterpretation issue.
On my Win10 system, with the system default code page as 65001[2], the Consolas font selected[3], I get an opposite result. Boxes are drawn correctly until I enter ".system chcp 437", then drawn with peculiar characters[4] until I enter ".system chcp 65001".
Besides this reversal between alright/wonky, I do not see what appears in your 2nd session ("original conhost") 2nd query where, instead of the horizontal separators being shown with enough characters to span the header "value", only a single character is shown. I see enough to span the header in all cases. To me, this suggests a multi-byte misinterpretation issue.
However, with the system default code page set to 437, I see nearly the same symptoms that you report. (I attribute the difference to font selection.)
Looking at the present console output code, it appears that these variations are due to a final conversion using CP_OEMCP which is not in effect with the -utf8 option active. I believe the right cure is to provide an alternate output path for -utf8 operation that understands that CP_OEMCP is not relevant. (I think I may have disguised this issue by doing some tests and not others on a system with the system-wide code page set to 65001. (Added via edit:) This is apparently what happened. With the system default code page (selected by locale setup) set to 437, and the check-in on cli-utf8 branch, the behavior is as intended regardless of the system default code page.)
I am further tempted to restore code page (and any other console flag) settings on return from a .system command. Must ponder upon that.
Thanks for clarifying your report.
- [1] The black diamond enclosing a white question mark indicates this, although this does not rule out a byte-sequence misinterpretation issue which causes MBCS codes interpreted as UTF-8 sequences to map to no glyph.
- [2] This is set via Control Panel / Region applet / Administrative tab / Change system locale / Use Unicode UTF-8 checked.
- [3] Font picked via the console "Properties" menu for the legacy console host and via Settings/(cmd.exe app)/Additional Settings/Appearance in Windows Terminal.
- [4] The particular peculiarity is a matter of font selection but this matters little because the use of code page 65001 induces use of a much wider character set where, for most fonts, the box-drawing glyphs are present.
(31.2) By Larry Brasfield (larrybr) on 2023-04-18 00:03:13 edited from 31.1 in reply to 20 [link] [source]
(Restated via edit from an ask to an invitation to confirm or correct.)
I believe that the current tip of the cli-utf8 branch, check-in hash cc1d4296d71ee6e2, or trunk tip as of check-in 25edf6089724bf9f, cures the misbehavior you reported. If you have the opportunity, please confirm this or say otherwise.
This will cause wonky box output after your ".system chcp 437" trick. That, too, would confirm my diagnosis of the bug you reported.
Thanks again for reporting the misbehavior.
(33) By Keith Medcalf (kmedcalf) on 2023-04-18 01:07:10 in reply to 31.2 [link] [source]
Yes, that solved that problem and the output characters now appear to render correctly (and changing the code page disrupts the output).
(26) By Florian Balmer (florian.balmer) on 2023-04-17 19:35:00 in reply to 19 [link] [source]
I'm using atexit() to arrange for the settings to be undone, which is effective for ctrl-C exits ...
Maybe runtime libraries differ here, but from the docs and sample programs for MSVC, I think at least this implementation doesn't handle Ctrl+C exits. But that's probably irrelevant, soon.
Another minor thing:
WideCharToMultiByte() is called with the WC_COMPOSITECHECK and WC_DEFAULTCHAR flags. The SQLite library doesn't specify these flags for its sqlite3_win32_unicode_to_utf8() and family, so theoretically this might produce different results for the "same" SQL query depending on whether the query is passed on the command-line or by copy-paste.
(32) By Larry Brasfield (larrybr) on 2023-04-17 21:37:49 in reply to 26 [link] [source]
Maybe runtime libraries differ here, but from the docs and sample programs for MSVC, I think at least this implementation doesn't handle Ctrl+C exits.
I cannot see how you draw that conclusion from those facts. My reading of the same docs (and code for what that's worth[1]), and my own experiments show that the registered functions are called for any non-crash exit. Have you seen otherwise?
WideCharToMultiByte() is called with the WC_COMPOSITECHECK and WC_DEFAULTCHAR flags. The SQLite library doesn't specify these flags ...
That's a good observation. I will probably remove the WC_COMPOSITECHECK flag. I will have to think further on the WC_DEFAULTCHAR flag. I'm not sure that one ought not be used for sqlite3_win32_unicode_to_utf8().
- [1] The sample code is a use of atexit() and does not purport to show its implementation.
(34) By Florian Balmer (florian.balmer) on 2023-04-18 10:15:38 in reply to 32 [link] [source]
I cannot see how you draw that conclusion from those facts.
The Remarks section reads "... called when the program terminates normally ...", and I know that the default Ctrl+C handler just calls ExitProcess() on a separate thread, which is not normal termination on Windows. Also see the Remarks section about SetConsoleCtrlHandler().
Moreover, I've tested the program from the Example section, with an additional Sleep() call, or here in a simplified form:
> more < sample.c
#include <windows.h>
#include <stdlib.h>
#include <stdio.h>
void goodbye(void){
  printf("Goodbye!");
}
main(){
  atexit(goodbye);
  Sleep(5000);
}
> cl /nologo sample.c
sample.c
> sample.exe
^C
>
If you press Ctrl+C within 5 seconds, the atexit() code won't be run.
But as soon as you've switched to WriteConsoleW(), all worrying about setting the proper encoding and restoring it after .system commands and on program exit is gone.
Also have a look at this patch; it achieves Unicode output with only 7 lines of change to the utf8_printf() function. The only thing I'd change about this patch is to ask GetConsoleMode() whether or not the output handle is really a console handle.
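To make the WriteConsoleW() idea concrete, here is a minimal sketch. It is not the patch referred to above, and put_utf8 is a hypothetical name. It assumes the text to print is already UTF-8: on Windows it asks GetConsoleMode() whether stdout is a real console and, if so, converts to UTF-16 and calls WriteConsoleW(); otherwise (redirected output, or a non-Windows build) it writes the bytes unchanged.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Write a UTF-8 string to standard output.
** Returns 0 on success, -1 on failure.
** put_utf8 is a hypothetical helper, not a function of the CLI. */
int put_utf8(const char *z){
  size_t n = strlen(z);
#ifdef _WIN32
  HANDLE h = GetStdHandle(STD_OUTPUT_HANDLE);
  DWORD mode;
  /* GetConsoleMode() succeeds only for a genuine console handle. */
  if( n>0 && h!=INVALID_HANDLE_VALUE && GetConsoleMode(h, &mode) ){
    int nw = MultiByteToWideChar(CP_UTF8, 0, z, (int)n, NULL, 0);
    wchar_t *w;
    DWORD nOut = 0;
    BOOL ok;
    if( nw<=0 ) return -1;
    w = malloc((size_t)nw * sizeof(wchar_t));
    if( w==NULL ) return -1;
    MultiByteToWideChar(CP_UTF8, 0, z, (int)n, w, nw);
    fflush(stdout);  /* keep ordering with earlier stdio output */
    ok = WriteConsoleW(h, w, (DWORD)nw, &nOut, NULL);
    free(w);
    return ok ? 0 : -1;
  }
#endif
  /* Not a console (file, pipe) or not Windows: pass the bytes through. */
  return fwrite(z, 1, n, stdout)==n ? 0 : -1;
}
```

With this shape, the console code page no longer affects what is drawn on a console, while redirected output still receives UTF-8 bytes.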
Re: flags for WideCharToMultiByte(): I think consistency is most important. I'm not sure SQLite performs Unicode normalization, so results may be different for precomposed vs. separate characters (different Unicode normalization).
Why not use the sqlite_XXX_to_YYY() functions in your code?
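A minimal sketch of the precomposed vs. separate distinction, assuming no normalization is applied anywhere along the way. The names below are hypothetical. The same visible text "é" can arrive as one code point (U+00E9) or as "e" followed by a combining acute accent (U+0301), and the two UTF-8 byte sequences differ, so a byte-wise comparison treats them as different strings.

```c
#include <string.h>

/* U+00E9 LATIN SMALL LETTER E WITH ACUTE, precomposed form */
static const char kPrecomposed[] = "\xC3\xA9";
/* U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT */
static const char kDecomposed[] = "e\xCC\x81";

/* Byte-wise equality of two UTF-8 strings, with no normalization.
** same_utf8_bytes is a hypothetical helper for illustration. */
int same_utf8_bytes(const char *a, const char *b){
  return strcmp(a, b)==0;
}
```

Here same_utf8_bytes(kPrecomposed, kDecomposed) returns 0, which is why a conversion flag that composes characters on one input path but not another could make the "same" query behave differently.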
(35) By Florian Balmer (florian.balmer) on 2023-04-19 20:24:04 in reply to 34 [link] [source]
I overlooked that sqlite3.exe has a SetConsoleCtrlHandler() handler that calls exit(), which executes the atexit() code even if the main() function doesn't return.
(I found this while testing the new Ctrl+C behavior, which is a nice improvement!)
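The arrangement described in this post can be sketched as follows. This is an illustration only, not the code in sqlite3.exe; the names goodbye, on_ctrl and install_cleanup are hypothetical. A console control handler that calls exit() causes the functions registered with atexit() to run on Ctrl+C, whereas the default handler ends the process without running them.

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Cleanup to run at exit, e.g. restoring console settings. */
static void goodbye(void){
  puts("Goodbye!");
}

#ifdef _WIN32
/* Console control handler; Windows calls it on a separate thread.
** Calling exit() here runs the atexit() functions before the
** process ends. */
static BOOL WINAPI on_ctrl(DWORD ctrlType){
  if( ctrlType==CTRL_C_EVENT ){
    exit(1);
  }
  return FALSE;  /* let other events go to the next handler */
}
#endif

/* Register the cleanup and, on Windows, the Ctrl+C handler.
** Returns 0 on success, -1 on failure. */
int install_cleanup(void){
  if( atexit(goodbye)!=0 ) return -1;
#ifdef _WIN32
  if( !SetConsoleCtrlHandler(on_ctrl, TRUE) ) return -1;
#endif
  return 0;
}
```

Calling install_cleanup() at the start of main() in the earlier sample.c, in place of the bare atexit() call, would be one way to check whether "Goodbye!" then appears after Ctrl+C.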