Comments on the "win-utf8-io-split" branch

(1) By Florian Balmer (florian.balmer) on 2023-10-29 12:03:05

Registry keys should be closed by RegCloseKey() instead of RegUnLoadKey(). The latter function is intended for backup and restore.
Querying system information from registry keys is generally discouraged by Microsoft, and the appropriate API calls should be used. For OS version info, there's GetVersionEx() and its variants based on VerifyVersionInfo(). (Note that GetVersionEx() is marked as deprecated, but they can't remove such a frequently used Windows API function without breaking most programs.)
IsValidCodePage() indeed doesn't say anything about whether or not the console supports UTF-8, but only whether WideCharToMultiByte() and MultiByteToWideChar() can convert to/from the specified encoding. Instead, you can nowadays safely assume that any version of the Windows console is capable of Unicode input and output through UTF-16 (back to Windows NT 3.51).
Although your approach to console UTF-8 input and output for the SQLite shell works fine, note that you have way simpler options similar to this patch: use GetConsoleMode() to check whether any input or output handle is really a console, and then only call ReadConsoleW() and WriteConsoleW() to read and write Unicode text--works on any version of Windows, not just on Windows 10 and above.

(2.1) By Larry Brasfield (larrybr) on 2023-10-29 20:50:52 edited from 2.0 in reply to 1

Thanks for your comments.

Registry keys should be closed ...

True.

[On how to get version info]

To my knowledge, improved after hours of reading on the topic, there is no way, by using GetVersion() and related APIs, to actually get the OS major version number via API(s) exposed for that purpose. GetVersion() and GetVersionEx(), as of Windows 8.1, fib about the OS version. Without a manifest, applications using the API are told it's Windows 8. With a manifest, they are told whatever "the application is manifested for".

VerifyVersionInfo() is similarly impaired. The use of it dressed as IsWindows10OrGreater() returns false unless the application is manifested.

I am looking into embedding an application manifest into the CLI built for Windows. But I much prefer a simpler alternative.

Regarding use of the registry "discouraged by Microsoft": Yes, and they have a lot to say about relying on the actual OS version too, including admissions that they impaired the version query APIs because application developers sometimes misuse the information (in their opinion).

IsValidCodePage() indeed doesn't say ... but ...

That's a useful insight. I am now relying on SetConsoleCP() and SetConsoleOutputCP() to reveal console incapability.
(Added via edit:) Sadly, those APIs also fib about success when passing CP_UTF8 results in misbehavior. If you know of a runtime, programmatic test for console UTF-8 capability (and share it), that would be a very welcome tip.

... [UTF-8 I/O] ... note that you have way simpler options ...

I will be experimenting to see how well those alternatives actually work. I am in the process of getting some ancient Windows machines set up here to support that work.

(3) By Florian Balmer (florian.balmer) on 2023-10-29 21:36:51 in reply to 2.1

With a manifest, they are told whatever "the application is manifested for".

I think the version specified in the manifest is the "maximum" version number that will be reported (and simulated) for an application. I recommend adding an application manifest to sqlite3.exe, something simple along the lines of the Fossil application manifest. (This whole block can be dropped if no UI functions from user32.dll are called; OpenSSL linked with Fossil may do this.)

The YORI shell and utilities package also relies on GetVersionEx(), falls back to GetVersion() on older versions of Windows, and looks directly in the PEB for cases where it needs more detail. YORI is developed by a Microsoft programmer who's really familiar with the Windows internals.

If you know of a runtime, programmatic test for console UTF-8 capability (and share it), that would be a very welcome tip.

I'm sorry I don't. Here's an alternative approach, mostly similar to your solution, but still more complicated than using Read/WriteConsoleW():

Some sanity for C and C++ development on Windows

I will be experimenting to see how well those alternatives actually work. I am in the process of getting some ancient Windows machines set up here to support that work.

For any Win32 I/O handle where GetConsoleMode() succeeds, you can just convert from internal UTF-8 to UTF-16 (or vice versa) and use Read/WriteConsoleW(), which are available on any version of Windows NT.
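A minimal sketch of that output path, for illustration only: the helper name put_utf8() is hypothetical and this is not the code in the SQLite shell. It assumes the text is held internally as UTF-8, uses GetConsoleMode() to decide whether stdout is a real console, converts with MultiByteToWideChar(), and writes with WriteConsoleW(). When stdout is a file or a pipe (or the build is not for Windows), the bytes are passed through unchanged.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical helper: write nByte bytes of UTF-8 text to stdout.
** Returns nByte on success, or -1 on failure. */
int put_utf8(const char *zUtf8, int nByte){
  assert( zUtf8!=NULL && nByte>=0 );
  if( nByte==0 ) return 0;
#ifdef _WIN32
  {
    HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD dwMode;
    if( GetConsoleMode(hOut, &dwMode) ){
      /* A real console. nByte UTF-8 bytes never need more than
      ** nByte UTF-16 code units, so that is a safe buffer size. */
      WCHAR *wz = (WCHAR *)malloc(sizeof(WCHAR) * (size_t)nByte);
      DWORD nWritten = 0;
      int nw, ok;
      if( wz==NULL ) return -1;
      fflush(stdout);  /* keep ordering with earlier stdio output */
      nw = MultiByteToWideChar(CP_UTF8, 0, zUtf8, nByte, wz, nByte);
      ok = nw>0 && WriteConsoleW(hOut, wz, (DWORD)nw, &nWritten, NULL);
      free(wz);
      return ok ? nByte : -1;
    }
  }
#endif
  /* Not a console (file or pipe), or not Windows: emit bytes unchanged. */
  return fwrite(zUtf8, 1, (size_t)nByte, stdout)==(size_t)nByte ? nByte : -1;
}
```

A caller would use it as put_utf8(z, (int)strlen(z)). Input would work the same way in reverse: ReadConsoleW() followed by WideCharToMultiByte() with CP_UTF8.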

(4) By Larry Brasfield (larrybr) on 2023-10-29 22:34:25 in reply to 3

This should not be considered a comprehensive reply. Several issues you raise merit a more careful study than can fit into the few hours remaining before the next release. So what follows are just some related observations.

That version hunting code from the YORI shell (which I like) makes me feel a lot better about the relatively short (27 LOC) function just added which gets the needed job done. (I do not claim "better", but anybody perusing the YORI code will see a lot of machination done for the sake of doing it all with calls, and many of those are dynamically bound just as my version grabbing code is.) I notice that the YORI code goes around WIN32 to get at a DLL used by the WIN32 DLLs to help get their work done. To my knowledge, that too is not a "Microsoft supported" or "Microsoft recommended" approach. (This is not a criticism of that code. I understand the problem well enough to know that "clean" solutions are a fantasy, unless the manifest does more wonders than I perceive yet.)
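For readers unfamiliar with the "go around WIN32" style of version query, here is a hedged sketch. It is an assumption for illustration, not the YORI code and not the 27-line function mentioned above: it binds dynamically to RtlGetVersion() in ntdll.dll, which is widely reported to return the real OS version regardless of the application manifest. The function name os_major_version() is hypothetical, and on non-Windows builds it simply returns 0.

```c
#ifdef _WIN32
#include <windows.h>
/* RtlGetVersion() takes an OSVERSIONINFOW and returns an NTSTATUS (a LONG). */
typedef LONG (WINAPI *RtlGetVersionFn)(OSVERSIONINFOW *);
#endif

/* Hypothetical helper: return the OS major version number, or 0 when
** it cannot be determined (including builds for other platforms). */
unsigned long os_major_version(void){
#ifdef _WIN32
  HMODULE hNt = GetModuleHandleW(L"ntdll.dll");
  if( hNt ){
    /* Dynamically bound, so nothing extra is needed at link time. */
    RtlGetVersionFn pRtlGetVersion =
        (RtlGetVersionFn)(void *)GetProcAddress(hNt, "RtlGetVersion");
    OSVERSIONINFOW vi;
    ZeroMemory(&vi, sizeof(vi));
    vi.dwOSVersionInfoSize = sizeof(vi);
    if( pRtlGetVersion && pRtlGetVersion(&vi)==0 ){  /* 0 is STATUS_SUCCESS */
      return vi.dwMajorVersion;
    }
  }
#endif
  return 0;
}
```

A caller could then test os_major_version()>=10 before attempting UTF-8 console mode. As noted above, this is no more "Microsoft recommended" than reading the registry.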

As I mentioned, I will be looking very carefully at avoiding going through the MumbleA() APIs and favoring the MumbleW() APIs. I hesitate to leap ahead on that path before ascertaining whether WIN32S platforms must be supported.

That "Some sanity ..." article is quite interesting. It confirms much of what I have learned and adds a few nuggets to that bolus.

There is one problem which I doubt is solved by sticking to the MumbleW() console I/O APIs. That is the treatment of pastes into the console of UTF-8 content. Getting that to work right was a big part of my focus when I first implemented the -utf8 option. It will be interesting to see how that fares when reading UTF-16 from the console.

(5) By Florian Balmer (florian.balmer) on 2023-10-30 19:30:00 in reply to 4

... calls, and many of those are dynamically bound ...

The dynamic linking to GetVersionEx() is because YORI also supports Windows versions that only have the older GetVersion(), i.e. Windows NT 3.1 and Windows NT 3.51 (IIRC the "Ex"-function was introduced with Windows NT 4.0).

... treatment of pastes into the console of UTF-8 content ...

Yes, that works!

Your new runtime test to check console UTF-8 capabilities is smart, but fragile:

Tracking the amount of horizontal cursor movement after text output, i.e. how many text cells are occupied by the printed text, depends on the console font, and whether all required glyphs are installed for the selected font.

Consider this sample program (compiled to sample.exe by cl sample.c):

#include <windows.h>
#include <stdio.h>
int main(void){
  DWORD dwNumberOfCharsWritten;
  CONSOLE_SCREEN_BUFFER_INFO csbi;
  /* Write the two bytes of UTF-8 encoded U+022B via the ANSI API. */
  WriteConsoleA(
    GetStdHandle(STD_OUTPUT_HANDLE),
    "\xC8\xAB",
    2,
    &dwNumberOfCharsWritten,
    NULL);
  /* See how many text cells the cursor has advanced. */
  GetConsoleScreenBufferInfo(
    GetStdHandle(STD_OUTPUT_HANDLE),
    &csbi);
  printf(
    "\nCONSOLE_SCREEN_BUFFER_INFO.dwCursorPosition.X = %d\n",
    csbi.dwCursorPosition.X);
  return 0;
}

Now run it on a machine and simulate support for console UTF-8 I/O:

>chcp 65001 & sample
Active code page: 65001
ȫ
CONSOLE_SCREEN_BUFFER_INFO.dwCursorPosition.X = 1

Works as expected! Next, run it on a machine that doesn't support UTF-8 mode, but instead remains in the default OEM-US (437) mode:

>chcp 437 & sample
Active code page: 437
╚½
CONSOLE_SCREEN_BUFFER_INFO.dwCursorPosition.X = 2

So far, so good, easy to tell apart. What about a machine with ANSI/OEM Japanese (932) as the default, and lacking UTF-8 mode support?

>chcp 932 & sample
Active code page: 932
ﾈｫ
CONSOLE_SCREEN_BUFFER_INFO.dwCursorPosition.X = 2

It's just one glyph, but it happens to work because the glyph is "wide" and takes up two console text cells. However, this may look different if another console font is chosen.

So try with ANSI/OEM Traditional Chinese (950) as the default:

>chcp 950 & sample
Active code page: 950

CONSOLE_SCREEN_BUFFER_INFO.dwCursorPosition.X = 1

Failed. That's because 0xC8 is a lead byte to start a two-byte sequence in most East Asian DBCS encodings. In this case, it's a code point in the Private Use Area, without a defined width, to mimic the UTF-8 case.
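The byte-level reason can be sketched in portable C. This is an illustration under stated assumptions, not part of the original test: the names Enc, char_len() and count_chars() are hypothetical, the lead-byte ranges are the documented ones for these Windows code pages as far as I know (0x81-0x9F and 0xE0-0xFC for 932, 0x81-0xFE for 936 and 950), and trail bytes are not validated. On Windows itself, IsDBCSLeadByteEx() answers the same question for the installed code pages.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical set of encodings, for this illustration only. */
typedef enum { ENC_UTF8, ENC_CP437, ENC_CP932, ENC_CP936, ENC_CP950 } Enc;

/* Number of bytes in the character that starts with byte c (simplified:
** trail bytes and UTF-8 continuation bytes are not validated). */
static int char_len(Enc e, unsigned char c){
  switch( e ){
    case ENC_UTF8:
      if( c>=0xF0 ) return 4;
      if( c>=0xE0 ) return 3;
      if( c>=0xC2 ) return 2;
      return 1;
    case ENC_CP932:   /* Shift-JIS: 0xA1-0xDF are single-byte katakana */
      return ((c>=0x81 && c<=0x9F) || (c>=0xE0 && c<=0xFC)) ? 2 : 1;
    case ENC_CP936:   /* GBK */
    case ENC_CP950:   /* Big5 */
      return (c>=0x81 && c<=0xFE) ? 2 : 1;
    default:          /* single-byte code pages such as 437 */
      return 1;
  }
}

/* Count the characters that n bytes at z decode to under encoding e. */
int count_chars(Enc e, const unsigned char *z, size_t n){
  int count = 0;
  size_t i = 0;
  assert( z!=NULL || n==0 );
  while( i<n ){
    i += (size_t)char_len(e, z[i]);
    count++;
  }
  return count;
}
```

For the probe bytes 0xC8 0xAB this yields 1 character under UTF-8, 936 and 950, but 2 under 437 and 932. A character count is not a cell count, though: how many text cells each character occupies still depends on the console font, which is what makes the cursor-position test fragile.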

Go with MumbleW()--and you can forget about any such worries!

(6) By Florian Balmer (florian.balmer) on 2023-10-30 20:26:42 in reply to 5

The example with ANSI/OEM Japanese (932) is flawed: it's two narrow-width characters that should each take up only a single text cell, so they occupy two cells in total regardless.

But let's consider ANSI/OEM Simplified Chinese (936):

>chcp 936 & sample
Active code page: 936
全
CONSOLE_SCREEN_BUFFER_INFO.dwCursorPosition.X = 2

One two-byte character that happens to be wide and occupy two text cells, but with another font, it may only take up a single text cell.

(7.1) By Larry Brasfield (larrybr) on 2023-10-30 20:56:36 edited from 7.0 in reply to 6

(Responding to posts 5 and 6:)

Your new runtime test to check console UTF-8 capabilities is smart, but fragile: ...

I studied your "failed" example closely. Can you suggest a better crafted trial character or character sequence that avoids the issue of MBCS group sizes matching UTF-8 group sizes? The characters chosen should require more glyphs when decoded from the byte stream using code pages other than CP_UTF8, then rendered, than when decoded as UTF-8, then rendered.

In the meantime, I will investigate to see if there is a systematic way to avoid the problem while still detecting absence of UTF-8 console I/O capability at runtime.

The failure in question, to be manifested, is going to require a combination of a pre-Win10 OS, use of a truly MBCS code page, use of the CLI (without the -no-utf8 option), and a font with double-wide glyphs. (The case of fonts with missing glyphs is not too concerning; the result will be one form of gibberish in lieu of another.) I do not see this as meriting a delay of the pending 3.44 release. Much of the reason for the new UTF-8 console support is to improve usability for those customers using the CLI on Windows, the majority of which we expect to be using Windows 10. If they are really attached to their older platforms, the --no-utf8 option should get them the older rendering.

Please do not take this as a lack of appreciation regarding your remarks.

Go with MumbleW()--and you can forget about any such worries!

I've seen that assertion before. And as I've said before, I intend to test its truthiness. I hope it is true. I would even say it ought to be true. But getting to "it is true" and "it should now be used" is going to take some more work.

(8) By Florian Balmer (florian.balmer) on 2023-10-30 21:18:17 in reply to 7.1

Can you suggest a better crafted trial character or character sequence that avoids the issue of MBCS group sizes matching UTF-8 group sizes?

I'm sorry I don't have any, right now. Windows supports quite a few legacy code pages, with the most common ones probably listed here:

https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/

You'd need a UTF-8 sequence of two bytes or longer, i.e. one starting with a lead byte equal to or greater than 0xC2 (bytes 0x80 through 0xBF are continuation bytes).

But for CP-936, any byte greater than 0x80 is a DBCS lead byte, with the potential to produce a single glyph, maybe a CJK ideograph that is wide with most fonts, but I'm not sure.

Also, zero-width code points like ZWSP (U+200B) don't seem to work, either.

I'm sorry, I really don't have any idea, right now.

(9) By Larry Brasfield (larrybr) on 2023-10-30 23:27:05 in reply to 6

Would you please be so kind as to verify that this check-in is highly likely to avoid the issues you have raised concerning mere glyph counting?

I am busy getting a multitude of code pages installed on an old clunker laptop running Windows 7 thanks to your observations.

(10) By Florian Balmer (florian.balmer) on 2023-10-31 05:17:05 in reply to 9

Clever trick! I'll give it a try on my various Windows machines tonight!

(11) By Florian Balmer (florian.balmer) on 2023-10-31 17:40:43 in reply to 9

Sir, my congratulations, you did it!

Successfully tested with:

Windows XP
Windows 7
Windows 10 (new / legacy console mode)
Windows 11 (new / legacy console mode, Windows Terminal)

(Out of curiosity, I wanted to run the test on Windows NT 4.0, but didn't have the tools at hand to set the minimum OS version to 4.0 in the executable image.)

I also re-checked that copy-pasting Unicode text into the console prompt works fine with MumbleW() on all tested systems, using the program below:

/*
**  NOTE: For input cancelled by Ctrl+C, ReadConsoleW() seems to return TRUE,
**  but with dwNumberOfCharsRead set to 0, and GetLastError() returning
**  ERROR_OPERATION_ABORTED.
*/
#include <windows.h>
#include <stdio.h>
int main(void){
  DWORD dwConsoleMode;
  HANDLE hStdIn = GetStdHandle(STD_INPUT_HANDLE);
  HANDLE hStdOut = GetStdHandle(STD_OUTPUT_HANDLE);
  /* Bail out unless stdin is a real console. */
  if( !GetConsoleMode(hStdIn,&dwConsoleMode) ) return 1;
  SetConsoleCtrlHandler(NULL,TRUE); // Handled by ReadConsoleW()
  SetConsoleMode(
    hStdIn,
    ENABLE_ECHO_INPUT|ENABLE_LINE_INPUT|ENABLE_PROCESSED_INPUT);
  while( TRUE ){
    WCHAR wchBuf[10] = {0};
    DWORD cchRead = (DWORD)-1;
    DWORD cchWritten;
    BOOL bSuccess;
    bSuccess = ReadConsoleW(
                 hStdIn,wchBuf,ARRAYSIZE(wchBuf)-1,&cchRead,NULL);
    if( bSuccess ){
      /* Check for Ctrl+C before another API call can alter the last error. */
      if( cchRead==0 && GetLastError()==ERROR_OPERATION_ABORTED ){
        break;
      }
      /* Echo the UTF-16 text that was read (typed or pasted). */
      WriteConsoleW(hStdOut,wchBuf,lstrlenW(wchBuf),&cchWritten,NULL);
    }
    else{
      printf("ReadConsoleW(): Error=%lu\n",(unsigned long)GetLastError());
      break; /* Avoid spinning forever on a persistent read error. */
    }
  }
  SetConsoleMode(hStdIn,dwConsoleMode);
  return 0;
}

(12) By Larry Brasfield (larrybr) on 2023-10-31 20:23:03 in reply to 11

Successfully tested with: ...

Thanks for doing that. Much appreciated.

I also re-checked that copy-pasting Unicode text into the console prompt ...

Well, that's encouraging. By this I mean that it helps justify and motivate doing the work and facing the risk associated with substantial code changes.

I fully appreciate the code simplification to be gained by using the whateverW APIs for console I/O on Windows. Such changes should make the CLI work better on older Windows machines, with respect to faithful character I/O at the console and speed (for all that matters). It might eliminate the need for a capability check.

Something along those lines will likely be done for the successor to the 3.44 release. There are complicating considerations, such as benign or zero interaction with at least one line-editing library that works on Windows as well as other platforms. (This may affect partitioning rather than whateverW() calls.) Another aspect is we want to work well with Windows Terminal as well as the legacy console. Compatibility with SSH console sessions may be important. (This might motivate using UTF-8 for I/O despite the attraction of the ?W() APIs.) We want it to work right when built with commonly used non-Microsoft toolset(s).

Whatever is done will be done carefully, with plenty of review and testing before we commit it to a release. This is why the code you see today appears a bit clunky. It was written when Windows 95 (without the WCHAR APIs) was a supported target. Now that only Windows operating systems in the NT family and WinRT are supported WIN32 targets, we have more latitude to institute changes.