
Bad characters occasionally displayed when writing lots of identical UTF-8 lines #386

Closed
JFLarvoire opened this issue Mar 12, 2019 · 36 comments
Labels

  • Area-Output: Related to output processing (inserting text into buffer, retrieving buffer text, etc.)
  • Issue-Bug: It either shouldn't be doing this or needs an investigation.
  • Needs-Tag-Fix: Doesn't match tag requirements
  • Priority-2: A description (P2)
  • Product-Conhost: For issues in the Console codebase
  • Work-Item: It's being tracked by an actual work item internally. (to be removed soon)

Comments

@JFLarvoire

  • Your Windows build number: (Type ver at a Windows Command Prompt)
    Microsoft Windows [Version 10.0.18342.8]
    (But this happens on all versions of Windows since at least Windows 7.)

  • What you're doing and what's happening: (Copy & paste specific commands and their output, or include screen shots)
    In the cmd.exe console, set to code page 65001, type a large UTF-8 file containing hundreds of short lines with just the characters "ü€ü€ü€ü€ü€"

  • What's wrong / what should be happening instead:
    95% of the lines are displayed correctly, but about 5% contain spurious characters. Ex: "ü���ü€ü€ü€ü€"
    This happens with any application, not just cmd.exe's built-in TYPE command. So I suspect this is a bug in the UTF-8 to Unicode conversion routine in the console output handler.
    For more details, and files and scripts for reproducing it, see the discussion thread on this forum: https://www.dostips.com/forum/viewtopic.php?f=3&t=9017

@zadjii-msft
Member

I've filed this internally as MSFT:20812701 to make sure this gets looked at. Thanks for the report!

@zadjii-msft zadjii-msft added Work-Item It's being tracked by an actual work item internally. (to be removed soon) Product-Conhost For issues in the Console codebase Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) labels Mar 12, 2019
@german-one
Contributor

german-one commented Mar 12, 2019

Seems there is a buffer in place that causes this bug. If the buffer is filled but the bytes read end somewhere inside a multi-byte UTF-8 character, the conversion of the buffer tail fails, and the conversion routine puts as many Unicode replacement characters into the output buffer as there were invalid bytes at the end of the buffer. The same happens with the next chunk, which begins with the bytes left over from the last multi-byte character of the previous chunk. The example above shows three replacement characters, which is the number of bytes the € sign takes in UTF-8.

@german-one
Contributor

german-one commented Mar 12, 2019

In the hope that this will help to fix the issue, I offer a minimalist C program that works around it. The code takes care of character boundaries. (In addition, it discards the Byte Order Mark, which is a different issue.)

/* NormalizeUTF8.c */

#define WIN32_LEAN_AND_MEAN
#include <windows.h>

// define this macro as one of these values
//  1 UTF-16 using WriteConsoleW (UTF-16 using WriteFile doesn't work at all btw)
//  2 original encoding using WriteConsoleA
//  3 original encoding using WriteFile
#define OUT_PROC 1

// limit of characters to be read at once
#define CHUNK_MAX 256

int main(void)
{
  // (Every 4 in the following declarations could also have been a 3. I used 4 for proper buffer alignment.)
  CHAR   a_chunk[CHUNK_MAX + 4]                   = { 0 }; // buffer for the unconverted chunk read
# if (OUT_PROC == 1)
  WCHAR  w_chunk[(CHUNK_MAX + 4) * sizeof(WCHAR)] = { 0 }; // buffer for the converted chunk
# endif
  UCHAR  extended[4]                              = { 0 }; // buffer for additional UTF-8 code units that usually have to be saved for the next chunk
  DWORD  a_length                                 = 0u,    // length of a chunk read
# if (OUT_PROC == 1)
         w_length                                 = 0u,    // length of a UTF-16 chunk
# endif
         extended_length                          = 0u,    // number of additional UTF-8 code units read
         idx                                      = 0u;    // loop index
  UINT   codepage                                 = GetConsoleCP();
  HANDLE h_in                                     = GetStdHandle(STD_INPUT_HANDLE),
         h_out                                    = GetStdHandle(STD_OUTPUT_HANDLE);

  // *** check whether an UTF-8 input stream has a Byte Order Mark ***
  if (codepage == CP_UTF8)
  {
    ReadFile(h_in, extended, 3ul, &extended_length, NULL);
    if (extended[0] == 0xEF && extended[1] == 0xBB && extended[2] == 0xBF)
      extended_length = 0u; // discard the BOM
  }

  // process the data of the standard input in a loop
  for (;;)
  {
    // copy UTF-8 code units that were remaining from the formerly read chunk (if any)
    for (idx = 0u; idx < extended_length; ++idx)
      a_chunk[idx] = (CHAR)extended[idx];

    // try to read data from standard input
    ReadFile(h_in, &a_chunk[extended_length], CHUNK_MAX - extended_length, &a_length, NULL);
    a_length += extended_length;
    extended_length = 0u;

    // quit the iteration if no data was read
    if (a_length == 0u)
      break;

    // *** if necessary, read up to three more bytes until the end of a UTF-8 multi-byte character is reached: ***
    // Was the last byte in the buffer part of a UTF-8 multi-byte character?
    if (codepage == CP_UTF8 && (UCHAR)a_chunk[a_length - 1u] >> 7u == 1u)
    {
      for (idx = 0u; idx < 3u; ++idx)
      {
        ReadFile(h_in, extended, 1u, &extended_length, NULL);
        if (extended_length == 1u)
        {
          // Was the byte read an ASCII character or the beginning of the next multi-byte character?
          if (extended[0] >> 7u == 0u || extended[0] >> 6u == 3u)
            break;

          a_chunk[a_length++] = extended[0];
          extended_length = 0u;
        }
        else
          break;
      }
    }

#   if (OUT_PROC == 1)
    // convert the characters read to UTF-16
    w_length = MultiByteToWideChar(codepage, 0u, a_chunk, a_length, w_chunk, (CHUNK_MAX + 4) * sizeof(WCHAR));

    // write the chunk to the console window buffer
    WriteConsoleW(h_out, w_chunk, w_length, &w_length, NULL);
#   elif (OUT_PROC == 2)
    WriteConsoleA(h_out, a_chunk, a_length, &a_length, NULL);
#   elif (OUT_PROC == 3)
    WriteFile(h_out, a_chunk, a_length, &a_length, NULL);
#   endif
  }

  return 0;
}

In a CMD prompt you may use it like this:

chcp 65001
type "utf-8.txt" | NormalizeUTF8.exe

@german-one
Contributor

I updated the test code above. Please see my comments related to the OUT_PROC macro.
It seems to prove that the application is the culprit rather than the console. As long as the application takes care of character boundaries the console does process UTF-8 correctly. I get the same results using both the new console and the legacy console.

@ghost ghost added the Needs-Tag-Fix Doesn't match tag requirements label May 17, 2019
@DHowett-MSFT
Contributor

@miniksa This is a lot like what we saw with CMD and WSL bringing up emoji for the build demo. Marking as a bug.

@DHowett-MSFT DHowett-MSFT added the Issue-Bug It either shouldn't be doing this or needs an investigation. label May 18, 2019
@ghost ghost removed the Needs-Tag-Fix Doesn't match tag requirements label May 18, 2019
@german-one
Contributor

german-one commented May 18, 2019

@DHowett-MSFT May I ask: in the case of emojis, split UTF-16 surrogates, I guess?

@DHowett-MSFT
Contributor

That's possible. There's also a chance that we are getting the 3-byte UTF-8 encoded emoji in two parts from the application.

@german-one
Contributor

I'm convinced you guys from the terminal/console team are nice and will work around that kind of faulty input from an application, right? 😃

@DHowett-MSFT
Contributor

As best we can!

@miniksa miniksa self-assigned this May 21, 2019
@miniksa
Member

miniksa commented May 21, 2019

I'm taking this because it feels similar to some of the other bugs I have assigned to me talking about U+FFFD.

@DHowett-MSFT
Contributor

DHowett-MSFT commented May 25, 2019

Even if there's an issue in conhost, I'm convinced there's an issue in Terminal as well.

char* pchStr = (char*)(buffer);
std::string str{pchStr, dwRead};
auto hstr = winrt::to_hstring(str);

We are absolutely cutting UTF-8 sequences in half right there.

We can do fun games and cache partial sequences (partial code below), but it does add complexity.
It's probably necessary complexity!

if ((buffer[iLastGoodByte] & 0b1'0000000) == 0b1'0000000)
{
    // does the encoding fit?
    const auto leadByte = buffer[iLastGoodByte];
    auto sequenceLen = dwRead - iLastGoodByte;
    if ((((leadByte & 0b111'00000) == 0b110'00000) && sequenceLen < 2) ||  // lead reqs 2 bytes
        (((leadByte & 0b1111'0000) == 0b1110'0000) && sequenceLen < 3) ||  // lead reqs 3 bytes
        (((leadByte & 0b11111'000) == 0b11110'000) && sequenceLen < 4))    // lead reqs 4 bytes
    {
        ::memmove_s(utf8Partials, std::extent<decltype(utf8Partials)>::value, buffer + iLastGoodByte, dwRead - iLastGoodByte);
        // ... do the rest ...

I hacked it up to write 0 1 2 3 over the lead bytes where we cut off the ends of partial sequences, and A B C D over the trail bytes where we cut off the beginnings of partial sequences, and we get this:

image

(you can see where one ReadFile ended and the next began where we transitioned from numbers to letters.)

It wouldn't be strictly easier to just ReadFile until we got a complete UTF-8 codepoint because we would still need to figure out when we didn't have a complete codepoint.

@german-one
Contributor

@DHowett-MSFT

It wouldn't be strictly easier to just ReadFile until we got a complete UTF-8 codepoint

Sorry for my ignorance, but I don't understand this statement. Correct me if I'm wrong, but passing a string that contains an incomplete codepoint to any other function adds risk (just as to_hstring() irreversibly corrupts the string in the current code). Wouldn't it always be better to make sure you forward only complete codepoints from the very beginning?

@DHowett-MSFT
Contributor

Sorry, that’s what this proposal is! When we encounter an incomplete codepoint we need to wait for the next trip around the I/O loop to complete it. We can still send everything up to but not including the incomplete codepoint.

My statement about ReadFile was moreso that it is safer/easier to cache the partial data and wait for the next read loop instead of trying to expand the buffer and read a couple more bytes immediately.

@DHowett-MSFT
Contributor

DHowett-MSFT commented May 25, 2019

This is almost certainly the right place to use @adiviness's Utf8ToWideCharParser!

Module Name:
- utf8ToWideCharParser.hpp
Abstract:
- This transforms a multi-byte character sequence into wide chars
- It will attempt to work around invalid byte sequences
- Partial byte sequences are supported
Author(s):
- Austin Diviness (AustDi) 16-August-2016

To be done:

  • move it from host/ to types/ (check dependencies first)
  • consider: we trust that conpty will send us well-formed sequences, so we do not need the expensive full-buffer check (O(N)) that U8tWCP does; we only need to check the last 3 bytes of the string for longer-than-will-fit sequences
  • U8tWCP doesn't do null termination, but an HSTRING reference requires null termination on construction; we can work around this with a moderately expensive copy (O(N))

@german-one
Contributor

german-one commented May 26, 2019

I was about to write a piece of code when I saw you already have a parser. But now that you added some more notes I finished it in the hope it is a little helpful.

// NormalizeUTF8.cpp

#include "pch.h"
#include <string>
#include <string_view>  // std::string_view (used below)
#include <algorithm>
#include <type_traits>
#include <windows.h>
// winrt::to_hstring / winrt::hstring are assumed to come in via pch.h (e.g. winrt/base.h)

bool GetConsoleVTHandles(HANDLE& refHIn, HANDLE& refHOut, HANDLE& refHErr)
{
  HANDLE hIn{ GetStdHandle(STD_INPUT_HANDLE) };
  HANDLE hOut{ GetStdHandle(STD_OUTPUT_HANDLE) };
  HANDLE hErr{ GetStdHandle(STD_ERROR_HANDLE) };
  if (hIn == INVALID_HANDLE_VALUE || hOut == INVALID_HANDLE_VALUE || hErr == INVALID_HANDLE_VALUE)
    return false;

  DWORD dwConsoleMode{};

  if (GetFileType(hIn) == FILE_TYPE_CHAR &&
      (!GetConsoleMode(hIn, &dwConsoleMode) || !SetConsoleMode(hIn, dwConsoleMode | ENABLE_VIRTUAL_TERMINAL_INPUT)))
    return false;

  if (GetFileType(hOut) == FILE_TYPE_CHAR &&
      (!GetConsoleMode(hOut, &dwConsoleMode) || !SetConsoleMode(hOut, dwConsoleMode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | DISABLE_NEWLINE_AUTO_RETURN)))
    return false;

  if (GetFileType(hErr) == FILE_TYPE_CHAR &&
      (!GetConsoleMode(hErr, &dwConsoleMode) || !SetConsoleMode(hErr, dwConsoleMode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | DISABLE_NEWLINE_AUTO_RETURN)))
    return false;

  refHIn = hIn;
  refHOut = hOut;
  refHErr = hErr;
  return true;
}

int main()
{
  HANDLE hIn{ INVALID_HANDLE_VALUE };
  HANDLE hOut{ INVALID_HANDLE_VALUE };
  HANDLE hErr{ INVALID_HANDLE_VALUE };
  if (!GetConsoleVTHandles(hIn, hOut, hErr))
    return 1;

  BYTE buffer[256]{ 0 };     // buffer for the chunk read (small size for debugging)
  BYTE utf8Partials[4]{ 0 }; // buffer for code units of a partial UTF-8 code point that have to be cached
  DWORD dwRead{};            // length of a chunk read
  DWORD dwPartialsLen{};     // number of cached UTF-8 code units
  bool fSuccess{};
  bool fStreamBegin{};       // indicates whether text is left in utf8Partials after the BOM check
  const BYTE bitmasks[]{ 0, 0b11000000, 0b11100000, 0b11110000 }; // for comparisons after the Lead Byte was found

  // *** check whether an UTF-8 input stream has a Byte Order Mark ***
  fSuccess = !!ReadFile(hIn, utf8Partials, 3UL, &dwPartialsLen, nullptr);
  if (!fSuccess)
    return 1;

  if (utf8Partials[0] == 0xEF && utf8Partials[1] == 0xBB && utf8Partials[2] == 0xBF)
    dwPartialsLen = 0; // discard the BOM
  else
    fStreamBegin = true;


  // process the data of the standard input in a loop
  while (true)
  {
    // copy UTF-8 code units that were remaining from the previously read chunk (if any)
    if (dwPartialsLen != 0)
      std::move(utf8Partials, utf8Partials + dwPartialsLen, buffer);

    // try to read data from standard input
    fSuccess = !!ReadFile(hIn, &buffer[dwPartialsLen], std::extent<decltype(buffer)>::value - dwPartialsLen, &dwRead, nullptr);

    dwRead += dwPartialsLen;
    dwPartialsLen = 0;

    if (dwRead == 0) // quit the iteration if no data has been read and no cached data was left
      break;
    else if (!fSuccess && !fStreamBegin) // cached data without bytes for completion in the buffer
      return 1;

    const BYTE* const endPtr{ buffer + dwRead };
    const BYTE* backIter{ endPtr - 1 };
    // *** if necessary put up to three bytes in the cache: ***
    if ((*backIter & 0b10000000) == 0b10000000) // last byte in the buffer was a byte belonging to a UTF-8 multi-byte character
    {
      for (DWORD dwSequenceLen{ 1UL }, stop{ dwRead < 4UL ? dwRead : 4UL }; dwSequenceLen < stop; ++dwSequenceLen, --backIter) // check only up to 3 last bytes, if no Lead Byte was found then the byte before must be the Lead Byte and no partials are in the buffer
      {
        if ((*backIter & 0b11000000) == 0b11000000) // Lead Byte found
        {
          if ((*backIter & bitmasks[dwSequenceLen]) != bitmasks[dwSequenceLen - 1]) // if the Lead Byte indicates that the last bytes in the buffer is a partial UTF-8 code point then cache them
          {
            std::move(backIter, endPtr, utf8Partials);
            dwRead -= dwSequenceLen;
            dwPartialsLen = dwSequenceLen;
          }

          break;
        }
      }
    }

    const std::string_view strView{ reinterpret_cast<char*>(buffer), dwRead };
    auto hstr{ winrt::to_hstring(strView) };

    // write the chunk to the console window buffer (as replacement for _outputHandlers() here)
    DWORD  dwWritten{};
    fSuccess = !!WriteConsoleW(hOut, hstr.c_str(), hstr.size(), &dwWritten, nullptr);
    if (!fSuccess)
      return 1;

    fStreamBegin = false;
  }

  return 0;
}

It also still addresses the BOM issue (the blank that appears in the first screenshot). Please tell me if you want me to report it separately.

image
image

Attached you'll find a testing environment with the compiled app.
NormalizeUTF8.zip

EDIT: Code and attachment updated.
EDIT 2: VT Processing enabled for additional tests

@german-one
Contributor

I updated the latest test code again (with VT Processing enabled) in order to see how it might be related to the occurrence of � (Unicode Replacement Character U+FFFD) as they appear in #455 and #666.
For that reason I've stolen @PhMajerus ' test pattern, converted it to UTF-8, and attached it to the above ZIP archive.
TYPE only:
image

Piped to the test code:
image

The replacement characters don't appear at the same positions as in the linked issues. But I guess this is due to different sizes of character buffers in the Console and Terminal source codes.

@german-one
Contributor

It was not too hard to update the function that @DHowett-MSFT quoted in #386 (comment)
Unfortunately, this doesn't seem to be the only function that has to be updated, because that change alone isn't sufficient to close this issue yet.
Could you guys point me to more source files that are worth reviewing? I'd like to contribute, but it's hard for me to figure out where I have a chance of finding the culprit.

@miniksa
Member

miniksa commented Jul 1, 2019

Uhhhh, what's still wrong after your proposed test change? That would help me better narrow down which files might still be suspect.

If you're looking inside the console host for problems, I'd personally breakpoint on the WriteConsole* methods inside _stream.cpp and chase them around the code to see how the segments of text received off of the APIs are being handled.

(I think you're trying to fix the console host side given the screenshots above of just CMD.exe in a plain conhost window.)

@german-one
Contributor

Thank you @miniksa

The screenshots above were made before I forked the source code. So yes, that was of course for conhost only. The function that @DHowett-MSFT quoted, and that I tried to update, is the interface between the terminal and conhost (at least that's my rough understanding of how it is linked together; correct me if I'm wrong). You can find it in this commit:
german-one@ea46579

what's still wrong after your proposed test change?

Well, nothing changed yet. I didn't expect that this would already close this issue because the buffer size in this function is 4096 bytes but the U+FFFD character appears more often. I suspect there is a much smaller buffer involved somewhere else. However, the update also includes the BOM removal. And at least that is what I expected to see.

That would help me better narrow down which files might still be suspect.

Some more screenshots might be helpful.

  1. PowerShell never showed that behavior. Neither Windows PowerShell ...
    pswin

... nor PowerShell 6.
ps6

Neither stand-alone nor in the terminal.
terminal_ps6

  2. Ubuntu bash has a problem with the BOM (as Linux has in general).
    bash
    terminal_ubuntu

  3. CMD fails entirely. For completeness, again in the default conhost ...
    conwin

... and with the updated code.
conupdate
terminal_cmd
At least in the last screenshot I would have expected the BOM to no longer appear (the blank in the top left corner of the pattern).
FWIW the misaligned text ".majerus.net/" is not related to my update. Probably another issue.

I'd personally breakpoint on the WriteConsole* methods inside _stream.cpp and chase them around the code to see how the segments of text received off of the APIs are being handled.

I'll try my best.

Steffen

@miniksa
Member

miniksa commented Jul 2, 2019

Oh. The difference between Powershell/PWSH, CMD, and WSL is that they're likely all using different Write* method entrypoints.

I forget which is which, but they're probably using a selection of WriteConsoleA, WriteConsoleW, WriteConsoleOutputA, WriteConsoleOutputW, WriteFile (which should be serviced the same as WriteConsoleA).

The post-API-decoding-calls should be findable by searching the code for ApiRoutines::WriteConsoleAImpl and friends.

It's probably that CMD is using one and the others are using others and the one that CMD is using is broken somehow.

The other thing to watch out for is whether they're calling SetConsoleCP or SetConsoleOutputCP. It's likely/probable that Powershell, PWS, and WSL are either setting the codepage to 65001 (UTF-8) and using the A versions of the APIs OR using the W versions of the APIs with UTF-16 text directly.

But CMD.exe doesn't usually do any of that. It's probably using whatever codepage the system is in and writing everything into the A API without thinking about it.

@german-one
Contributor

But CMD.exe doesn't usually do any of that. It's probably using whatever codepage the system is in and writing everything into the A API without thinking about it.

Yes I'm aware of that fact. If you look at the prompts in my screenshots you'll see that I always have to use chcp 65001 in the cmd.exe instances. Otherwise you would see the character representation of every single byte in the default OEM codepage.

WriteConsoleAImpl is certainly a good place to look. It involves the Utf8ToWideCharParser class, where I don't expect to find the culprit. The member functions of the class use MultiByteToWideChar to check whether all code points are valid and to convert the UTF-8 buffer to UTF-16. All good so far. I expect that the culprit is the caller of WriteConsoleAImpl. My test code proves that UTF-8 itself is not the real problem. But if the caller passes chunks containing partial code points (which is what I correct in my test code), then it might already be too late for the API to work around the issue. Imo we would need a static cache for the partials in WriteConsoleAImpl. But I'm not sure if that would cause more issues. If we have a static cache, would it influence concurrent calls from other processes? Or does every process own an independent instance of the function?

@german-one
Contributor

Um ... wait ... we already have a static instance of Utf8ToWideCharParser in WriteConsoleAImpl. That answers my question and I need to have a closer look at Utf8ToWideCharParser.

@miniksa
Member

miniksa commented Jul 2, 2019

There cannot be concurrent calls in that we cannot service two calls at the exact same time because they're queued and the console is locked.

However, there can be concurrent calls in that Process A can write a partial fragment, then Process B's request is the next in queue and can write some other unrelated fragment, then Process A is next to write its remainder. So a static cache will leave some sort of hole in the quest for a perfect solution.

However, I'm pretty sure Utf8ToWideCharParser right now ignores this problem and effectively IS a static cache for the partials when the codepage is set to 65001.

There is also another static cache for other codepages called WriteConsoleDbcsLeadByte hanging off of the SCREEN_INFORMATION class that is supposed to hold the first half of a split double-byte sequence (for codepages like 932).

That's been that way for decades and is probably why we ignored the concurrence problem when writing Utf8ToWideCharParser in the more recent era.

The caller of WriteConsoleAImpl can be found in the server library, but that library is basically just decoding the entire API packet as it receives it. It will call WriteConsoleAImpl once with the entire contents of the packet that it received from the calling process. So if anyone's tearing bytes in half, it should be the caller on the other end.

But Utf8ToWideCharParser and WriteConsoleDbcsLeadByte are supposed to handle that torn state for 65001 and other code pages respectively already.

@german-one
Contributor

german-one commented Jul 2, 2019

We posted at the same time 😀 I didn't see the static declaration in the first place. However, thanks for the explanation and insight.

So if anyone's tearing bytes in half, it should be the caller on the other end.

Yes, that's exactly what I believe is the reason for this problem. See what I did in my test code. If conhost gets the data directly from cmd.exe (using its internal TYPE command), then we are facing this issue. If the same data is piped to my test app, which takes care of incoming partials, so that conhost gets chunks that no longer contain partials, then the problem disappears. So the actual question is whether or not conhost is supposed to work around the deficiencies of applications like cmd.exe 😉 It would be nice, no doubt.

@german-one
Contributor

german-one commented Jul 3, 2019

Update:
I was not able to find anything wrong in Utf8ToWideCharParser. Thus, I banned it entirely from WriteConsoleAImpl and implemented my own quick and dirty UTF-8 handling. Result was that nothing changed for the CMD output. But the BOM is getting removed in the WSL output.
@miniksa What do you think? Is the BOM removal something that should be included in the Utf8ToWideCharParser class? My expectation is that the BOM should not show up in the output. But on the other hand, \EF \BB \BF should be treated as a BOM only if it is the beginning of the stream.

Next step for me is to review WriteConsoleOutputCharacterAImpl in _output.cpp. There is a real good chance to find the culprit 😀😀 It calls ConvertToW in convert.cpp. This function converts the incoming chunk to UTF-16 without any check for partial code points. I'll let you know ...

@miniksa
Member

miniksa commented Jul 3, 2019

I'm of two minds on the BOM thing.

The first is that it's impossible to tell really what is the beginning of stream without some heuristics or further state on when a process connected and if it's the first thing that it has said. Probably more complicated than we want to deal with.

The second is that I can't think of a valid use for echoing a BOM to a Console/Terminal in the slightest and we should probably just wholesale discard them even though we should technically only do that as the "start of stream" rule. But instead of discarding them here, I think I'd rather pass it along further in the Utf8ToWideCharParser portion of the class and let the TextBuffer treat the BOM as a control character that "does nothing" than to strip it out in the parser layer. I have an agenda item on me to effectively add a WriteStream method to the TextBuffer to attempt to supercede both WriteCharsLegacy in conhost.exe and the WriteCharsLegacy2ElectricBoogaloo method in windowsterminal.exe. Those methods are basically responsible for stripping controls and performing an action anyway so it seems like an apt location to put something like this.

In short, leave the BOM alone for now and we'll revisit.

For the other part, GOOD CALL. You're right, the WriteConsoleOutputCharacterA method is likely doing things quite differently with ConvertToW. I always wanted to reconcile how the cell-write methods (OutputCharacter/OutputAttribute) worked versus the stream-write methods (Output) in terms of codepage conversion, but we never got to that. I forgot about that one when I was writing this yesterday. Hope you find something fruitful there and thank you for continuing to dig into this!

@german-one
Contributor

Thanks for your advice regarding the BOM.

Unfortunately, updating the WriteConsoleOutputCharacterA function still didn't resolve the CMD issue. Meanwhile I'm wondering whether CMD.exe determines the file type of the standard output handle and, in case it is conhost, converts the stream to UTF-16 internally. In this case I'm afraid we are out of luck.
I think I should write another small test app that does nothing but forwarding the piped stream from cmd.exe in order to investigate how it behaves.

@german-one
Contributor

CMD.exe is an evil beast 👹

another test code:

#include "pch.h"
#include <string>
#include <algorithm>
#include <type_traits>
#include <windows.h>

bool GetConsoleVTHandles(HANDLE& refHIn, HANDLE& refHOut, HANDLE& refHErr)
{
  HANDLE hIn{ GetStdHandle(STD_INPUT_HANDLE) };
  HANDLE hOut{ GetStdHandle(STD_OUTPUT_HANDLE) };
  HANDLE hErr{ GetStdHandle(STD_ERROR_HANDLE) };
  if (hIn == INVALID_HANDLE_VALUE || hOut == INVALID_HANDLE_VALUE || hErr == INVALID_HANDLE_VALUE)
    return false;

  DWORD dwConsoleMode{};

  if (GetFileType(hIn) == FILE_TYPE_CHAR &&
      (!GetConsoleMode(hIn, &dwConsoleMode) || !SetConsoleMode(hIn, dwConsoleMode | ENABLE_VIRTUAL_TERMINAL_INPUT)))
    return false;

  if (GetFileType(hOut) == FILE_TYPE_CHAR &&
      (!GetConsoleMode(hOut, &dwConsoleMode) || !SetConsoleMode(hOut, dwConsoleMode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | DISABLE_NEWLINE_AUTO_RETURN)))
    return false;

  if (GetFileType(hErr) == FILE_TYPE_CHAR &&
      (!GetConsoleMode(hErr, &dwConsoleMode) || !SetConsoleMode(hErr, dwConsoleMode | ENABLE_VIRTUAL_TERMINAL_PROCESSING | DISABLE_NEWLINE_AUTO_RETURN)))
    return false;

  refHIn = hIn;
  refHOut = hOut;
  refHErr = hErr;
  return true;
}

int main()
{
  HANDLE hIn{ INVALID_HANDLE_VALUE };
  HANDLE hOut{ INVALID_HANDLE_VALUE };
  HANDLE hErr{ INVALID_HANDLE_VALUE };
  if (!GetConsoleVTHandles(hIn, hOut, hErr))
    return -1;

  BYTE buffer[8191]{ 0 };     // buffer for the chunk read
  DWORD dwRead{};            // length of a chunk read
  bool fSuccess{};
  int iterations{};

  // process the data of the standard input in a loop
  while (true)
  {
    // try to read data from standard input
    fSuccess = !!ReadFile(hIn, buffer, std::extent<decltype(buffer)>::value, &dwRead, nullptr);

    if (dwRead == 0)
      break;
    else if (!fSuccess)
      return -1;

    DWORD dwWritten{};

    //fSuccess = !!WriteConsoleOutputCharacterA(hOut, (char*)buffer, dwRead, COORD{ 0, 10 }, &dwWritten);
    //SetConsoleCursorPosition(hOut, COORD{ 0, 35 });

    //fSuccess = !!WriteConsoleA(hOut, buffer, dwRead, &dwWritten, nullptr);

    fSuccess = !!WriteFile(hOut, buffer, dwRead, &dwWritten, nullptr);

    if (!fSuccess)
      return -1;

    ++iterations;
  }

  return iterations;
}

It just reads from standard input and uses WriteConsoleOutputCharacterA, WriteConsoleA, or WriteFile to write to standard output (I compiled 3 versions). The tool returns the number of write operations.
Note that I used a buffer of 8191 bytes while the test pattern has a size of only 6642 bytes. I tested using the default conhost (no updated functions involved).


image
This screenshot shows that WriteConsoleOutputCharacterA can't be the reason for the faulty output. It is not even able to process Line Feed characters to wrap the lines. VT escape sequences are ignored, too. (I still believe that the function should be updated for UTF-8 partials though.)
The output was written in 2 steps. No idea whether CMD forwards the stream in two steps, whether the involved pipe can't pass all of the 6642 bytes at once, or whether ReadFile has a limitation.


image
WriteConsoleA successfully writes the buffer read. Still with BOM as expected.


image
WriteFile behaves the same.


@miniksa My conclusion is that CMD.exe detects the file type (likely using GetFileType). If the type is FILE_TYPE_CHAR then it internally converts UTF-8 to UTF-16 and uses one of the ...W functions to write the output. Unless you have any further idea I'm afraid you can close this issue without having a solution, since CMD.exe is out of scope and as you guys explained several times you won't update it anymore.

However, apparently there are some functions that should be updated even if these updates are not able to resolve the CMD.exe issues. Any advice on how to proceed?

@miniksa
Copy link
Member

miniksa commented Jul 5, 2019

CMD.exe is an evil beast 👹

Yes. Very yes.

@miniksa My conclusion is that CMD.exe detects the file type (likely using GetFileType). If the type is FILE_TYPE_CHAR, it internally converts UTF-8 to UTF-16 and uses one of the ...W functions to write the output. Unless you have any further ideas, I'm afraid you can close this issue without a solution: CMD.exe is out of scope, and as you guys explained several times, you won't update it anymore.

I wouldn't be surprised if it was converting. I can try to spend like 5 minutes on checking if this is true by reading the internal source for CMD.exe on Monday when I get back into work, but it's not worth spending much time on because we can't really fix anything in CMD without breaking the world.

However, apparently there are some functions that should be updated even if these updates are not able to resolve the CMD.exe issues. Any advice on how to proceed?

Start a PR anyway with the proposed fixes to the other functions referencing this issue. It's OK if we only fix part of it.

@german-one
Copy link
Contributor

I can try to spend like 5 minutes on checking if this is true

I didn't want to ask that of you; I know you have limited time for things like this. However, I think it's necessary to make sure my assumptions are correct before closing this issue.
Unfortunately I can't just disassemble CMD.exe to figure it out, because I don't read assembly. But I can check whether the expected functions are dynamically linked:
image
And they are ... 😉

Start a PR anyway with the proposed fixes to the other functions referencing this issue.

Okay I will spend some hours on the weekend to prepare it and do some tests.
Actually only two functions are left.

  1. ConhostConnection::_OutputThread()
    I think I'll leave my current fix including BOM removal because it seems to me that the function reads from the beginning of the stream.

  2. ApiRoutines::WriteConsoleOutputCharacterAImpl
    Here I'll try to take advantage of Utf8ToWideCharParser. Do you think I should include the DBCS handling like in WriteConsoleAImpl? I'm afraid I can't test it properly, though 😟

@german-one
Copy link
Contributor

@miniksa
I'll start a PR for _OutputThread only. WriteConsoleOutputCharacterAImpl needs more consideration and agreement with you. To avoid further scope creep in this thread, I'll open a new issue for it instead.

@PhMajerus
Copy link

FWIW the misaligned text ".majerus.net/" is not related to my update. Probably another issue.

That misaligned text ".majerus.net/" is just a terminal hyperlink on the ANSI logo, using an ESC ] 8 control sequence. Conhost and Terminal both properly ignore it, as they don't support hyperlinks yet.
It's probably related to the same issue: the UTF-8 encoding of some C0 control character gets broken, making the escape sequence impossible to parse properly, so parts of it get shown instead of ignored.

@german-one
Copy link
Contributor

german-one commented Jul 8, 2019

Conhost and Terminal both properly ignore it

Kind of. It seems the terminal does not entirely ignore it. I believe it's a matter of window updating: as soon as I resize the window (e.g. from maximized to normal), the remnant disappears miraculously.

@eryksun
Copy link

eryksun commented Dec 30, 2019

But CMD.exe doesn't usually do any of that. It's probably using whatever codepage the system is in and writing everything into the A API without thinking about it.

CMD only uses the console's wide-character API for console I/O, at least back to NT 4 in 1996, and probably all the way to NT 3.1. It does use the current console output codepage, but only indirectly. It uses it by default for encoding output written to disk files and pipes, and for decoding input read from files such as a line from a batch script or a line from the pipe in a for /f loop.

@zadjii-msft
Copy link
Member

I'm closing this since it was largely fixed by #1850, and the follow-up discussion is tracked in #1851. Thanks again everyone!

@ghost ghost added the Needs-Tag-Fix Doesn't match tag requirements label Jan 4, 2022
@german-one
Copy link
Contributor

People who are still coming across this topic: the initial post describes behavior that is almost certainly a bug in cmd.exe. It seems to convert the UTF-8 chunks it reads even when they split code points apart, and consequently writes broken UTF-16. No terminal application can repair that afterwards. Thus, a fix for #1851 won't ever fix this.
