Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WriteConsoleOutputCharacterA doesn't merge UTF-8 partials in successive calls #1851

Open
german-one opened this issue Jul 6, 2019 · 7 comments
Labels
Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Product-Conhost For issues in the Console codebase
Milestone

Comments

@german-one
Copy link
Contributor

Environment

Windows build number: [run "ver" at a command prompt]
Microsoft Windows [Version 10.0.18362.207]

Steps to reproduce

If a UTF-8 stream gets buffered in a loop, characters that consume more than one byte may get split at the buffer boundaries. Passing the buffer to WriteConsoleOutputCharacterA will corrupt the text because a conversion to UTF-16 is in place where these partials are treated as invalid UTF-8 characters and replaced with with U+FFFD characters.

Expected behavior

WriteConsoleOutputCharacterA should cache the partials and prepend them to the characters passed at the next call of this function, similar to the behavior of WriteConsoleA.

Actual behavior

UTF-8 partials result in corrupted text.
A discussion about this already began in #386 but was rather out of scope in this issue.

@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Jul 6, 2019
@german-one
Copy link
Contributor Author

german-one commented Jul 6, 2019

@miniksa
This is continuing #386.

I was able to include the Utf8ToWideCharParser class into ApiRoutines::WriteConsoleOutputCharacterAImpl. This resolves the problem of merging UTF-8 partials but raises another issue. I'm struggling with the used parameter. I won't start a PR unless we have the same understanding of what number has to be assigned to it. My current understanding is that it is the number of bytes consumed from chars. But does it include the number of bytes of the partials (either from the previous or from the current cache)? And what's the benefit of getting this information as the caller of this function? In my tests I tried to calculate the coordinates of the end of the previously written chunk in order to continue writing at this position. This doesn't work at all if I try to use the bytes written. I would rather need the number of character cells written. Would it be reasonable to use the number of characters created that I get from Utf8ToWideCharParser::Parse (likewise the characters written by WriteConsoleOutputCharacterWImpl)?

@DHowett-MSFT DHowett-MSFT added Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Product-Conhost For issues in the Console codebase and removed Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels Jul 8, 2019
@ghost ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Jul 8, 2019
@DHowett-MSFT DHowett-MSFT added this to the 20H1 milestone Jul 8, 2019
@DHowett-MSFT DHowett-MSFT modified the milestones: 20H1, 21H1 Aug 29, 2019
@j4james
Copy link
Collaborator

j4james commented Feb 20, 2020

The discussion in #386 seemed to suggest that this problem had been around since Windows 7, but what I'm seeing definitely looks like a recent regression, so I'm not sure if it's the same issue. In my current Windows version (10.0.18362.535), the following sequences both output a smiling face when executed in a bash shell.

printf "\xE2\x98\xBA\n"
printf "\xE2"; printf "\x98\xBA\n"

However, when I start a conhost shell built from the latest source (commit 39d3c65), the second sequence fails to decode the UTF-8 correctly, and three error glyphs are output instead.

Screenshots of the old and new consoles...

image

@german-one
Copy link
Contributor Author

Not sure what functions are invoked under the hood if you call printf in WSL. Although I'm pretty sure that the behavior shouldn't have changed if it was WriteConsoleOutputCharacterAImpl. The ConvertToW function is called there which converts the passed string to UTF-16 without caching UTF-8 partials. That's the reason why I once filed this issue.

However, issue #4086 has been fixed with PR #4422 which indicates that WriteConsoleAImpl is involved. Not sure if I did something wrong when I updated this function.

@j4james
Copy link
Collaborator

j4james commented Feb 20, 2020

I'm afraid it does look like PR #4422 is to blame, at least for my particular test case. It works in commit 0d92f71, but fails after #4422 is applied in commit 06b3931.

It seems that the old Utf8ToWideCharParser::Parse method does actually detect \xE2 as a partial (and then successfully combines it with the following two bytes when they later arrive), but the new u8u16state code doesn't. The utf8 conversion method just returns the string as if it were a complete set of utf8 code points, leaving _partialsLen as zero. It's then passed on to the MultiByteToWideChar API where it gets converted into an error glyph.

@german-one
Copy link
Contributor Author

I updated the unit test with this sequences and it failed. So it's definitely reproduceable. I'll file an issue referring to your comment here. And I'll fix it as soon as I can. Thanks for letting me know!

@german-one
Copy link
Contributor Author

@zadjii-msft
I'm struggling to close this myself because it is an accepted issues and added to a milestone. However, meanwhile I'm not sure about it anymore.

WriteConsoleOutputCharacterA is no longer a part of the ecosystem roadmap, as remarked in the docs.
WriteConsoleOutputCharacterA does not and did never respect character boundaries of DBCSs.

The docs also read

"Copies a number of characters to consecutive cells of a console screen buffer, ..."

  • What is the meaning of "characters" here? (Is it actually "the representation of code points"? Or is it rather "values of a C/C++ character type"?)
  • What separates WriteConsoleOutputCharacterA from WriteConsoleA? (There might be a reason why they behave differently.)
  • How inconsistent is it to fix UTF-8 handling without fixing DBCS handling?
  • Do we break code if we try to fix it?

Consider to use the answers to these questions as reasons to close this one out.

Steffen

@miniksa
Copy link
Member

miniksa commented Mar 11, 2022

  • What is the meaning of "characters" here? (Is it actually "the representation of code points"? Or is it rather "values of a C/C++ character type"?)

In the context of WriteConsoleOutputCharacterA, it means one char into one cell of the buffer. Does it make wonderful sense? No. But it is the way that it is when we inherited it.

  • What separates WriteConsoleOutputCharacterA from WriteConsoleA? (There might be a reason why they behave differently.)

WriteConsoleOutputCharacterA is just going to modify the character without modifying the color attributes in the cell. It will insert pretty much exactly what you give it exactly into the cell you tell it to.
WriteConsoleA will run normal character handling against it as if it were otherwise streamed in including often processing backspaces, newlines, etc. and applying the active/default color attributes to whatever you're inserting.

  • How inconsistent is it to fix UTF-8 handling without fixing DBCS handling?

It's probably fine honestly to fix the UTF-8 handling as a separate path. We did that in a few places... there's 3 states... W, A, and A when 65001 (UTF-8) is set. That way we can maintain the nice and broken A state for compatibility reasons and fix up the UTF-8 that we actually care about independently.

  • Do we break code if we try to fix it?

Oh probably.

Consider to use the answers to these questions as reasons to close this one out.

I'll let it hang around for now, but don't feel obligated to further bump it. If it's on 22H2 we're going to keep looking at it ourselves.

@zadjii-msft zadjii-msft modified the milestones: 22H2, Backlog Jul 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Product-Conhost For issues in the Console codebase
Projects
None yet
Development

No branches or pull requests

5 participants