WriteConsoleOutputCharacterA doesn't merge UTF-8 partials in successive calls #1851

german-one · 2019-07-06T21:55:41Z

Environment

Windows build number: [run "ver" at a command prompt]
Microsoft Windows [Version 10.0.18362.207]

Steps to reproduce

If a UTF-8 stream gets buffered in a loop, characters that consume more than one byte may get split at the buffer boundaries. Passing the buffer to WriteConsoleOutputCharacterA will corrupt the text because a conversion to UTF-16 is in place where these partials are treated as invalid UTF-8 characters and replaced with with U+FFFD characters.

Expected behavior

WriteConsoleOutputCharacterA should cache the partials and prepend them to the characters passed at the next call of this function, similar to the behavior of WriteConsoleA.

Actual behavior

UTF-8 partials result in corrupted text.
A discussion about this already began in #386 but was rather out of scope in this issue.

The text was updated successfully, but these errors were encountered:

german-one · 2019-07-06T22:22:57Z

@miniksa
This is continuing #386.

I was able to include the Utf8ToWideCharParser class into ApiRoutines::WriteConsoleOutputCharacterAImpl. This resolves the problem of merging UTF-8 partials but raises another issue. I'm struggling with the used parameter. I won't start a PR unless we have the same understanding of what number has to be assigned to it. My current understanding is that it is the number of bytes consumed from chars. But does it include the number of bytes of the partials (either from the previous or from the current cache)? And what's the benefit of getting this information as the caller of this function? In my tests I tried to calculate the coordinates of the end of the previously written chunk in order to continue writing at this position. This doesn't work at all if I try to use the bytes written. I would rather need the number of character cells written. Would it be reasonable to use the number of characters created that I get from Utf8ToWideCharParser::Parse (likewise the characters written by WriteConsoleOutputCharacterWImpl)?

j4james · 2020-02-20T01:08:12Z

The discussion in #386 seemed to suggest that this problem had been around since Windows 7, but what I'm seeing definitely looks like a recent regression, so I'm not sure if it's the same issue. In my current Windows version (10.0.18362.535), the following sequences both output a smiling face when executed in a bash shell.

printf "\xE2\x98\xBA\n"
printf "\xE2"; printf "\x98\xBA\n"

However, when I start a conhost shell built from the latest source (commit 39d3c65), the second sequence fails to decode the UTF-8 correctly, and three error glyphs are output instead.

Screenshots of the old and new consoles...

german-one · 2020-02-20T20:32:38Z

Not sure what functions are invoked under the hood if you call printf in WSL. Although I'm pretty sure that the behavior shouldn't have changed if it was WriteConsoleOutputCharacterAImpl. The ConvertToW function is called there which converts the passed string to UTF-16 without caching UTF-8 partials. That's the reason why I once filed this issue.

However, issue #4086 has been fixed with PR #4422 which indicates that WriteConsoleAImpl is involved. Not sure if I did something wrong when I updated this function.

j4james · 2020-02-20T23:42:15Z

I'm afraid it does look like PR #4422 is to blame, at least for my particular test case. It works in commit 0d92f71, but fails after #4422 is applied in commit 06b3931.

It seems that the old Utf8ToWideCharParser::Parse method does actually detect \xE2 as a partial (and then successfully combines it with the following two bytes when they later arrive), but the new u8u16state code doesn't. The utf8 conversion method just returns the string as if it were a complete set of utf8 code points, leaving _partialsLen as zero. It's then passed on to the MultiByteToWideChar API where it gets converted into an error glyph.

german-one · 2020-02-21T00:32:42Z

I updated the unit test with this sequences and it failed. So it's definitely reproduceable. I'll file an issue referring to your comment here. And I'll fix it as soon as I can. Thanks for letting me know!

german-one · 2022-01-15T17:58:00Z

@zadjii-msft
I'm struggling to close this myself because it is an accepted issues and added to a milestone. However, meanwhile I'm not sure about it anymore.

WriteConsoleOutputCharacterA is no longer a part of the ecosystem roadmap, as remarked in the docs.
WriteConsoleOutputCharacterA does not and did never respect character boundaries of DBCSs.

The docs also read

"Copies a number of characters to consecutive cells of a console screen buffer, ..."

What is the meaning of "characters" here? (Is it actually "the representation of code points"? Or is it rather "values of a C/C++ character type"?)
What separates WriteConsoleOutputCharacterA from WriteConsoleA? (There might be a reason why they behave differently.)
How inconsistent is it to fix UTF-8 handling without fixing DBCS handling?
Do we break code if we try to fix it?

Consider to use the answers to these questions as reasons to close this one out.

Steffen

miniksa · 2022-03-11T18:34:53Z

What is the meaning of "characters" here? (Is it actually "the representation of code points"? Or is it rather "values of a C/C++ character type"?)

In the context of WriteConsoleOutputCharacterA, it means one char into one cell of the buffer. Does it make wonderful sense? No. But it is the way that it is when we inherited it.

What separates WriteConsoleOutputCharacterA from WriteConsoleA? (There might be a reason why they behave differently.)

WriteConsoleOutputCharacterA is just going to modify the character without modifying the color attributes in the cell. It will insert pretty much exactly what you give it exactly into the cell you tell it to.
WriteConsoleA will run normal character handling against it as if it were otherwise streamed in including often processing backspaces, newlines, etc. and applying the active/default color attributes to whatever you're inserting.

How inconsistent is it to fix UTF-8 handling without fixing DBCS handling?

It's probably fine honestly to fix the UTF-8 handling as a separate path. We did that in a few places... there's 3 states... W, A, and A when 65001 (UTF-8) is set. That way we can maintain the nice and broken A state for compatibility reasons and fix up the UTF-8 that we actually care about independently.

Do we break code if we try to fix it?

Oh probably.

Consider to use the answers to these questions as reasons to close this one out.

I'll let it hang around for now, but don't feel obligated to further bump it. If it's on 22H2 we're going to keep looking at it ourselves.

ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Jul 6, 2019

ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Jul 8, 2019

DHowett-MSFT added this to the 20H1 milestone Jul 8, 2019

DHowett-MSFT modified the milestones: 20H1, 21H1 Aug 29, 2019

german-one mentioned this issue Feb 21, 2020

Recognizing of UTF-8 partials in til :: u8u16 may fail #4673

Closed

zadjii-msft modified the milestones: Windows vNext, 22H2 Jan 4, 2022

zadjii-msft mentioned this issue Jan 4, 2022

Bad characters occasionally displayed when writing lots of identical UTF-8 lines #386

Closed

zadjii-msft modified the milestones: 22H2, Backlog Jul 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WriteConsoleOutputCharacterA doesn't merge UTF-8 partials in successive calls #1851

WriteConsoleOutputCharacterA doesn't merge UTF-8 partials in successive calls #1851

german-one commented Jul 6, 2019

german-one commented Jul 6, 2019 •

edited

Loading

j4james commented Feb 20, 2020

german-one commented Feb 20, 2020

j4james commented Feb 20, 2020

german-one commented Feb 21, 2020

german-one commented Jan 15, 2022

miniksa commented Mar 11, 2022 •

edited

Loading

WriteConsoleOutputCharacterA doesn't merge UTF-8 partials in successive calls #1851

WriteConsoleOutputCharacterA doesn't merge UTF-8 partials in successive calls #1851

Comments

german-one commented Jul 6, 2019

Environment

Steps to reproduce

Expected behavior

Actual behavior

german-one commented Jul 6, 2019 • edited Loading

j4james commented Feb 20, 2020

german-one commented Feb 20, 2020

j4james commented Feb 20, 2020

german-one commented Feb 21, 2020

german-one commented Jan 15, 2022

miniksa commented Mar 11, 2022 • edited Loading

german-one commented Jul 6, 2019 •

edited

Loading

miniksa commented Mar 11, 2022 •

edited

Loading