Try to fix unicode layout issue #3578

skyline75489 · 2019-11-14T11:11:44Z

Summary of the Pull Request

This could be the first in a series of PRs intended to fix the Unicode layout issue.

References

#3546

PR Checklist

Closes #xxx
CLA signed. If not, go over here and sign the CLA
Tests added/passed
Requires documentation to be updated
I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

Detailed Description of the Pull Request / Additional comments

Contrary to what @reli-msft emphasizes in #3546, I found that handling zero-width Unicode codepoints is the key here.

I searched a little and found https://gitlab.freedesktop.org/terminal-wg/specifications/issues/9, which refers to https://github.com/ridiculousfish/widecharwidth. Then I was like, oh, so there's a bunch of Unicode codepoints that WT doesn't currently include.

Handling zero-width Unicode characters makes a lot of things weird. But here's what I've got so far:

Would love to hear your opinions on this.

Validation Steps Performed

skyline75489 · 2019-11-14T14:12:14Z

Everything except Korean is good now:

I couldn't get Hangul working because DWrite treats the characters as if they had single-cell width. Maybe I missed something.

reli-msft · 2019-11-15T01:46:28Z

I think we need to do something at the cell-allocation level. When we write a string into the buffer, we need to:

Properly break the string into clusters (taking combining marks, ZWJ, variation selectors, etc. into account) and allocate a cell (1 or 2 columns) for each of them;
Categorize the clusters and decide:
- Whether they are “adhesive”, i.e., whether they should be shaped together with adjacent clusters of the same category;
- What to do if their actual width is wider than the desired width (downscale/center/align left/align right).

Then, when performing text layout, group adhesive clusters and shape them together, and perform cluster-to-column alignment according to the cluster category.
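The cluster-breaking step described above could be sketched roughly as follows. This is only an illustration under simplified assumptions: the `Cluster` struct, `isExtender`, `isWide`, and the code point ranges are hypothetical stand-ins, and real segmentation would follow the Unicode grapheme cluster rules rather than these few ranges.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical cluster record: a slice of the input plus the columns allocated for it.
struct Cluster
{
    std::size_t start;  // index of the first code point
    std::size_t length; // number of code points in the cluster
    int columns;        // 1 or 2 cells
};

constexpr char32_t ZWJ = 0x200D;
constexpr char32_t VS16 = 0xFE0F;

// Assumption: only these code points extend the previous cluster.
constexpr bool isExtender(char32_t cp)
{
    return (cp >= 0x0300 && cp <= 0x036F) || // combining diacritical marks
           (cp >= 0xFE00 && cp <= 0xFE0F) || // variation selectors VS1..VS16
           cp == ZWJ;
}

// Assumption: a tiny stand-in for a real East Asian Width table.
constexpr bool isWide(char32_t cp)
{
    return (cp >= 0x4E00 && cp <= 0x9FFF) ||  // CJK Unified Ideographs
           (cp >= 0xAC00 && cp <= 0xD7A3) ||  // Hangul syllables
           (cp >= 0x1F300 && cp <= 0x1FAFF);  // emoji blocks (approximate)
}

std::vector<Cluster> breakIntoClusters(const std::u32string& text)
{
    std::vector<Cluster> clusters;
    bool joinNext = false; // true right after a ZWJ
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const char32_t cp = text[i];
        if (!clusters.empty() && (isExtender(cp) || joinNext))
        {
            // Zero-width code points and ZWJ-joined code points share the previous cell.
            Cluster& last = clusters.back();
            ++last.length;
            // VS16 requests emoji presentation, treated as two columns in this sketch.
            if (cp == VS16 || isWide(cp))
            {
                last.columns = 2;
            }
        }
        else
        {
            clusters.push_back({ i, 1, isWide(cp) ? 2 : 1 });
        }
        joinNext = (cp == ZWJ);
    }
    return clusters;
}
```

With this shape, an iterator can advance by `columns` per cluster, and zero-width code points never need a cell of their own.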

Category flags:

enum ClusterCategory {
    // Script category (low 8 bits)
    ScriptCategoryMask   = 0xFF,
    // Non-adhesive scripts
    ScriptUnknown        = 0x00,
    ScriptIdeographs     = 0x01,
    ScriptEmoji          = 0x02,
    ScriptGeometric      = 0x03,
    ScriptNexusGeometric = 0x04, // Box drawing, Block Element, Powerline, etc.
    // Add more items when necessary
    // Adhesive flag (high bit of the script category)
    Adhesive             = 0x80,
    // Adhesive scripts
    ScriptWestern        = 0x80,
    // Add more items when necessary

    // Alignment
    AlignCenter          = 0,
    AlignScaleDown       = 0x200,
    AlignLeft            = 0x400,
    AlignRight           = 0x600,
    AlignmentMask        = 0x600,

    // Width (its own bit, outside AlignmentMask)
    DoubleWidth          = 0x800
};
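Flags like the ones above would typically be read back through masks. The following is a minimal sketch, assuming a bit layout in which the width bit sits outside the alignment mask; the enum values and the helper names (`scriptOf`, `alignmentOf`, `isAdhesive`, `columnsOf`, `shapeTogether`) are hypothetical and only illustrate the idea.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit layout (an assumption of this sketch):
//   bits 0-7   script category, with bit 7 marking "adhesive" scripts
//   bits 9-10  alignment
//   bit 11     double width
enum ClusterFlags : std::uint32_t
{
    ScriptCategoryMask = 0x0FF,
    ScriptIdeographs   = 0x001,
    Adhesive           = 0x080,
    ScriptWestern      = 0x080,
    AlignScaleDown     = 0x200,
    AlignLeft          = 0x400,
    AlignmentMask      = 0x600,
    DoubleWidth        = 0x800,
};

constexpr std::uint32_t scriptOf(std::uint32_t flags) { return flags & ScriptCategoryMask; }
constexpr std::uint32_t alignmentOf(std::uint32_t flags) { return flags & AlignmentMask; }
constexpr bool isAdhesive(std::uint32_t flags) { return (flags & Adhesive) != 0; }
constexpr int columnsOf(std::uint32_t flags) { return (flags & DoubleWidth) != 0 ? 2 : 1; }

// Two neighboring clusters are shaped as one run only if both are adhesive
// and belong to the same script category.
constexpr bool shapeTogether(std::uint32_t a, std::uint32_t b)
{
    return isAdhesive(a) && isAdhesive(b) && scriptOf(a) == scriptOf(b);
}
```

Keeping each field in its own bit range means a cluster's script, alignment, and width can be combined with `|` and queried independently.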

skyline75489 · 2019-11-15T02:02:53Z

@reli-msft Yeah, I think you are right. Without a cluster category, some of the code just doesn't make sense. For example, in order to handle the zero-width VS16, the text cell iterator moves like this:

it += columnCount > 0 ? columnCount : 2;

If columnCount is zero, you need to move two cells ahead, because VS16 takes up space in the string, just like ordinary CJK characters. That's just like "what the hell". The code can't explain itself, even with comments.

To thoroughly handle all kinds of Unicode things, cluster category classification is a must.

skyline75489 · 2019-11-15T02:04:54Z

src/types/CodepointWidthDetector.cpp

@@ -75,7 +75,7 @@ namespace
        UnicodeRange{ 0x2d8, 0x2db, CodepointWidth::Ambiguous },
        UnicodeRange{ 0x2dd, 0x2dd, CodepointWidth::Ambiguous },
        UnicodeRange{ 0x2df, 0x2df, CodepointWidth::Ambiguous },
-        UnicodeRange{ 0x300, 0x36f, CodepointWidth::Ambiguous },
+        UnicodeRange{ 0x300, 0x36f, CodepointWidth::Combining },


This is definitely wrong since it's generated. Just showing the basic idea.
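To make the idea behind the diff concrete, here is a minimal sketch of a range-table lookup that has a zero-width `Combining` category. The `UnicodeRange` and `CodepointWidth` names mirror the diff, but the table excerpt, `lookupWidth`, `columnsFor`, and the choice to treat ambiguous code points as one column are assumptions made for illustration, not the actual contents of CodepointWidthDetector.cpp.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

enum class CodepointWidth { Narrow, Wide, Ambiguous, Combining };

struct UnicodeRange
{
    char32_t lower;
    char32_t upper;
    CodepointWidth width;
};

// Sorted, non-overlapping excerpt. The last entry is a hypothetical addition for the demo.
constexpr std::array<UnicodeRange, 5> kRanges{ {
    { 0x2d8, 0x2db, CodepointWidth::Ambiguous },
    { 0x2dd, 0x2dd, CodepointWidth::Ambiguous },
    { 0x2df, 0x2df, CodepointWidth::Ambiguous },
    { 0x300, 0x36f, CodepointWidth::Combining },
    { 0x4e00, 0x9fff, CodepointWidth::Wide },
} };

CodepointWidth lookupWidth(char32_t cp)
{
    // Binary search for the first range whose upper bound is not below cp.
    const auto it = std::lower_bound(kRanges.begin(), kRanges.end(), cp,
                                     [](const UnicodeRange& range, char32_t value) {
                                         return range.upper < value;
                                     });
    if (it != kRanges.end() && cp >= it->lower)
    {
        return it->width;
    }
    return CodepointWidth::Narrow; // anything not listed is treated as narrow
}

// Assumption: ambiguous code points take one column; combining marks take none.
int columnsFor(CodepointWidth width)
{
    switch (width)
    {
    case CodepointWidth::Combining:
        return 0;
    case CodepointWidth::Wide:
        return 2;
    default:
        return 1;
    }
}
```

The point of the extra category is that callers can ask for a column count of 0, 1, or 2 instead of a wide/not-wide boolean.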

reli-msft · 2019-11-15T02:08:55Z

@skyline75489 I plan to have a meeting with the Console people to find a proper solution for this.
Since Windows Terminal does not care too much about compatibility, it is time to do the right thing 😎.

skyline75489 · 2019-11-15T02:10:30Z

@reli-msft You made my day, bro. Go get them.

skyline75489 · 2019-11-15T02:11:45Z

@miniksa Come to think of it, maybe #3458 should be merged first. Right now, cluster handling code is scattered everywhere. #3458 at least gives us a chance to handle cluster movement in one place, inside RenderClusterIterator.

DHowett-MSFT · 2019-11-15T02:35:36Z

> Since Windows Terminal does not care too much about compatibility, it is time to do the right thing 😎.

Well, now, hold on. This is a good sentiment and all that, but we still need to figure out how to deal with applications that read back the contents of the screen and have their understanding of their codepoint-to-cell mapping violated. This isn't just a problem somebody can come in and solve overnight: Terminal still needs to act compatibly for legacy win32 console applications.

There's already #1472 tracking lighting up ZWJ/ZWNJ and any rendering and buffer concerns where N codepoints map to M cells. This is work we know needs to be done; I just don't have the folkpower to staff it. 😄

Let it not be understated that we want to do the right thing here. It's just that we have to do the right thing in the context of 35 years' worth of application support code and the fact that conhost (which also consumes this codepoint width detector, text buffer and text renderer) is the API server for all text-mode applications on every SKU of Windows everywhere.

DHowett-MSFT · 2019-11-15T02:36:40Z

To wit: WT can break UI paradigm compatibility and some level of application compatibility, but the code it shares with conhost must uphold the same compatibility guarantees as conhost does.

skyline75489 · 2019-11-15T02:43:38Z

@DHowett-MSFT Easy. If I keep doing open source stuff, the company I work at will soon go bankrupt and then I can work for MSFT 😎. Everyone is happy.

Glad to know there's #1472. Gonna subscribe to it right now.

DHowett-MSFT · 2019-11-15T02:44:28Z

LOL. 😁

reli-msft · 2019-11-16T01:48:10Z

> I couldn't get Hangul working because DWrite treats the characters as if they had single-cell width. Maybe I missed something.

Hangul is trickier: not all fonts support composite Hangul. I suggest we isolate the Jamos, or do NFKC-like composition...
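For reference, the composition option could look something like the sketch below. It uses the arithmetic Hangul syllable composition defined by the Unicode Standard (leading consonant + vowel + optional trailing consonant), but it is deliberately narrower than full normalization: it only combines conjoining Jamo sequences and leaves everything else untouched. The function and helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Constants from the Unicode Standard's Hangul syllable composition algorithm.
constexpr char32_t SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7;
constexpr char32_t LCount = 19, VCount = 21, TCount = 28;

constexpr bool isLeading(char32_t cp) { return cp >= LBase && cp < LBase + LCount; }
constexpr bool isVowel(char32_t cp) { return cp >= VBase && cp < VBase + VCount; }
// TBase itself means "no trailing consonant", so valid trailing Jamo start at TBase + 1.
constexpr bool isTrailing(char32_t cp) { return cp > TBase && cp < TBase + TCount; }

// Composes L+V and L+V+T Jamo sequences into precomposed syllables.
std::u32string composeJamo(const std::u32string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        const char32_t cp = in[i];
        if (isLeading(cp) && i + 1 < in.size() && isVowel(in[i + 1]))
        {
            char32_t syllable = SBase + ((cp - LBase) * VCount + (in[i + 1] - VBase)) * TCount;
            ++i; // consume the vowel
            if (i + 1 < in.size() && isTrailing(in[i + 1]))
            {
                syllable += in[i + 1] - TBase;
                ++i; // consume the trailing consonant
            }
            out.push_back(syllable);
        }
        else
        {
            out.push_back(cp);
        }
    }
    return out;
}
```

After composition, each syllable is a single code point in the Hangul Syllables block, which a width table can treat as a plain two-column character.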

## Summary of the Pull Request

This change tries to fix the column size calculation when shaping returns glyphs that represent multiple characters (e.g. ligatures).

## References

This should fix #696.

## PR Checklist

* [ ] Closes #xxx
* [X] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA
* [ ] Tests added/passed
* [ ] Requires documentation to be updated
* [ ] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

## Detailed Description of the Pull Request / Additional comments

Currently, it seems like CustomTextLayout::_CorrectGlyphRun generally assumes that glyphs and characters have a 1:1 mapping. That holds for most trivial scenarios with basic western scripts, and for many, but unfortunately not all, monospace "programming" fonts with programming ligatures.

This change makes Terminal correctly process glyphs that represent multiple characters by accumulating the column counts of all these characters together (which I believe is closer to what this code originally intended to do).

There are still many issues in both CustomTextLayout and the TextBuffer, and the correct solution to them will likely demand large-scale changes, at least at the scale of #3578. I hope small changes like this can serve as a stopgap solution while we take our time to work on the long-term right thing.

## Validation Steps Performed

Builds and runs. Manual testing confirmed that it solves #696 with both LigConsalata and Fixedsys Excelsior.

(cherry picked from commit 027f122)
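The accumulation described in that change could be sketched as follows. This is not the actual `_CorrectGlyphRun` code: it assumes a DirectWrite-style cluster map in which entry `i` holds the index of the glyph that character `i` belongs to, and `columnsPerGlyph` and its parameters are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums the column counts of all characters that map to the same glyph.
// clusterMap[i] is the glyph index for character i (assumed layout).
std::vector<int> columnsPerGlyph(const std::vector<std::uint16_t>& clusterMap,
                                 const std::vector<int>& columnsPerChar,
                                 std::size_t glyphCount)
{
    assert(clusterMap.size() == columnsPerChar.size());
    std::vector<int> columns(glyphCount, 0);
    for (std::size_t i = 0; i < clusterMap.size(); ++i)
    {
        // at() throws std::out_of_range if the map points past the glyph array.
        columns.at(clusterMap[i]) += columnsPerChar[i];
    }
    return columns;
}
```

For the text `a->b` shaped with a `->` ligature, the cluster map would be `{0, 1, 1, 2}`; with one column per character, the ligature glyph ends up with two columns instead of one.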

This is a subset of #3578 which I think is harmless and the first step towards making things right.

References: #3546 #3578

## Detailed Description of the Pull Request / Additional comments

For more robust Unicode support, `CodepointWidthDetector` should provide concrete width information rather than a simple boolean `IsWide`. Currently only `IsWide` is widely used and optimized, using a quick lookup table and a fallback cache. This PR moves those optimizations into `GetWidth`.

## Validation Steps Performed

The API remains unchanged. Things are not broken.
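A rough sketch of that arrangement, with the quick table and fallback cache living in the width query and `IsWide` reduced to a thin wrapper, might look like this. The class name, the `char32_t` signatures, the `Width` enum, and the two ranges in the slow path are all assumptions for illustration and do not reflect the real `CodepointWidthDetector` interface.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <unordered_map>

enum class Width { Narrow, Wide, Combining };

class WidthDetectorSketch
{
public:
    Width GetWidth(char32_t cp)
    {
        // Fast path: direct table lookup for ASCII.
        if (cp < _quick.size())
        {
            return _quick[cp];
        }
        // Second chance: results of earlier slow lookups.
        const auto hit = _cache.find(cp);
        if (hit != _cache.end())
        {
            return hit->second;
        }
        const Width width = _slowLookup(cp);
        _cache.emplace(cp, width);
        return width;
    }

    // IsWide becomes a wrapper, so it benefits from the same optimizations.
    bool IsWide(char32_t cp) { return GetWidth(cp) == Width::Wide; }

    std::size_t slowLookups() const { return _slowLookups; }

private:
    // Stand-in for a full range-table search (hypothetical ranges).
    Width _slowLookup(char32_t cp)
    {
        ++_slowLookups;
        if (cp >= 0x300 && cp <= 0x36F) return Width::Combining;
        if (cp >= 0x4E00 && cp <= 0x9FFF) return Width::Wide;
        return Width::Narrow;
    }

    std::array<Width, 0x80> _quick{}; // value-initialized: every ASCII entry is Narrow
    std::unordered_map<char32_t, Width> _cache;
    std::size_t _slowLookups = 0;
};
```

Repeated queries for the same non-ASCII code point hit the cache, so the slow path runs once per distinct code point.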

skyline75489 added 2 commits November 14, 2019 19:05

Try to fix unicode layout issue

a62e7ce

That's better

68a1e17

zadjii-msft requested a review from miniksa November 14, 2019 17:41

Run cluseter

96ea811

skyline75489 mentioned this pull request Nov 16, 2019

The text layout code is expecting the quantity of glyphs is equal to the cells allocated for each line #3546

Closed

skyline75489 mentioned this pull request Nov 27, 2019

Make CodepointWidthDetector::GetWidth faster #3727

Merged

milizhang mentioned this pull request Dec 30, 2019

Fix column count issues with certain ligature. #4081

Merged

dboytherealest1000 approved these changes Mar 13, 2020

skyline75489 mentioned this pull request Apr 18, 2020

Introduce UnicodeAttribute for better Unicode support #5407

Closed

skyline75489 closed this Apr 18, 2020

skyline75489 deleted the fix/unicode-lay-thehell-out branch February 9, 2021 06:10

skyline75489 mentioned this pull request Jun 19, 2021

Add a DxRenderer based on a glyph atlas #10461

Closed
