Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try to fix unicode layout issue #3578

Closed

Conversation

skyline75489
Copy link
Collaborator

Summary of the Pull Request

This could be the first of a series PRs intended to fix unicode layout issue.

References

#3546

PR Checklist

  • Closes #xxx
  • CLA signed. If not, go over here and sign the CLA
  • Tests added/passed
  • Requires documentation to be updated
  • I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

Detailed Description of the Pull Request / Additional comments

Contrary to what @reli-msft emphasize in #3546, I found that zero length Unicode codepoint handling is the key here.

I searched a little and found https://gitlab.freedesktop.org/terminal-wg/specifications/issues/9 which refers https://github.com/ridiculousfish/widecharwidth. Then I was like, oh so there's bunch of unicode codepoint that's not included in WT currently.

Handling zero-width Unicode characters makes a lot of things weird. But here's what I got so far:

图片

Would love to hear your guys' opinion on this.

Validation Steps Performed

@skyline75489
Copy link
Collaborator Author

Everything except Korean language is good now:

图片

I couldn't make Hangul working because DWrite is treating them as they had single cell width. Maybe I missed something.

@reli-msft
Copy link

reli-msft commented Nov 15, 2019

I think we need to do something at cell-allocation level, when we write a string into the buffer, we need to :

  • Properly break the string into clusters (with consideration of combining/zwj/VS/...) and allocate a cell (1 or 2 column) for them;
  • Categorize the clusters, decide:
    • Whether they are “adhensive”, i.e., whether they should be shaped together with the adjacent clusters under same category;
    • If their actual width is wider than the desired width, wwhat should we do (downscale/center/align left/align right)

And, when performing text layout, group adhensive clusters together and shape them together, then perform cluster-to-column alignment according to the cluster category.

Category flags:

enum ClusterCategory {
    ScriptCategoryMask   = 0xFF,
    // Non-adhensive scripts
    ScriptUnknown        = 0x00,
    ScriptIdeographs     = 0x01,
    ScriptEmoji          = 0x02,
    ScriptGeometric      = 0x03,
    ScriptNexusGeometric = 0x04, // Box drawing, Block Element, Powerline, etc.
    // Add more items when necessary
    Adhensive            = 0x80,
    // Adhensive scripts
    ScriptWestern        = 0x80,
    // Add more items when necessary

    // Alignment
    AlignCenter          = 0,
    AlignScaleDown       = 0x200,
    AlignLeft            = 0x400,
    AlignRight           = 0x600,
    AlignmentMask        = 0x600,

// Width
    DoubleWidth          = 0x400
}

@skyline75489
Copy link
Collaborator Author

skyline75489 commented Nov 15, 2019

@reli-msft Yeah I think you are right. Without cluster category, some code just doesn't really make sense. For example, in order to handle zero-width VS16 the text cell iterator is moving like this:

it += columnCount > 0 ? columnCount : 2;

If columnCount is zero, you need to move two cell ahead, because VS16 takes up space in string, just like ordinary CJK characters. That's just like "what the hell". The code can't explain itself, even with comments.

To thoroughly handle all kinds of Unicode thing, cluster category classification is a must.

@@ -75,7 +75,7 @@ namespace
UnicodeRange{ 0x2d8, 0x2db, CodepointWidth::Ambiguous },
UnicodeRange{ 0x2dd, 0x2dd, CodepointWidth::Ambiguous },
UnicodeRange{ 0x2df, 0x2df, CodepointWidth::Ambiguous },
UnicodeRange{ 0x300, 0x36f, CodepointWidth::Ambiguous },
UnicodeRange{ 0x300, 0x36f, CodepointWidth::Combining },
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is definitely wrong since it's generated. Just showing the basic idea.

@reli-msft
Copy link

@skyline75489 I plan to have a meeting with Console people to find out a proper solution for this.
Since Windows Terminal do not care too much about compatibility, it is the time to do the right thing 😎.

@skyline75489
Copy link
Collaborator Author

@reli-msft You made my day, bro. Go get them.

@skyline75489
Copy link
Collaborator Author

@miniksa Come to think about it, maybe #3458 should be merged first. Right now, cluster handling code are scattered everywhere. #3458 at least give us a chance to handle cluster moving in one place inside RenderClusterIterator.

@DHowett-MSFT
Copy link
Contributor

Since Windows Terminal do not care too much about compatibility, it is the time to do the right thing 😎.

Well, now, hold on. This is a good sentiment and all that, but still need to figure out how to deal with applications that read back the contents of the screen adn ahev their understanding of their codepoint to cell mapping violated. This isn't just a problem somebody can come in and solve overnight: Terminal still needs to act compatibly for legacy win32 console applications.

There's already #1472 tracking lighting up ZWJ/ZWNJ and any rendering and buffer concerns where N codepoints maps to M cells. This is work we know needs to be done, I just don't have the folkpower to staff it. 😄

Let it not be understated that we want to do the right thing here. It's just that we have to do the right thing in the context of 35 years worth of application support code and the fact that conhost (who also consumes this codepoint width detector, text buffer and text renderer) is the API server for all text-mode applications on every SKU of Windows everywhere.

@DHowett-MSFT
Copy link
Contributor

To wit: WT can break UI paradigm compatibility and some level of application compatibility, but the code it shares with conhost must uphold the same compatibility guarantees as conhost does.

@skyline75489
Copy link
Collaborator Author

@DHowett-MSFT Easy. If I keep doing open source thing, the company I'm working at will soon be bankrupted and then I can work for MSFT 😎 . Everyone is happy.

Glad to know there's #1472. Gonna subscribe it right now.

@DHowett-MSFT
Copy link
Contributor

LOL. 😁

@reli-msft
Copy link

reli-msft commented Nov 16, 2019

I couldn't make Hangul working because DWrite is treating them as they had single cell width. Maybe I missed something.

Hangul is something more tricky: not all fonts support composite Hangul. I suggest we can isolate the Jamos. Or do NFKC-like composition...
image

ghost pushed a commit that referenced this pull request Jan 21, 2020
<!-- Enter a brief description/summary of your PR here. What does it fix/what does it change/how was it tested (even manually, if necessary)? -->
## Summary of the Pull Request
This change tries to fix column size calculation when shaping return glyphs that represents multiple characters (e.g. ligature).

<!-- Other than the issue solved, is this relevant to any other issues/existing PRs? --> 
## References

This should fix #696.

<!-- Please review the items on the PR checklist before submitting-->
## PR Checklist
* [ ] Closes #xxx
* [X] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA
* [ ] Tests added/passed
* [ ] Requires documentation to be updated
* [ ] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

<!-- Provide a more detailed description of the PR, other things fixed or any additional comments/features here -->
## Detailed Description of the Pull Request / Additional comments

Currently, it seems like CustomTextLayout::_CorrectGlyphRun generally assumes that glyphs and characters have a 1:1 mapping relationship - which holds true for most trivial scenarios with basic western scripts, and also many, but unfortunately not all, monospace "programming" fonts with programming ligatures.

This change makes terminal correctly processes glyphs that represents multiple characters, by properly accumulating the column counts of all these characters together (which I believe is more close to what this code originally intended to do).

There are still many issues existing in both CustomTextLayout as well as the TextBuffer, and the correct solution to them will likely demand large-scale changes, at least at the scale of #3578. I wish small changes like this can serve as a stop gap solution while we take our time to work on the long-term right thing.

<!-- Describe how you validated the behavior. Add automated tests wherever possible, but list manual validation steps taken as well -->
## Validation Steps Performed

Builds and runs. Manual testing confirmed that it solves #696 with both LigConsalata and Fixedsys Excelsior.
DHowett-MSFT pushed a commit that referenced this pull request Jan 24, 2020
<!-- Enter a brief description/summary of your PR here. What does it fix/what does it change/how was it tested (even manually, if necessary)? -->
## Summary of the Pull Request
This change tries to fix column size calculation when shaping return glyphs that represents multiple characters (e.g. ligature).

<!-- Other than the issue solved, is this relevant to any other issues/existing PRs? -->
## References

This should fix #696.

<!-- Please review the items on the PR checklist before submitting-->
## PR Checklist
* [ ] Closes #xxx
* [X] CLA signed. If not, go over [here](https://cla.opensource.microsoft.com/microsoft/Terminal) and sign the CLA
* [ ] Tests added/passed
* [ ] Requires documentation to be updated
* [ ] I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

<!-- Provide a more detailed description of the PR, other things fixed or any additional comments/features here -->
## Detailed Description of the Pull Request / Additional comments

Currently, it seems like CustomTextLayout::_CorrectGlyphRun generally assumes that glyphs and characters have a 1:1 mapping relationship - which holds true for most trivial scenarios with basic western scripts, and also many, but unfortunately not all, monospace "programming" fonts with programming ligatures.

This change makes terminal correctly processes glyphs that represents multiple characters, by properly accumulating the column counts of all these characters together (which I believe is more close to what this code originally intended to do).

There are still many issues existing in both CustomTextLayout as well as the TextBuffer, and the correct solution to them will likely demand large-scale changes, at least at the scale of #3578. I wish small changes like this can serve as a stop gap solution while we take our time to work on the long-term right thing.

<!-- Describe how you validated the behavior. Add automated tests wherever possible, but list manual validation steps taken as well -->
## Validation Steps Performed

Builds and runs. Manual testing confirmed that it solves #696 with both LigConsalata and Fixedsys Excelsior.

(cherry picked from commit 027f122)
ghost pushed a commit that referenced this pull request Apr 4, 2020
This is a subset of #3578 which I think is harmless and the first step towards making things right.
References #3546 #3578 

## Detailed Description of the Pull Request / Additional comments

For more robust Unicode support, `CodepointWidthDetector` should provide concrete width information rather than a simple boolean of `IsWide`. Currently only `IsWide` is widely used and optimized using quick lookup table and fallback cache. This PR moves those optimization into `GetWidth`.

## Validation Steps Performed

API remains unchanged. Things are not broken.
DHowett-MSFT pushed a commit that referenced this pull request Apr 14, 2020
This is a subset of #3578 which I think is harmless and the first step towards making things right.
References #3546 #3578 

## Detailed Description of the Pull Request / Additional comments

For more robust Unicode support, `CodepointWidthDetector` should provide concrete width information rather than a simple boolean of `IsWide`. Currently only `IsWide` is widely used and optimized using quick lookup table and fallback cache. This PR moves those optimization into `GetWidth`.

## Validation Steps Performed

API remains unchanged. Things are not broken.
@skyline75489 skyline75489 deleted the fix/unicode-lay-thehell-out branch February 9, 2021 06:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants