Unicode "negative squared latin" letters not picked up by normalisation algorithm #17120

Open
kara-louise opened this issue Sep 5, 2024 · 23 comments
Labels
good first issue (GitHub features these at https://github.com/nvaccess/nvda/contribute), p4 (https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority), triaged (has been triaged, issue is waiting for implementation)

Comments

@kara-louise

Steps to reproduce:

Read the following line of characters

🅻🅸🅲🅴

Actual behavior:

The characters are not normalised. The spoken result depends on the synthesiser used. The Unicode character numbers are sent to a Braille display.

Expected behavior:

If normalisation is enabled, the characters should be normalised.

NVDA logs, crash dumps and other attachments:

n/a

System configuration

NVDA installed/portable/running from source:

installed

NVDA version:

NVDA version 2024.4beta2

Windows version:

Windows 10 Version 22H2 (OS Build 19045.4842)

Name and version of other software in use when reproducing the issue:

n/a

Other information about your system:

Other questions

Does the issue still occur after restarting your computer?

yes

Have you tried any other versions of NVDA? If so, please report their behaviors.

Same issue occurs with alpha-33832,9d15b169 (2025.1.0.33832)

If NVDA add-ons are disabled, is your problem still occurring?

yes

Does the issue still occur after you run the COM Registration Fixing Tool in NVDA's tools menu?

n/a

@Adriani90
Collaborator

Adriani90 commented Sep 5, 2024

@LeonarddeR is it possible to extend the algorithm? This impacts other enclosed alphanumeric supplement characters as well, e.g. regional indicator symbol letters, negative circled letters and some other symbols.
Here is the complete list:
https://en.wiktionary.org/wiki/Appendix:Unicode/Enclosed_Alphanumeric_Supplement

Most of the symbols in that list work perfectly though.

@ABuffEr
Contributor

ABuffEr commented Sep 5, 2024

Hi,
not sure if it's useful, but on the topic of extending the algorithm: I noticed that in NVDA (Python 3.11.9), unicodedata.unidata_version returns 14.0.0, while this package currently bumps it to 15.1.0.
Maybe it could be included as an external dependency, to keep everything up to date.

@LeonarddeR
Collaborator

It is possible to extend the algorithm by expanding textUtils.unicodeNormalize. What's the idea behind these negative squared letters? I don't think an update of unicodedata will normalize them properly.
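
One way to check this, as a minimal sketch: Python's standard unicodedata module exposes each character's decomposition mapping, which is what compatibility normalisation relies on. The sketch assumes NFKC is the normalisation form in use, and the helper name `describe` is hypothetical (not part of NVDA).

```python
import unicodedata


def describe(ch: str) -> tuple[str, str, str]:
    """Return (name, decomposition mapping, NFKC result) for one character."""
    return (
        unicodedata.name(ch),
        unicodedata.decomposition(ch),
        unicodedata.normalize("NFKC", ch),
    )


# U+1D46A MATHEMATICAL BOLD ITALIC CAPITAL C carries a <font> mapping to "C".
print(describe("\U0001D46A"))

# U+1F17B NEGATIVE SQUARED LATIN CAPITAL LETTER L has an empty mapping,
# so NFKC returns the character unchanged.
print(describe("\U0001F17B"))
```

Because the mapping is empty in the Unicode data itself, a newer copy of the database would not be expected to change the result for these letters.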

@kara-louise
Author

kara-louise commented Sep 6, 2024

What's the idea behind these negative squared letters?

@LeonarddeR I assume that the original use for them is for scientific notation since a lot of similar characters are prefixed with the word "mathematical". Not sure why this lot aren't though.
How they're used a lot these days (as in the example in my original comment) is as another sort of "fancy text", i.e. to give the appearance of formatted text in places where you can't use it, such as social media screen names. I saw what I pasted in on Mastodon originally.
There are websites that will convert what you type into the Unicode characters of your choosing, and presumably that was one of the options.

@LeonarddeR
Collaborator

@ABuffEr this unicodedata package received its last update over a year ago. I think it is unlikely that that package will fix these cases anyway.

@ABuffEr
Contributor

ABuffEr commented Sep 6, 2024

@ABuffEr this unicodedata package received its last update over a year ago. I think it is unlikely that that package will fix these cases anyway.

I imagine that's because, according to this page, Unicode 15.1.0 is the latest version, released in 2023. On the other hand, 14.0.0 dates back to 2021. That said, I don't know whether this would make any difference here.

@Adriani90
Collaborator

@LeonarddeR the Enclosed Alphanumeric Supplement has been part of Unicode since version 5.2, and the last symbols were added in 2020, so 14.0 should already contain all these symbols.
Usually they are used to make text stand out visually. They are also used very often in Japanese contexts, and in cases such as indicating country flags.

I guess if the normalization algorithm cannot handle them, we would have to add them to the symbols.dic file, right? I mean, there are probably about 70 symbols from this block that are not supported so far.
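
To get an actual figure rather than an estimate, the block can be scanned directly. This is a rough sketch, assuming NFKC is the normalisation form in use and taking the block range as U+1F100 to U+1F1FF; the function name `unchanged_by_nfkc` is hypothetical, and the count depends on the Unicode version bundled with the Python in use.

```python
import unicodedata

# Enclosed Alphanumeric Supplement block (assumed range U+1F100..U+1F1FF).
BLOCK = range(0x1F100, 0x1F200)


def unchanged_by_nfkc(code_points):
    """List assigned characters that NFKC normalisation leaves as they are."""
    result = []
    for cp in code_points:
        ch = chr(cp)
        # Skip code points that are unassigned in this Python's Unicode database.
        if not unicodedata.name(ch, ""):
            continue
        if unicodedata.normalize("NFKC", ch) == ch:
            result.append(ch)
    return result


unsupported = unchanged_by_nfkc(BLOCK)
print(unicodedata.unidata_version, len(unsupported))
```

The resulting list is the set of characters that would need a supplementary mapping or a symbols.dic entry.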

@LeonarddeR
Collaborator

Adding them to symbols.dic is an option yes, but that will still not normalize them when speaking by word or line.
I'd personally create a supplementary Unicode normalization dictionary in code:

supplement = {
    "🅻": "L",
    # ...
}

Then feed that to str.maketrans and use the result of a call to str.translate in textUtils.unicodeNormalize.
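
A minimal sketch of that idea follows. It is not NVDA's actual textUtils.unicodeNormalize: the function name `normalize_with_supplement` is hypothetical, NFKC is assumed as the normalisation form, and the table is built from the code point range U+1F170 to U+1F189 (negative squared A to Z) rather than typed out by hand.

```python
import unicodedata

# Hypothetical supplement: NEGATIVE SQUARED LATIN CAPITAL LETTER A..Z
# (U+1F170..U+1F189) mapped to plain "A".."Z".
SUPPLEMENT = {chr(0x1F170 + i): chr(ord("A") + i) for i in range(26)}

# str.maketrans turns the dict into a translation table keyed by code point.
SUPPLEMENT_TABLE = str.maketrans(SUPPLEMENT)


def normalize_with_supplement(text: str) -> str:
    """Apply NFKC, then replace characters that NFKC leaves untouched."""
    return unicodedata.normalize("NFKC", text).translate(SUPPLEMENT_TABLE)


# The four characters from the original report, written as escapes.
print(normalize_with_supplement("\U0001F17B\U0001F178\U0001F172\U0001F174"))  # → LICE
```

Building the table once at import time keeps the per-call cost to a single str.translate pass.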

@seanbudd
Member

@LeonarddeR - I think these should just go in the symbols dictionary - I don't think a supplementary dictionary for normalization is going to be very maintainable

@seanbudd seanbudd added p4 https://github.com/nvaccess/nvda/blob/master/projectDocs/issues/triage.md#priority good first issue github features these at https://github.com/nvaccess/nvda/contribute triaged Has been triaged, issue is waiting for implementation. labels Sep 10, 2024
@CyrilleB79
Collaborator

Before discussing a solution, let's focus on the expected result.

Regarding the character "🅻":

  • if normalization is on, I'd expect it to be just replaced by an "L" character.
  • if normalization is off, I would want to be able to know that it's not a normal "L". Whether I hear "negative squared latin L" (which unfortunately would also need to be translated) or something reported by the synth (e.g. "letter 1F17B" with eSpeak) needs to be discussed. But in any case, it should be something different from what is heard when normalization is on.

@SaschaCowley
Member

Personally, my preference would probably be to have it replaced by "L" when doing anything but reading by character, but when reading by character have it read as "negative squared latin L".

@XLTechie
Collaborator

XLTechie commented Sep 11, 2024 via email

@kara-louise
Author

The Unicode ASCII add-on by Sukil Etxenike from the Spanish add-ons store is able to sort of normalise the above characters. I say "sort of" because for some reason they appear as "[L][I][C][E]".
I don't know what that add-on is doing differently from other normalisation tools such as the Unicode Normalization Test Page, which can't normalise them. So it might be worth investigating that add-on's source code to see how it works.

@Adriani90
Collaborator

Adriani90 commented Sep 11, 2024 via email

@CyrilleB79
Collaborator

Personally, my preference would probably be to have it replaced by "L" when doing anything but reading by character, but when reading by character have it read as "negative squared latin L".

Do you mean with normalization on? This would be inconsistent with the process applied to other normalized characters such as "𝑪". When normalization is enabled, we would expect for "🅻":

  • "L normalized" when navigating by character
  • "L" in all other cases, i.e. navigating by word, line, during say all, etc.

Adding "🅻" to a symbol file may achieve something useful for the user, but it does not give a UX similar to that of characters for which normalization already works.

So if we want to discuss what can be heard when normalization is off (e.g. "negative square letter L"), I'd recommend doing it in a separate issue.
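
The behaviour described above for characters that already normalise can be sketched as follows. This is only an illustration of the expected output, not NVDA's speech code: the function name `spoken_form` is hypothetical, NFKC is assumed, and the "normalized" suffix is taken from the wording used in this comment.

```python
import unicodedata


def spoken_form(ch: str, by_character: bool) -> str:
    """Illustrate what a user hears for one character with normalization on."""
    normalized = unicodedata.normalize("NFKC", ch)
    if normalized == ch:
        # Nothing changed, so the character is passed through as-is.
        return ch
    # Character navigation flags the change; other navigation units do not.
    return f"{normalized} normalized" if by_character else normalized


# U+1D46A MATHEMATICAL BOLD ITALIC CAPITAL C already normalises.
print(spoken_form("\U0001D46A", by_character=True))
print(spoken_form("\U0001D46A", by_character=False))
```

With NFKC alone, "🅻" falls into the pass-through branch, which is the gap this issue is about.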

@LeonarddeR
Collaborator

If these characters go in the symbol dictionary, this has nothing to do with normalization, and normalization will not work when reading by word or line.

@ABuffEr
Contributor

ABuffEr commented Sep 11, 2024

Personally, my preference would probably be to have it replaced by "L" when doing anything but reading by character, but when reading by character have it read as "negative squared latin L".
Do you mean with normalization on? This would be inconsistent with the process applied to other normalized characters such as "𝑪".

In fact, you get "normalized C", completely missing that it is a Bold Italic styled C. Too flattened, in my opinion.
So, if I understand correctly, I agree with @SaschaCowley, even if it requires a change against the current situation.

@Adriani90
Collaborator

In fact, you get "normalized C", completely missing that is a Bold Italic styled C.

That's exactly the expected behavior in this case. Bold, script, squared, circled or whatever are details that are usually totally irrelevant when reading the text with a screen reader, because in these special cases the properties exist only for visual purposes. No one would pronounce these characters with their full Unicode name. I agree that in some use cases, e.g. if you want to write a publication yourself and need these characters to meet sighted users' needs, you need these properties, but then you can use the Character Information add-on to get the full Unicode name. Pronouncing characters according to the Unicode standard is too verbose. That is the experience in JAWS, and to be honest it is horrible to explore a publication with such alphanumeric characters and hear the whole Unicode names, even when navigating character by character, which is sometimes needed.
So the current normalization style in NVDA is the most convenient way to handle these characters. But if this is not achievable with these alphanumeric supplements, I suggest we try symbols.dic and at least get the characters announced in some situations.

@Adriani90
Collaborator

An alternative would be to integrate the Character Information add-on into NVDA itself and report the full Unicode name by pressing e.g. NVDA+comma, but then we still don't have a database of Unicode names that is fully translated into several languages.

@Adriani90
Collaborator

Another alternative would be to retrieve the formatting of these characters from Unicode and include it in the NVDA+f command, so that you can get the formatting of a character on demand as well, but I guess this would be a huge workload.

@CyrilleB79
Collaborator

@ABuffEr and @Adriani90, have you read #17120 (comment)?

The initial request is that the normalization (as currently implemented in NVDA) also work with some more characters, namely the negative squared letters.

If you wish to discuss other topics, please, please, open a new issue. These new topics include:

  • improvements for the normalization feature
  • re-defining the name with which a character is reported (e.g. with an extra symbol file or using the existing one)
  • integrating some features of Character Information add-on in NVDA such as Unicode character description

Thanks.

@XLTechie
Collaborator

XLTechie commented Sep 11, 2024 via email

@Adriani90
Collaborator

So back to the issue, actually negative in this context means that the letter has an inverted color, so negative letters are white on a dark background. However, even this detail is not important for the screen reader user when reading the text, it is something that could be announced on demand by retrieving the full unicode name.

So if it is possible to include these letters in the normalization similar to all other alphanumeric letters already, this would be ideal.
