Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow non-ASCII identifiers #2455

Closed
wants to merge 1 commit into from
Closed

Conversation

DemiMarie
Copy link

@DemiMarie DemiMarie commented Jun 2, 2018

@pyfisch
Copy link
Contributor

pyfisch commented Jun 2, 2018

I doubt that the confusability check can be implemented effectively. There are just too many characters in Unicode that look confusingly similar. Often they are from different scripts like latin a and russian а. One way to limit confusion would be to allow only one script in a identifier as it is done for international URLs.

On the other hand I doubt that homograph attacks will be a problem as you already need to trust the people who write your code.

@BurntSushi
Copy link
Member

This renders Rust vulnerable to homograph attacks, which could be used to sneak code past code review.

Are there other languages with permissive Unicode identifiers that we could survey? How prevalent are homograph attacks?

Separately from attack vectors, I like the idea itself though, particularly as a way to improve failure modes.

@SimonSapin
Copy link
Contributor

SimonSapin commented Jun 2, 2018

This RFC is incomplete IMO.

The reference-level section needs a detailed specification of the proposed behaviors. NFC is unambiguous enough but “the current set of allowed characters” means “whatever happened to be implemented unstable in June 2018” and doesn’t cut it. Similarly for confusables. https://unicode.org/reports/tr31/ and https://www.unicode.org/reports/tr39/#Confusable_Detection are relevant, but they each define multiple variations to pick from.

(Edit: it also needs discussion of what happens when the compiler updates to a new version of Unicode: #2455 (comment).)

Motivation and alternatives should be discussed at a finer granularity. Why error on non-NFC rather than have the parser do the normalization itself and have the compile compare normalized identifiers in name resolution? Many people are not familiar Unicode normalization, what do we expect them to do with a such compile error? Even for those who are, many text editors do not provide a way any UI for normalization.

Regarding confusables, I agree with @pyfisch that the threat model of homograph attack during code review seems dubious. If however we mean to help people who accidentally type an ident that doesn’t compare equal to what they expect, why not only look for confusables in diagnostics, without changing whether a compiler error happens in the first place?

The same-file boundary seems arbitrary. Calling a method defined in another file happens all the time.

Finally, for this feature in particular I would like to see a lot more research in the prior art section. Different programming languages do slightly different things here (e.g. rust-lang/rust#28979 (comment) and rust-lang/rust#28979 (comment)). Before we stabilize this I’d like to find out what those differences are in a wide range of programming languages and why they made those decisions. Some choices might be arbitrary, but others might be guided by reasons that would also apply to Rust. Yes this is a lot of work, and yes I haven’t been doing that work myself either. But I think this is what would produce the best outcome.

@shingtaklam1324
Copy link

shingtaklam1324 commented Jun 2, 2018

Unicode characters out of the ASCII range can be harder to type, and potentially harder to read. For most characters in Unicode, there is a representation using only ASCII characters, (ie. θ => theta).

I do recognize that in some cases, (Maths/ Science) libraries, it can make the code easier to read, but to be fair, there is a lot more to this than to simply handle collisions of symbols which look similar.

There is also this discussion on /r/rust which shows one case of this (in Go), and the community opinion on this topic.


Edit:

Doing some more research, here is the PEP for it, and they have pretty good justification for why it should be allowed, so now I'm quite torn on the issue, as it would make sense to have the internationalization.


Edit2: I derped and added the link to Idris without reading it properly. Anyways... Removed that link now

@Centril Centril added the T-lang Relevant to the language team, which will review and decide on the RFC. label Jun 2, 2018
@repax
Copy link

repax commented Jun 2, 2018

I'm somewhat concerned about the potential for homograph attacks and think that we should be biased towards caution. A conservative alternative is to introduce a select few sets of characters rather than trying to come up with a complete and foolproof solution.

- What parts of the design do you expect to resolve through the RFC process before this gets merged?
- What parts of the design do you expect to resolve through the implementation of this feature before stabilization?
- What related issues do you consider out of scope for this RFC that could be addressed in the future independently of the solution that comes out of this RFC?
- How can the confusability check be implemented? We should have a fast path when there are *no* non-ASCII identifiers, but we also must be fast (at worst, O(N log N)) when there *are* non-ASCII identifiers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be the core of the detailed design, not an unresolved question, IMO.

@Serentty
Copy link

Serentty commented Jun 2, 2018

@shingtaklam1324 For me, the biggest reason to add this is not so that programmers can type Greek letters for mathematical variables (such as ∆ for delta), but so that identifiers can be written in any language. I'm not opposed to the use of a standard language for open source projects to facilitate collaboration, but I do think that developers should still be given a choice, instead of being told what's best for them.

@DemiMarie
Copy link
Author

@repax My proposal is that any attempt at a homograph attack automatically results in an error message. That is why it requires the compiler to check for confusability.

@pyfisch I believe that a confusability check can be done efficiently. After normalizing to NFC, replace every character in every identifier by the lowest-valued character it could be confused with. If any identifiers are the same now that were not the same before, report an error.

@pyfisch
Copy link
Contributor

pyfisch commented Jun 2, 2018

@DemiMarie After reading the Unicode Technical Standard #39 on confusable detection mentioned by @SimonSapin I agree that there is an efficient algorithm. But I infer from the RFC and your comments that you are either unaware of the algorithm and the associated table or that you want to use a different confusable detection function. Either way I do agree with @SimonSapin and @sfackler that the RFC is incomplete and

The reference-level section needs a detailed specification of the proposed behaviors.

If a table based solution like the one from Unicode Technical Standard #39 is used detected confusables should not be a hard error but a warning as the confusion tables are expected to improve and erroring on additional characters may break existing programs.

@burdges
Copy link

burdges commented Jun 2, 2018

Didn't we address this in #1565 ? ;)

In all seriousness, I do believe homograph attack against code review sounds plausible, so ideally anything like this should live behind a feature gate, even after stabilization.

There is no defense in restricting the script per identifier like @pyfisch mentioned. Also, you cannot do the confusability checks at the file level because this breaks using std. I think the feature gate itself could enable specifics scripts though

#[allow_unicode_script(...)]

Also, error messages should say exactly what script to enable, making this painless. We could do crate level confusability checks based on the scripts enabled.

We should likely allow use some_crate::foreign_ident as local_ident with foreign_ident outside the allowed scripts, but local_ident must be allowed of course.

@est31
Copy link
Member

est31 commented Jun 3, 2018

I'm a german and even German script, even though it uses the latin alphabet, has characters that are not in ascii (äöüß). Most languages that use the latin alphabet have. And of course, there are so many other languages that don't use the latin alphabet. I think it is a legitimate demand that you shall be able to code in your native language and script. So generally I'm in favour of RFC.

Note that even though I agree there should be an option, I don't think it is a good idea for the code I am writing. As a german, I could make e.g. the comments in my public codebases in german. Like so many other non-native english open source developers, I chose to make them in english instead. This is highly useful because the number of german speakers is so much smaller than the number of english speakers. I can accept PRs from people in france, the USA, etc. They all understand what my code is doing, I understand what their code is doing.

From a beauty aspect, I prefer consistent codebases. Most if not all of the crates, all of std, all the keywords, all of the rustc error messages are in english. Starting to use german would create a half english half german codebase which is something I don't find appealing. My personal view at least. Even if you have translated error messages, this causes googleability to suffer dramatically.

Another advantage of keeping non ascii identifiers outside of codebases is that everyone with an ascii capable keyboard can easily edit stuff. Greek letters look nice but they are hard to edit. Imagine someone non technical (and non greek) doing their first steps in programming and they see µ and wonder how to type that weird u thing.

So I agree with @Serentty that the main use case of this proposal should be internationalisation. This is a great goal. Rust should be international and there are programming communities which use different languages as lingua franca than english.

Currently, the RFC intends to simply turn off the ascii ident check, without any opt-in or something else. Basically, you can accidentially start using ascii identifiers even though you might want to keep your codebase ascii only. I agree with @burdges that this should be different.

I would love the default to be to allow ascii only idents only and error on non ascii idents using a default-error lint. Anyone who works on non-english codebases then can turn off that lint.

The idea of making the lint customizeable using scripts is really awesome. It would mean that the lint is still useful, even for those who use a native script. Generally if you code on a codebase that uses german idents, you don't want e.g. russian/bulgarian script or idk runes or something.

@shingtaklam1324
Copy link

Unicode Technical Report #31 is also worth a read. It has a list of suggested scripts to allow and disallow, which I think we should consider, and not just allow all Unicode Characters like some other languages do. This could be set up in a way like lint levels, and set up specific lints for the different character sets with different warnings by default.

@mark-i-m
Copy link
Member

mark-i-m commented Jun 3, 2018

I would be unhappy if this was added for a few reasons:

  • (I'm obviously biased on this point) I think it would be good to nudge the use of English identifiers. This will, I think, preserve the usability of libraries by the whole ecosystem. For example, I personally am extremely unlikely to use a library is there docs are in another language. Does anyone know if this has been a problem in other languages like Swift or Julia?

  • Non-ascii idents are a pain to type. I spent a lot of time the last few months writing Julia, where this seems to be more common practice. Typing these idents significantly reduces ergonmics, and I usually end up doing some sort of error-prone copy/pasting. This is even worse if you use a terminal-based editor.

  • Not all terminals/fonts display all Unicode chars correctly. This means that people like me who use vim (or... emacs) might have added pain. Moreover any compiler errors that contain special characters might look weird.

  • It might be possible to do weird things with hidden characters. That could be a security problem. Is this covered by the confusability check proposed?

@Centril
Copy link
Contributor

Centril commented Jun 3, 2018

I agree with @SimonSapin agreeing with @pyfisch that the risk of homograph attacks / risks to security is dubious. It seems to me that there are far more risky things security wise than unicode idents.

During my first year as a computer science student, we had two professors who insisted on talking about CS terms using their Swedish names and used Swedish in comments and variable names. Most students, as far as I recall, really disliked the experience, because they were not learning the international terms then, so they had to re-learn the terms in English. These students, among them myself, were not native English speakers, however, knowledge of English is expected in Swedish society. Therefore, I would not use Swedish in any code nor accept it in my projects.

That said, it seems important to me to support the internationalization of Rust because English is far from a lingua franca across the world. So I support the principle of non-ASCII idents.

To satisfy both those who want to avoid non-ASCII idents in their code and those who want to use it, I propose two lints (either in clippy or the compiler..):

non_ascii_ident       // allow by default
pub_non_ascii_ident   // deny by default

The first allows you to use non-ASCII idents in non-exported code.
The second allows you to use them in public code.

@SimonSapin
Copy link
Contributor

SimonSapin commented Jun 3, 2018

Another concern is what happens when we update the compiler’s version of the Unicode character database. I think we don’t want to get stuck on an old version forever, but we also don’t want to start emitting errors for programs that used to compile correctly with a previous rustc. This requires some careful tweaks to the algorithm, see for example https://unicode.org/reports/tr15/#Stabilized_Strings.

@mark-i-m
Copy link
Member

mark-i-m commented Jun 3, 2018

One more concern: FFI. What happens if I no_mangle an ident with a funny character? I'm guessing this should be disallowed?

@est31
Copy link
Member

est31 commented Jun 3, 2018

@mark-i-m we could start off by forbidding no_mangle and extern fn's with such names.

@scottmcm
Copy link
Member

scottmcm commented Jun 3, 2018

For the upgrading tables question: this seems like something that could be done on epoch boundaries. Always warn using the latest tables, and error using the latest table when the epoch locked.

(I'm not sure that's better than a deny-by-default lint that always uses the latest tables, though, which I think would be allowed under the stability guarantee.)

This is the technical portion of the RFC. Explain the design in sufficient detail that:
This RFC proposes that non-ASCII identifiers be stabilized, with the current set of allowed characters. However, there are restrictions to avoid visual confusion:

* No two identifiers that are distinct but visually confusable may be in the same scope.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this interact with use foo::*; use bar::*;? Does this mean that minor version updates to crates can stop my code compiling because they introduce distinct-but-confusable names?

@the8472
Copy link
Member

the8472 commented Jun 3, 2018

Maybe it would make sense to require or encourage ascii aliases for non-ascii identifiers?

This could solve many problems:

  • Latin-only keyboard? just input the ascii version and let RLS suggest the unicode version
  • Error messages could either refer to the unicode or the ascii version
  • code-bases could elect whether they want to consume an API in the unicode or the ascii version
  • FFI could always use the ascii alias

@pyfisch
Copy link
Contributor

pyfisch commented Jun 3, 2018

I've written an alternative proposal at #2457 that offers more a precise reference-level explanation and explores several other designs.

@DemiMarie
Copy link
Author

Closing in favor of #2457, which is vastly better written.

@DemiMarie DemiMarie closed this Jun 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-lang Relevant to the language team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.