Amend RFC2603 to allow mangled identifiers to start with a digit. #2705

Merged: 1 commit, Oct 23, 2020
42 changes: 19 additions & 23 deletions text/2603-rust-symbol-name-mangling-v0.md
@@ -506,7 +506,7 @@ mod gödel {
 would be mangled as:

 ```
-_RNvNtNtC7mycrateu8gdel_Fqa6escher4bach
+_RNvNtNtC7mycrateu8gdel_5qa6escher4bach
 <-------->
 Unicode component
 ```
@@ -613,10 +613,10 @@ compiler generates mangled names.

 The syntax of mangled names is given in extended Backus-Naur form:

-- Non-terminals are within angle brackets (as in `<name-prefix>`)
+- Non-terminals are within angle brackets (as in `<path>`)
 - Terminals are within quotes (as in `"_R"`),
-- Optional parts are in brackets (as in `[<decimal>]`),
-- Repetition (zero or more times) is signified by curly braces (as in `{<name-prefix>}`)
+- Optional parts are in brackets (as in `[<disambiguator>]`),
+- Repetition (zero or more times) is signified by curly braces (as in `{<type>}`)
 - Comments are marked with `//`.

 Mangled names conform to the following grammar:
@@ -641,11 +641,13 @@ Mangled names conform to the following grammar:
 <impl-path> = [<disambiguator>] <path>

 // The <decimal-number> is the length of the identifier in bytes.
-// <bytes> is the identifier itself and must not start with a decimal digit.
+// <bytes> is the identifier itself, and it's optionally preceded by "_",
+// to separate it from its length - this "_" is mandatory if the <bytes>
+// starts with a decimal digit, or "_", in order to keep it unambiguous.
 // If the "u" is present then <bytes> is Punycode-encoded.
 <identifier> = [<disambiguator>] <undisambiguated-identifier>
 <disambiguator> = "s" <base-62-number>
-<undisambiguated-identifier> = ["u"] <decimal-number> <bytes>
+<undisambiguated-identifier> = ["u"] <decimal-number> ["_"] <bytes>

 // Namespace of the identifier in a (nested) path.
 // It's an a-zA-Z character, with a-z reserved for implementation-internal
@@ -775,29 +777,22 @@ and, for now, only define a mangling for integer values.
 ### Punycode Identifiers

 Punycode generates strings of the form `([[:ascii:]]+-)?[[:alnum:]]+`.
-This is problematic for two reasons:
+This is problematic because of the `-` character, which is not in the
+supported character set; Punycode uses it to separate the ASCII part
+(if it exists), from the base-36 encoding of the non-ASCII characters.

-- Generated strings can contain a `-` character; which is not in the
-supported character set.
-- Generated strings can start with a digit; which makes them clash
-with the byte-count prefix of the `<identifier>` production.
-
-For these reasons, vanilla Punycode string are further encoded during mangling:
-
-- The `-` character is simply replaced by a `_` character.
-- The part of the Punycode string that encodes the non-ASCII characters
-is a base-36 number, using `[a-z0-9]` as its "digits". We want to get
-rid of the decimal digits in there, so we simply remap `0-9` to `A-J`.
+For this reasons, we deviate from vanilla Punycode, by replacing
+the `-` character with a `_` character.

 Here are some examples:

 | Original | Punycode | Punycode + Encoding |
 |-----------------|-----------------|---------------------|
-| føø | f-5gaa | f_Fgaa |
-| α_ω | _-ylb7e | __ylbHe |
-| 铁锈 | n84amf | nIEamf |
-| 🤦 | fq9h | fqJh |
-| ρυστ | 2xaedc | Cxaedc |
+| føø | f-5gaa | f_5gaa |
+| α_ω | _-ylb7e | __ylb7e |
+| 铁锈 | n84amf | n84amf |
+| 🤦 | fq9h | fq9h |
+| ρυστ | 2xaedc | 2xaedc |

 With this post-processing in place the Punycode strings can be treated
 like regular identifiers and need no further special handling.
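
To illustrate the amended post-processing, here is a minimal Rust sketch. It is not part of the RFC text or of any compiler implementation. It assumes the input is already valid Punycode output for a Rust identifier, so the only `-` it can contain is the Punycode delimiter. The function name `mangle_punycode_ident` is hypothetical, and emitting the `_` separator only when it is mandatory is an assumption, since the grammar merely makes it optional otherwise.

```rust
/// Hypothetical helper: builds an `<undisambiguated-identifier>` from a
/// Punycode string, following the amended grammar
/// `["u"] <decimal-number> ["_"] <bytes>`.
fn mangle_punycode_ident(punycode: &str) -> String {
    // The only deviation from vanilla Punycode: `-` becomes `_`.
    let bytes = punycode.replace('-', "_");

    // The `_` separator is mandatory when <bytes> starts with a decimal
    // digit or with `_`; this sketch omits it in all other cases.
    let needs_separator = matches!(bytes.bytes().next(), Some(b'_' | b'0'..=b'9'));
    let separator = if needs_separator { "_" } else { "" };

    // "u" marks the identifier as Punycode-encoded; the length is in bytes.
    format!("u{}{}{}", bytes.len(), separator, bytes)
}
```

With the `gödel` example from the first hunk, `mangle_punycode_ident("gdel-5qa")` yields `u8gdel_5qa`, while a string that starts with a digit, such as `2xaedc`, yields `u6_2xaedc`.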
@@ -1154,3 +1149,4 @@ pub static QUUX: u32 = {
 - Resolve question of complex constant data.
 - Add a recommended resolution for open question around Punycode identifiers.
 - Add a recommended resolution for open question around encoding function parameter types.
+- Allow identifiers to start with a digit.
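
The decoding side of the amended `<undisambiguated-identifier>` production can be sketched in Rust as follows. This is an illustration under stated assumptions, not the demangler's actual code: `parse_ident` is a hypothetical name, the input is assumed to begin exactly at an `<undisambiguated-identifier>`, and the length is parsed as plain decimal digits without the grammar's restrictions on `<decimal-number>`.

```rust
/// Hypothetical parser for `["u"] <decimal-number> ["_"] <bytes>`.
/// Returns (is_punycode, identifier bytes, remaining input), or `None`
/// if the input is malformed or shorter than the declared length.
fn parse_ident(input: &str) -> Option<(bool, &str, &str)> {
    // An optional leading "u" marks a Punycode-encoded identifier.
    let (is_punycode, rest) = match input.strip_prefix('u') {
        Some(r) => (true, r),
        None => (false, input),
    };

    // Read the decimal length prefix (greedily, as a demangler must).
    let digits_end = rest
        .bytes()
        .position(|b| !b.is_ascii_digit())
        .unwrap_or(rest.len());
    let len: usize = rest[..digits_end].parse().ok()?;
    let rest = &rest[digits_end..];

    // A `_` directly after the length is the separator. It is always
    // present when <bytes> itself starts with `_` or a digit, so
    // skipping one `_` here never eats part of the identifier.
    let rest = rest.strip_prefix('_').unwrap_or(rest);

    // `get` returns None if fewer than `len` bytes remain.
    let ident = rest.get(..len)?;
    Some((is_punycode, ident, &rest[len..]))
}
```

The separator is what keeps a digit-initial identifier decodable: `3_1st` parses as the 3-byte identifier `1st`, whereas `31st` would be read as a length of 31 and rejected by this sketch.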