rustc-demangle does not allow encoding version in symbol name #58

Open
EFanZh opened this issue Nov 13, 2021 · 7 comments
@EFanZh

EFanZh commented Nov 13, 2021

RFC 2603 says:

// The <decimal-number> specifies the encoding version.
<symbol-name> = "_R" [<decimal-number>] <path> [<instantiating-crate>]

But it seems that rustc-demangle requires an uppercase letter right after _R:

rustc-demangle/src/v0.rs

Lines 56 to 59 in 2811a1a

match inner.as_bytes()[0] {
b'A'..=b'Z' => {}
_ => return Err(ParseError::Invalid),
}

But I am not sure whether this issue belongs to RFC 2603 or rustc-demangle.

@yjhn

yjhn commented Jul 7, 2022

If I understand correctly, [<decimal-number>] here is used to specify the mangling version. Since only V0 mangling currently exists (legacy mangling is not relevant here), it is assumed that all encodings use it.

@EFanZh
Author

EFanZh commented Jul 7, 2022

I mean that rustc-demangle and RFC 2603 should be consistent. In theory, a third-party library could generate a symbol according to RFC 2603 that rustc-demangle does not recognize. I think either rustc-demangle or RFC 2603 should be modified so that they behave the same way.

@eddyb
Member

eddyb commented Jul 7, 2022

Just to be perfectly clear, [<foo>] in EBNF indicates that <foo> is optional - IMO <foo>? (or just foo?) would be clearer for anyone familiar with regexes, but I didn't write that grammar.

But I think the actual confusion here is a bit weirder - technically the <path> [<instantiating-crate>] part of the grammar isn't actually valid when the <decimal-number> version is present.

So it's more like the grammar is this (different syntax so I can illustrate things):

SymbolName = "_R" version:DecimalNumber? SymbolContents(version)
SymbolContents("") = Path InstantiatingCrate?
SymbolContents("0") = # not defined yet
SymbolContents("1") = # not defined yet
# ...

That is, something like _R0 (which would be confusing but consistent with us encoding the number 0 as no characters, and the number 1 as the character 0, elsewhere in the grammar) or _R1 wouldn't be followed by the existing grammar, but would be their own encodings.
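As a rough illustration of the "0 is no characters, 1 is the character 0" convention mentioned above, here is a minimal sketch, assuming the comment refers to the v0 `<base-62-number>` production (digits `0-9`, `a-z`, `A-Z`, terminated by `_`, with the value offset by one). The function name `encode_base62_number` is hypothetical and not taken from rustc or rustc-demangle.

```rust
/// Digits in the order assumed for <base-62-number>: 0-9, a-z, A-Z.
const BASE62_DIGITS: &[u8; 62] =
    b"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

/// Hypothetical helper: encode `value` as a <base-62-number>.
/// Zero is encoded as no digits at all (just the "_" terminator);
/// any other value `n` is written as `n - 1` in base 62, then "_".
fn encode_base62_number(value: u64) -> String {
    let mut out = String::new();
    if value > 0 {
        let mut rest = value - 1;
        let mut digits = Vec::new();
        // Collect digits least-significant first.
        loop {
            digits.push(BASE62_DIGITS[(rest % 62) as usize]);
            rest /= 62;
            if rest == 0 {
                break;
            }
        }
        // Emit them most-significant first.
        out.extend(digits.iter().rev().map(|&b| b as char));
    }
    out.push('_');
    out
}
```

Under that assumption, `encode_base62_number(0)` yields `"_"` and `encode_base62_number(1)` yields `"0_"`, which is why `_R0` would read as "version 1" under the same convention.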

The only reason the RFC specifies that part is to make it clear that:

  • "_R" <decimal-number> is reserved for future encoding versions (not specified by RFC 2603)
  • a valid symbol of the v0 grammar isn't allowed to overlap with those reserved encodings, i.e. <path> cannot start with a <decimal-number>
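The two points above can be sketched as a small classifier. This is only an illustration of the reservation rule, not rustc-demangle's actual code: the `SymbolKind` enum and `classify_symbol` function are hypothetical names, and other prefix variants a real demangler might accept are ignored.

```rust
/// Hypothetical classification of a symbol by what follows "_R".
#[derive(Debug, PartialEq)]
enum SymbolKind {
    /// "_R" followed directly by an uppercase tag: the v0 grammar.
    V0,
    /// "_R" followed by a decimal digit: reserved for a future encoding.
    ReservedVersion,
    /// Anything else: not a symbol this sketch recognizes.
    NotRecognized,
}

/// Inspect only the first byte after "_R", mirroring the idea that a
/// v0 <path> can never start with a <decimal-number>.
fn classify_symbol(symbol: &str) -> SymbolKind {
    let Some(rest) = symbol.strip_prefix("_R") else {
        return SymbolKind::NotRecognized;
    };
    match rest.as_bytes().first().copied() {
        Some(b'A'..=b'Z') => SymbolKind::V0,
        Some(b'0'..=b'9') => SymbolKind::ReservedVersion,
        _ => SymbolKind::NotRecognized,
    }
}
```

With this split, a demangler that only knows v0 can report `ReservedVersion` as "newer encoding, not supported" instead of treating the symbol as generically invalid.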

cc @michaelwoerister (for extra confirmation in case I'm misremembering something)

@michaelwoerister
Member

Yes, I agree with @eddyb's assessment. The version tag is supposed to allow for evolving the symbol mangling grammar. At minimum it should quickly tell a demangler if it can handle the symbol. We shouldn't have made it optional though. Instead, we should have required a hard-coded 0 there for v0.

In practice v0 requires there to be no version number. The presence of a version number means that it's some newer grammar version.

In retrospect I'm not sure how useful the whole "versioning scheme" is 😅

@eddyb
Member

eddyb commented Jul 7, 2022

> In retrospect I'm not sure how useful the whole "versioning scheme" is

I've suggested using it to specify compressed encodings, where there might be a small risk of the compressed form accidentally overlapping with a valid (but likely nonsensical) mangling, though that's just as easily accomplished by leaving the [_0-9a-zA-Z] charset just after _R (some kind of unicode shenanigans could be fun but may cause weird issues just as well, lol).

But yeah, nothing on the immediate horizon, for now we can keep extending v0 as long as we don't cause overlaps.

@michaelwoerister
Member

True, for compressed schemes we could use it pretty soon. E.g. define _R0<128-bit-hash-in-base64>.

About extending v0, I'm not so sure. If we add new grammar productions, e.g. for const generics, demanglers have no way of gracefully failing there (e.g. by skipping just the part they don't understand). That question came up in rust-lang/rust#97571 and I don't have a good answer for it.

@eddyb
Member

eddyb commented Jul 7, 2022

To be clear, I was talking about reversible compression, not hashing.

E.g. zstd with aggressive settings, and using an "official" dictionary trained from a large symbol dataset. Also, this would require symbols to be able to contain arbitrary bytes (although UTF-8 might be an issue), so it wouldn't be available on all platforms, etc. (and even the uncompressed data could probably take advantage of not being limited to the original charset, so that it fits byte-oriented compression even better).

Such ideas came up during the RFC (as a way to get smaller symbols but keep all the information), but were postponed - with that context, my suggestion was merely to still do something like that, but take advantage of the version field to distinguish it from uncompressed v0 symbols.


> About extending v0, I'm not so sure.

I mean, I hope we don't have to, but IMO it's less of an annoyance to work around a few symbols (the ones using the newer features) not getting demangled (by passing them through e.g. up-to-date rustfilt) than getting lossy output from the older demangler.

OTOH, once you have unsigned integers and (path) constructor application you can represent pretty much any constant, like if we didn't have the n prefix for negative signed integers we could use, say, minus(5) or core::ops::Neg::neg(5) (it's ridiculous, but it would be encoding unique information).
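For context on the `n` prefix mentioned above, here is a minimal sketch, assuming the v0 constant payload has the shape `["n"] {<hex-digit>} "_"` (an optional `n` for negative values, the magnitude in lowercase hex, then `_`). The function name `encode_signed_const_data` is hypothetical, and the type tag that would precede the payload in a real symbol is left out.

```rust
/// Hypothetical helper: encode a signed integer constant's payload as
/// an optional "n" sign marker, the magnitude in lowercase hex, and "_".
fn encode_signed_const_data(value: i64) -> String {
    // unsigned_abs avoids overflow for i64::MIN.
    let magnitude = value.unsigned_abs();
    let sign = if value < 0 { "n" } else { "" };
    format!("{sign}{magnitude:x}_")
}
```

Under that assumption, `-5` becomes `n5_`. Without the `n` marker, the same information would have to be spelled through some constructor-like form such as the `minus(5)` idea above.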
