Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for SipHash128 with all-zero key #40

Open
briansmith opened this issue Jun 20, 2024 · 13 comments
Open

Add support for SipHash128 with all-zero key #40

briansmith opened this issue Jun 20, 2024 · 13 comments

Comments

@briansmith
Copy link

briansmith commented Jun 20, 2024

I referenced your work on the ASCII identifier MD5 collision in rust-lang/rust#10389 (comment). It would be cool if we could extend your tools to also work on SipHash128 with an all-zero key being used as a hash function. Is that something that you would accept? Any tips for contributing it?

@briansmith
Copy link
Author

See also discussion in typst/comemo#3.

Note that I'm suggesting adding support for the weaker SipHash 1-3 with 128-bit output, not the default stronger SipHash 2-4 with the default 64-bit output.

@cr-marcstevens
Copy link
Owner

Does rust currently uses 64-bit TypeIds ?

@cr-marcstevens
Copy link
Owner

What would happen if there was a collision?

@bjorn3
Copy link

bjorn3 commented Jun 20, 2024

Does rust currently uses 64-bit TypeIds ?

It used to, but has been changed to 128-bit a while back.

What would happen if there was a collision?

You can transmute one type to another if they have the same TypeId hash even if doing so would violate memory safety.

@cr-marcstevens
Copy link
Owner

Only transmutation by choice? But no bad side consequences that could creep up on people?

@briansmith
Copy link
Author

Does rust currently uses 64-bit TypeIds ?

  • Rust originally used the original SipHash 2-4 with 64-bit output for TypeId. I believe somebody demonstrated a collision so it switched to SipHash 2-4 128-bit output. Then it switched to SipHash 1-3 with 128-bit output.

What would happen if there was a collision?

As @bjorn3 said, various "safe" APIs in Rust, basically anything that uses TypeId or Any in Rust, would become unsound; i.e. there could be memory-safety issues even in code that doesn't use unsafe if they use those APIs with types that have collisions in their TypeIds.

See rust-lang/rust#10389 (comment) for an example from back when the shorter TypeIds were used.

@briansmith
Copy link
Author

Only transmutation by choice? But no bad side consequences that could creep up on people?

The Any and TypeId APIs are easily reachable from anything that "downcasts", e.g. https://doc.rust-lang.org/stable/std/io/struct.Error.html#method.downcast. Those don't even mention Any or TypeId.

The SipHash-based hasher is also used for some equality comparisons within the compiler, so potentially when compiling one type could be silently substituted for another. So in theory just compiling two crates (or a single crate) that defines types with typeid collision could result in a miscompilation.

There is effort underway to further use SipHash for other uses, e.g. rust-lang/cargo#13171 (comment). I find the discussion about that a bit confusing still.

@cr-marcstevens
Copy link
Owner

But a collision of course must follow the apropriate format of how types are converted to a message input to siphash. Is there a (simple) description of this format?

@briansmith
Copy link
Author

briansmith commented Jun 20, 2024

See also rust-lang/rust#41215 which seems to imply that it is used in the incremental compilation cache.

See also the use in the mold linker: rui314/mold@fa8e95a. My understanding is that when Siphash(k, x) = Siphash(k, y) all references to the the object code y will be replaced to instead refer to the object code x (i.e. x and y will be "folded" together as identical code). IIUC from the twitter discussion, mold may use a random key.

@bjorn3
Copy link

bjorn3 commented Jun 20, 2024

Is there a (simple) description of this format?

It depends on the rustc version and isn't directly written down in the rustc source code but created using code generation while compiling rustc (to be precise HashStable is derived using a proc macro). The format seems to roughly be the following:

Take the index of the TyKind variant and hash it as 64-bit(?) little endian number (counting from zero). Then hash all the fields. For simple enums recurse the same way. For DefId, look up the DefPathHash (which should have collisions detected already in the absence of dylibs) and hash the resulting 128-bit hash stored in the DefPathHash. For any Ty you can hash the inner TyKind. For RawList hash the length of the list as 64-bit little endian number followed by hashing all elements in order. For other types you will have to look at the StableHash implementation or just ignore the respective TyKind variant if the rest of the variants give you enough freedom in picking a hash input with the properties you need for a hash collision.

@briansmith
Copy link
Author

How hard would it be to create a program that takes a source code representation of a $ty and produces the bytes that are fed into SipHasher? I am not familiar with the compiler enough to know if it is feasible to extract a small bit of code from the compiler or re-implement to produce such a thing.

@bjorn3
Copy link

bjorn3 commented Jun 20, 2024

If you ignore DefPathHash it should be reasonably straight forward if a bit tedious to write code that acts the exact same way. For DefPathHash (which is used for extern types as well as closures, generators, and functions definitions) the answer is less straight forward. The DefPathHash consists of two halves. The first is the StableCrateId, a 64-bit hash of the crate name, if the crate is a library or not, all -Cmetadata arguments (after sorting and deduplicating) and the rustc version string. This is generated at https://github.com/rust-lang/rust/blob/cb8a7ea0ed866295e0f65725cea6662bea51971a/compiler/rustc_span/src/def_id.rs#L146-L189 Rustc also checks that no two crates with the same StableCrateId are loaded at compile time, which ensures it is unique at runtime too if you ignore dylibs. The other half of the DefPathHash is a hash which identifies a definition within a single crate. Rustc enforces this uniqueness too. It is created across https://github.com/rust-lang/rust/blob/cb8a7ea0ed866295e0f65725cea6662bea51971a/compiler/rustc_hir/src/definitions.rs (search for DefPathHash::new)

@briansmith
Copy link
Author

Another motivating example: rust-lang/cargo#6529 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants