Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Representation of and operations on pointers and usize #251

Open
mahkoh opened this issue Oct 11, 2020 · 37 comments
Open

Representation of and operations on pointers and usize #251

mahkoh opened this issue Oct 11, 2020 · 37 comments
Labels
C-support Category: Supporting a user to solve a concrete problem

Comments

@mahkoh
Copy link

mahkoh commented Oct 11, 2020

I have a few questions regarding usize and raw pointers for sized T:

  1. Is usize guaranteed to have the same layout as one of the uN?
  2. Is usize guaranteed to have no padding bits?
  3. Is usize as *mut T as usize the identity function on usize?
  4. Is usize as *mut T guaranteed to be the same as transmute::<usize, *mut T>?
  5. Is ptr::read_unaligned::<*mut T> safe for all arguments for which ptr::read_unaligned::<[u8; sizeof(*mut T)]> is safe?
  6. Is usize-literal as *mut T guaranteed to have no special behavior? (e.g. for 0usize)

The current documentation

does not answer these questions afaict.

@RalfJung
Copy link
Member

RalfJung commented Oct 11, 2020

Is usize guaranteed to have no padding bits?

Yes, all our integer types are.

Is usize as *mut T as usize the identity function on usize?
Is usize as *mut T guaranteed to be the same as transmute::<usize, *mut T>?

These are contingent on details of pointer provenance and how it interacts with integer-pointer cats... so the answer is "we don't know". FWIW, that is also the answer for LLVM IR, and likewise in C/C++. (See this for some of the recent work on the C/C++ side, but I hope we can find something cleaner for Rust.)

The opposite direction, *mut T as usize, will most likely not be equivalent to a transmute -- in fact so far there is no good proposal on the table for how to make the transmute not UB, which is rather unsatisfying. (This paper is one of the most recent proposals.)

Is ptr::read_unaligned::<*mut T> safe for all arguments for which ptr::read_unaligned::<[u8; sizeof(*mut T)]> is safe?

No, certainly not. For example, if T := bool, then the former is UB is the value you are loading is not a valid bool.

Is usize-literal as *mut T guaranteed to have no special behavior? (e.g. for 0usize)

Yes.

@RalfJung RalfJung added the C-support Category: Supporting a user to solve a concrete problem label Oct 11, 2020
@mahkoh
Copy link
Author

mahkoh commented Oct 11, 2020

Thanks for the links I'll look at them later.

Is ptr::read_unaligned::<*mut T> safe for all arguments for which ptr::read_unaligned::<[u8; sizeof(*mut T)]> is safe?

No, certainly not. For example, if T := bool, then the former is UB is the value you are loading is not a valid bool.

ptr::read_unaligned::<*mut T> produces a *mut T not a T.

IIRC, pointers actually have trap representations on certain architectures in C. Since raw pointers in Rust are somewhat relaxed compared to C pointers, can we guarantee that any bit pattern is a valid raw pointer in Rust?

@RalfJung
Copy link
Member

RalfJung commented Oct 11, 2020

ptr::read_unaligned::<*mut T> produces a *mut T not a T.

Oh sorry, I misread. So regarding your question then, I think the answer is yes -- everything that's a valid integer will also be a valid pointer. The other way around might not be true though, that is again about pointer provenance.

But really the proper answer is that Rust doesn't have a spec yet that can answer such detailed questions, sorry. :/

Since raw pointers in Rust are somewhat relaxed compared to C pointers, can we guarantee that any bit pattern is a valid raw pointer in Rust?

No -- see #71 for the related discussions on validity of integers. ("Trap representations" are not a thing in Rust, but we have a related concept of "invalid values".) If uninitialized integers are invalid, then uninitialized raw pointers will also be invalid.

@digama0
Copy link

digama0 commented Oct 11, 2020

The opposite direction, *mut T as usize, will most likely not be equivalent to a transmute -- in fact so far there is no good proposal on the table for how to make the transmute not UB, which is rather unsatisfying. (This paper is one of the most recent proposals.)

Forgive my naivete, but what are the blockers behind both directions being trivial? If raw pointers are just treated as integers, then these casts become trivial to specify, and all the interesting work goes into &mut T -> *mut T and *mut T -> &mut T instead.

@RalfJung
Copy link
Member

RalfJung commented Oct 11, 2020

Pointers in LLVM are not integers, so we cannot make them integers in Rust either. Potentially we could compile raw pointers differently, not using LLVM pointers, but I am not sure how well that works (and LLVM semantics in this area are so unclear, not to say buggy, that there is no telling if that would actually help).

@RalfJung
Copy link
Member

RalfJung commented Oct 11, 2020

Oh also, Rust functions like offset and wrapping_offset clearly show that the intent in Rust is for raw pointers to carry provenance. The goal, as far as I understand it, is that raw pointers in Rust are like pointers in C -- and those are complicated.

I like the idea of only references having provenance, but I think it is unfortunately unrealistic. It would be good to have more data on this though, like numbers for what the perf impact would be if we used LLVM integers to compile Rust raw pointers. Also see these provenance-related discussions in the UCG.

@digama0
Copy link

digama0 commented Oct 11, 2020

I will grant that pointers are complicated, but it seems that pointer provenance already gets wrapped up in discussions about "what's in a byte" even when only talking about integer types. So it seems like we can just make raw pointers exactly as "provenant" as usize and then the cast is again trivial, notwithstanding the unsolved problem of exactly how much provenance to carry around through both raw pointers and integers.

@RalfJung
Copy link
Member

RalfJung commented Oct 11, 2020

I am working under the assumption that usize does not have provenance. This assumption is deeply wired into many optimizations, e.g. when using if x == y { ... as indicator that we can replace x with y inside the conditional. (Even if they are equal integers, their provenance might differ.) Moreover, you lose some arithmetic properties, depending on how exactly provenance propagates through arithmetic -- you lose associativity and/or commutativity, x * 0 is not the same as 0, ...

The big advantage of pointers is that their only "arithmetic" is "add an integer", so it is easy to say what happens to provenance: the result has the same provenance as the left input (the only pointer input). But provenance needs to end when pointers are cast to integers, and something needs to be done about pointers being transmuted to integers. The easiest option would be to declare it UB, and C's strict aliasing even almost does that. It could also behave like uninitialized integers, but it is unclear if that is any better.

This C++ proposal has more detail; and here is even more material.

So it seems like we can just

Quite a few people have thought about these problems for years; there does not seem to be an easy solution that we can "just" use. ;)

@digama0
Copy link

digama0 commented Oct 11, 2020

What is the reason that all this mess has to be in the spec instead of just in the compiler though? SB already gives a semantics that tracks where you are or are not allowed to write, and once you are in raw pointer land most of the value tracking turns off. If the compiler can infer some provenance information, great, but the spec doesn't need to play that game.

Put another way, what is a program that should be UB that SB says is fine, and relies on tracking pointer provenance through raw pointers and/or integers?

This assumption is deeply wired into many optimizations, e.g. when using if x == y { ... as indicator that we can replace x with y inside the conditional.

Isn't there an LLVM bug (that you filed, I think) that does this with pointers?

@sollyucko
Copy link

Is usize guaranteed to have the same layout as one of the uN?

Cases where this might not be true include:

  • eZ80 in ADL mode (24-bit registers and addresses)
  • Possibly segmented x86 (AFAICT 16 bit registers and 21 bit addresses (unless represented as 32 bits using a segment value and offset)?)
  • Motorola 56000 (24/48/56-bit registers, 16-bit addresses?)

@digama0
Copy link

digama0 commented Oct 11, 2020

I would hope that we can at least ensure that it has the same layout pattern as uN where N is the number of bits in the word; that is, if the word size is 24 bits then usize = u24, if such a thing existed. I suppose the only non obvious property is alignment, but that probably has to be specified by the target architecture in any case.

@RalfJung
Copy link
Member

RalfJung commented Oct 11, 2020

What is the reason that all this mess has to be in the spec instead of just in the compiler though?

Because so far nobody has been able to propose a semantics that can hide this from the spec, but still do all the desired optimizations and at the same time provide pointer-integer casts.

If the compiler can infer some provenance information, great, but the spec doesn't need to play that game.

It does, though. Stacked Borrows relies on this provenance information, and that effect cannot be entirely hidden from the semantics in a language with pointer-integer casts.

Put another way, what is a program that should be UB that SB says is fine, and relies on tracking pointer provenance through raw pointers and/or integers?

Stacked Borrows "gives up" on raw pointers (treating them as having much less provenance -- but there is still some provenance left, namely to track the allocation ID that the pointer originally pointed to), but that is intended as a temporary stepping-stone. #86 contains some examples of optimizations that we want to do, and that LLVM does, but that SB currently fails to support because it forgets provenance for raw pointers.

I already mentioned offset as a function whose UB can only be modeled if raw pointers have provenance.

This assumption is deeply wired into many optimizations, e.g. when using if x == y { ... as indicator that we can replace x with y inside the conditional.

Isn't there an LLVM bug (that you filed, I think) that does this with pointers?

Do you mean this one? Yes, LLVM does this with pointers and that's wrong. The goal is to make it wrong for pointers but right for integers.

@mahkoh
Copy link
Author

mahkoh commented Oct 13, 2020

@RalfJung

The goal, as far as I understand it, is that raw pointers in Rust are like pointers in C

I do not believe so. I've taken a quick look at annex J (unspecified/undefined/implementation-defined behavior) of C11:

Operation C Pointer Rust raw pointer Rust reference
The value of a pointer to an object whose lifetime has ended is used 👍
Conversion between two pointer types produces a result that is incorrectly aligned 👍
Addition or subtraction of a pointer into, or just beyond, an array object and an integer type produces a result that does not point into, or just beyond, the same array object 👍 (wrapping_add) n/a
Addition or subtraction of a pointer into, or just beyond, an array object and an integer type produces a result that points just beyond the array object and is used as the operand of a unary * operator that is evaluated 👍 (wrapping_add, if the pointer is valid) n/a
Pointers that do not point to the same aggregate or union (nor just beyond the same array object) are compared using relational operators 👍 n/a
The value of a pointer that refers to space deallocated by a call to the free or realloc function is used 👍
An object has its stored value accessed other than by an lvalue of an allowable type (strict aliasing) 👍 👍

People who want to write unsafe code often consult the C standard to see what works and what doesn't work. I believe it would be good to have an informal document that goes through the C standard and compares C semantics to the current Rust semantics.

Besides the differences regarding pointers, another example are unions which do not have an active field in Rust and can therefore be used to implement transmute.


PS: I haven't yet have time to look at your links.

@RalfJung
Copy link
Member

RalfJung commented Oct 13, 2020

wrapping_add is just an operation that Rust has that C could have but doesn't -- we also have add/offset corresponding to the usual C-style rules for pointer arithmetic. But I take your other points. Rust raw pointers are more relaxed than C's.

The fact remains that both offset and wrapping_offset just make no sense for pointers without provenance. The only reason these operations exist is because they preserve provenance, and the people introducing them considered it important that LLVM has that information available for its optimizations. That is why I am working under the assumption that raw pointers should have provenance.

Note that raw pointers not having provenance does not really solve any of the hard problems, it just moves them around. Everything that is currently tricky about casting and transmuting between raw pointers and integers, then becomes tricky about casting and transmuting between raw pointers and references. We have to figure out answers to these questions anyway, we have to find some good way to handle the "boundary" between what has provenance and what does not. Where we put that boundary is mostly orthogonal.

@digama0
Copy link

digama0 commented Oct 14, 2020

Actually, those two functions seem like the main argument for keeping provenance, and arguing in this way is very clearly letting LLVM design decisions "leak" into Rust, because we originally added the function because LLVM has it, but LLVM has weird semantics for it so now we have to support their weird semantics.

Currently, offset and wrapping_offset are UB if the provenance is wrong, e.g. you go out of bounds somewhere. So it would be sound to make them defined, i.e. make them be a plain addition of numbers. Can we seriously investigate the cost of doing so (losing the optimizations that GEPi provides here)? Note that we could still use GEPi for accesses using references, so on the assumption that code that directly uses offset (and benefits from this optimization vs general optimizations on mathematical functions) is rare, the performance loss may not be significant.

With numbers on this we could make a more informed judgment about whether it is worthwhile to have two distinct pointer tracking mechanisms in rust (SB and pointer provenance).

@RalfJung
Copy link
Member

RalfJung commented Oct 14, 2020

I am all in favor of gathering the data, but do not have the experience needed to do so. (Note that one would also have to change how (*raw_ptr).field gets compiled, which currently effectively uses offset.)

With numbers on this we could make a more informed judgment about whether it is worthwhile to have two distinct pointer tracking mechanisms in rust (SB and pointer provenance).

Note that SB's pointer tags are a form of pointer provenance, so terminology is a bit mixed up in that sentence. See here for a definition of provenance. C-style "tracking of the original allocation" is just one example of provenance; Stacked Borrows tags are another.

Also, the discussion was about whether raw pointers should have any provenance or not. In your last sentence you are making a totally different statement. I do not know if using SB tags can entirely supplement tracking allocation IDs (for references or raw pointers) -- and even if they can in principle, I do not know if it is possible to do this in a way that is compatible with LLVM. Furthermore, if SB can supplement allocation IDs, then once raw pointers have SB, it could do so there as well! So we could have raw pointers with provenance, and still have only one form of provenance.

To conclude, I see very little connection between the discussion we had before, and "do we need both SB tags and allocation IDs". You made a huge jump there in you reasoning that I think is not backed by arguments. I don't even see having allocation IDs as a problem, they are not particularly complicated.

What is the problem you are trying to solve? All the time it seemed like you wanted to solve the issues around casting between raw pointers and usize, which are caused by raw pointers having provenance. So this has nothing to do with whether we have allocation IDs elsewhere in the semantics or not. Also see my argument above that as long as we have any kind of provenance (SB tags or alloc IDs) anywhere, this problem remains. So even if we only have SB tags and even if raw pointers have no provenance, we still have to solve this problem.

@digama0
Copy link

digama0 commented Oct 14, 2020

Note that SB's pointer tags are a form of pointer provenance, so terminology is a bit mixed up in that sentence. See here for a definition of provenance. C-style "tracking of the original allocation" is just one example of provenance; Stacked Borrows tags are another.

I realize this; but SB pointer tags only operate on references, not raw pointers. As I've tried to say, I would really rather have pointers be integers (perhaps with some kind of undef but otherwise just a bit pattern), because tracking pointer chunks through arbitrary mathematical operations is pretty clearly a doomed enterprise (and one that leaves the spec in a shambles). Having SB for references and C style pointer provenance for raw pointers seems like the worst of both worlds.

I do not know if using SB tags can entirely supplement tracking allocation IDs (for references or raw pointers) -- and even if they can in principle, I do not know if it is possible to do this in a way that is compatible with LLVM.

Hm, this is an interesting thought. My gut says it should be possible, since an SB borrow implies that the reference has the same LLVM style pointer provenance as the source of the borrow, but it deserves a full exploration, probably in an issue of its own.

Furthermore, if SB can supplement allocation IDs, then once raw pointers have SB, it could do so there as well! So we could have raw pointers with provenance, and still have only one form of provenance.

Is this referencing an extension of SB where raw pointers also get tags? I don't have a very good conception of what this would look like, and a suspicion that it is just as hard to get right as C style pointer provenance.

What is the problem you are trying to solve? All the time it seemed like you wanted to solve the issues around casting between raw pointers and usize, which are caused by raw pointers having provenance. So this has nothing to do with whether we have allocation IDs elsewhere in the semantics or not.

What I'm trying to solve is to solve the issues around ptr-to-int casts by declaring raw pointers to be integers with no provenance. References on the other hand are tracked by SB and so have a form of provenance. In my previous remarks I use the term SB for that and reserve "pointer provenance" to refer to C / LLVM alloc ID style provenance, but yes I take your point that SB is also pointer (or rather reference) provenance.

Also see my argument above that as long as we have any kind of provenance (SB tags or alloc IDs) anywhere, this problem remains.

referring to:

Note that raw pointers not having provenance does not really solve any of the hard problems, it just moves them around. Everything that is currently tricky about casting and transmuting between raw pointers and integers, then becomes tricky about casting and transmuting between raw pointers and references. We have to figure out answers to these questions anyway, we have to find some good way to handle the "boundary" between what has provenance and what does not.

Once the boundary is drawn, it seems that the questions mostly answer themselves. If you cast a reference to a pointer, you lose the tag. If you cast a pointer to a reference, you get a reference with "raw" tag, which acts like a pointer until/unless it is retagged. SB already supports these operations, so I'm not sure what the hard part is.

The main "issue" is that by making a choice, this spec is definite enough to determine a set of allowed optimizations, and so we can look at whether our LLVM lowerings are allowed. The most notable non-validated lowerings are offset and wrapping_offset, since if pointers don't have provenance then we can't use GEPi. So all that remains is the empirical question of whether using a more conservative lowering has acceptable performance, and if not, what alternate APIs we can provide to recover the optimizations (or if that's not an option, then it's back to the drawing board with pointer provenance).

@RalfJung
Copy link
Member

RalfJung commented Oct 17, 2020

but SB pointer tags only operate on references, not raw pointers.

As I said repeatedly, that is just the status quo and was never meant to be the final result. But without things like rust-lang/rust#64490, it is hard to write code correctly otherwise.

Also, note that even currently this is only mostly true -- you can transmute a reference to a raw pointer, and then you have a raw pointer with a proper tag. So optimizations cannot assume things like "raw pointers have no tag".

Is this referencing an extension of SB where raw pointers also get tags? I don't have a very good conception of what this would look like, and a suspicion that it is just as hard to get right as C style pointer provenance.

Yes that's what I was referring to. And indeed the problems around ptr-int-casts are similar to the ones in C. But the same is true in the status quo, that just moves the problem from the raw-ptr / int boundary to the reference / raw-ptr boundary.

Once the boundary is drawn, it seems that the questions mostly answer themselves. If you cast a reference to a pointer, you lose the tag. If you cast a pointer to a reference, you get a reference with "raw" tag, which acts like a pointer until/unless it is retagged. SB already supports these operations, so I'm not sure what the hard part is.

If it would be that easy, we could do the same by drawing a clear boundary between raw ptrs and integers and saying that the casts do all the necessary adjustments. But this proposal solves nothing as it leaves the hard question unanswered: what if you transmute across the boundary? That is the unsolved issue around raw ptrs with provenance, and it is just as unsolved when references have provenance but raw ptrs have not.

In fact, we can ignore raw pointers for this. We seem to agree that references have provenance but integers do not. Now, what is the behavior of transmuting a reference to an integer, or (equivalently) doing a *const usize load from a pointer that actually points to a reference? There are only unsatisfying answers proposed so far:

  • We make it UB. Ugh. (This basically means that the validity invariant of integers says "this must be initialized memory without provenance".)
  • We obtain an integer with provenance. That breaks all sorts of arithmetic properties of integers.
  • We strip provenance, i.e., this is an implicit cast. This breaks redundant store elimination:
unsafe fn foo(x: *mut usize) { 
  let val = *x;
  *x = val; // We should be able to remove this store, but if the load strips provenance, we cannot.
}

@digama0
Copy link

digama0 commented Oct 17, 2020

In fact, we can ignore raw pointers for this. We seem to agree that references have provenance but integers do not. Now, what is the behavior of transmuting a reference to an integer, or (equivalently) doing a *const usize load from a pointer that actually points to a reference? There are only unsatisfying answers proposed so far:

  • We make it UB. Ugh. (This basically means that the validity invariant of integers says "this must be initialized memory without provenance".)
  • We obtain an integer with provenance. That breaks all sorts of arithmetic properties of integers.
  • We strip provenance, i.e., this is an implicit cast.

I might be missing all the implications of this, but I would like to say that transmutes also strip provenance. I recall you arguing somewhere that transmutes can't change the value at all, but I forget the details of this, and at least in the simple case it seems like casting a &mut T to a *mut T can just have the same effect as an as-cast, except that a transmute also has recursive effects when transmuting struct types with embedded references.

If this isn't workable, it might also be possible to defer the casting, getting a value with provenance sitting in a struct which nominally is supposed to have a raw pointer or integer value at that position, and the provenance is stripped only lazily, when that field is actually loaded.

This breaks redundant store elimination:

unsafe fn foo(x: *mut usize) { 
  let val = *x;
  *x = val; // We should be able to remove this store, but if the load strips provenance, we cannot.
}

As long as provenance is stripped eagerly, there should be no problems justifying this, the value being written has no provenance but the original value didn't have any either. But if provenance is stripped lazily this is more complex, because this actually has an effect on the memory, assuming that the usize value there was actually a reference with a tag.

One way to sidestep the problem is to have an operation strip_provenance(x: *mut u8) which has the same effect as foo on the RAM state but has no codegen. Then the redundant store elimination pass would instead replace the store with a call to this intrinsic, and it can be cleaned up at a lower level once the shadow state is no longer relevant (i.e. in LLVM).

Originally, I wanted to say that we should be able to just randomly clear provenance in memory whenever we want, but I guess this can make non-UB code UB. What are the arguments for provenance existing in memory? Up until now I've mostly been thinking about reference values having tags, but I guess SB also allows them to be stored in memory, still with tags, in the manner of your "what's in a byte" post. I recognize the need for the borrow stack, but what goes wrong if all stores to memory just turn into 0's and 1's? It seems a bit bizarre that rust has untyped memory yet allows storing provenance in memory like that. (Forgive the naive questions, I haven't been thinking about this problem as long as you so I'm sure my questions and solutions are naive.)

@RalfJung
Copy link
Member

RalfJung commented Oct 17, 2020

I might be missing all the implications of this, but I would like to say that transmutes also strip provenance

That is the third option then. And yes I think you are missing the implications. ;)

nd at least in the simple case it seems like casting a &mut T to a *mut T can just have the same effect as an as-cast

Do you mean "transmuting a &mut T to a *mut T"?

FWIW, most of this discussion is already carried out in this paper that I referenced above. Ctrl-F "Type Punning". Since you seem interested in this topic, I think it would be a good idea to do some background reading to learn about the options people already thought about and why they do not work. :)

As long as provenance is stripped eagerly, there should be no problems justifying this, the value being written has no provenance but the original value didn't have any either.

I don't know what "eager provenance stripping" is... are you saying casting a reference should affect the memory it points to? That has no chance at all to work, it means casts have side-effects so they cannot be reordered any more. They also cannot be removed any more even if their result is unused as that would remove their side-effects.

This is the same reason that this earlier model on int-ptr-casts does not work: casts must be pure operations not affecting memory and not depending on memory, otherwise optimizations around casts are too severely limited.

What are the arguments for provenance existing in memory?

When you have a value x, store it to memory, and then load it back again, I hope we agree that you should get the same value. If you don't do this you just killed store forwarding which is a tremendously important optimization.

@digama0
Copy link

digama0 commented Oct 17, 2020

I don't know what "eager provenance stripping" is... are you saying casting a reference should affect the memory it points to? That has no chance at all to work, it means casts have side-effects so they cannot be reordered any more. They also cannot be removed any more even if their result is unused as that would remove their side-effects.

You are right, this doesn't work.

When you have a value x, store it to memory, and then load it back again, I hope we agree that you should get the same value. If you don't do this you just killed store forwarding which is a tremendously important optimization.

Yes and no. You get back a "weaker" version of the value, about which you can prove less optimizations, but you can still do store forwarding. I guess it's equivalent to casting your reference to an integer as you write it and back to a reference when you read it, losing the tag. And here also you can use an intrinsic operation to strip provenance at the value level: that is,

fn foo(ptr: *mut &usize) {
  let x = 3;
  let y = &x;
  *ptr = y;
  bar(*ptr);
}

with store forwarding becomes

fn foo(ptr: *mut &usize) {
  let x = 3;
  let y = &x;
  *ptr = y;
  bar(strip_provenance(y));
}

where strip_provenance(y) is equivalent to y as *const T as &T, or perhaps laundering through usize if *const T preserves provenance. Certainly if we call bar(**ptr) we have enough here to optimize it to bar(3) so I wouldn't say that store forwarding is killed.

@RalfJung
Copy link
Member

I am very doubtful that these strip_provenance operations will not be huge optimization barriers everywhere. At this point everything we can say about this semantics is highly speculative.

Also this makes "the content of a local variable" something special, but it really is not -- y in your foo is also just a name for some address in memory where the value of y is stored. (This is witnessed by the fact that you can take a pointer to y.) But of course we want y to remember its provenance. So you end up with two kinds of memory, some that have provenance and some that do not... and the memory that has provenance (where local variables are stored) still has all the same problems as before.

@mahkoh
Copy link
Author

mahkoh commented Oct 17, 2020

The opposite direction, *mut T as usize, will most likely not be equivalent to a transmute -- in fact so far there is no good proposal on the table for how to make the transmute not UB, which is rather unsatisfying. (This paper is one of the most recent proposals.)

I've read this paper now. In 4.9 loading an usize from a pointer-to-pointer was defined to yield poison. I assume this is also what you are referring to here.

This seems rather strange since the safe-transmute RFC will most likely make transmuting *mut T to usize (and hence &*mut T to &usize) safe. At least I would be very surprised if it didn't.

@RalfJung
Copy link
Member

This seems rather strange since the safe-transmute RFC will most likely make transmuting *mut T to usize (and hence &*mut T to &usize) safe. At least I would be very surprised if it didn't.

Indeed, it is rather unsatisfying. But the other alternatives on the table are arguably worse.

The safe-transmute RFC should probably hold back on explicitly blessing such transmutes...

@mahkoh
Copy link
Author

mahkoh commented Oct 17, 2020

I suggest you make a comment to that effect there. I do not think many people there are aware of these issues.

@Diggsey
Copy link

Diggsey commented Oct 17, 2020

What are the problems with this model:

  • All memory locations have both a value and optionally some provenance information.
  • Provenance is semantically a set of allocation IDs from which the pointer is derived.
  • Provenance is defined in such a way that deleting it can never change the behaviour of a well-defined program. (ie. no provenance is equivalent to the pointer being tagged with all possible allocation IDs)
  • All pointer/reference operations operate on both the value and the provenance.
  • All integer operations ignore provenance, and all integer writes delete any provenance information at the target memory location.
  • When two pointers are compared, each pointer gains the provenance of the other, so if the first had provenance {A, B} and the second {B, C} then after comparison both pointers have provenance {A, B, C}. (The one exception to this could be that in the "not equal" branch of an exact comparison, we learn nothing about the pointers and so their provenance can remain unchanged)

When store forwarding replaces an integer load/store, it must delete any provenance information at the target memory location instead of removing the operation entirely.

@RalfJung
Copy link
Member

RalfJung commented Oct 18, 2020

and all integer writes delete any provenance information at the target memory location.

Depending on what exactly you mean by this, this invalidates dead store elimination:

unsafe fn foo(x: *mut usize) {
  let val = *x;
  *x = val; // deletes provenance in target location
}

If we remove the store, that means the transformed program has more provenance information, which can introduce UB into a previously UB-free program.

All integer operations ignore provenance,

What does it mean to "ignore" provenance? If you have two integers a and b with provenance A and B respectively (they might have provenance because they got transmuted from a reference), what is the provenance of a+b? If the answer is "none", note that this proposal also invalidates turning x+0 into x, because the former has no provenance but the latter inherits x's provenance.

Provenance is defined in such a way that deleting it can never change the behaviour of a well-defined program. (ie. no provenance is equivalent to the pointer being tagged with all possible allocation IDs)

There is no proposal for an aliasing model yet that actually satisfies this condition. If you have a valid safe Rust program and delete the provenance (Stacked Borrows tag) of some reference, it can become UB.

@Diggsey
Copy link

Diggsey commented Oct 18, 2020

If we remove the store, that means the transformed program has more provenance information, which can introduce UB into a previously UB-free program.

The dead store cannot be removed from the IR, but at some point you're going to be taking the IR to actual machine code, at which point those stores can be removed. (Since provenance doesn't actually exist on real hardware)

what is the provenance of a+b? If the answer is "none", note that this proposal also invalidates turning x+0 into x, because the former has no provenance but the latter inherits x's provenance.

If x is an integer then turning x+0 into x is fine, because storing the integer result would still delete any provenance information at that memory location.

There is no proposal for an aliasing model yet that actually satisfies this condition. If you have a valid safe Rust program and delete the provenance (Stacked Borrows tag) of some reference, it can become UB.

Surely as soon as FFI is involved you're going to have to deal with pointers whose provenance it unknown. For example, let's say there's a C function identity(ptr) that just returns the pointer you passed in, but which the compiler has no insight into.

It would be suprising to me if I could cause a program to exhibit UB simply by converting uses of ptr to use identity(ptr) instead.

@digama0
Copy link

digama0 commented Oct 18, 2020

If x is an integer then turning x+0 into x is fine, because storing the integer result would still delete any provenance information at that memory location.

Storing the result would be fine, but what if you instead take the result and transmute it back to a reference and use it? You could get into a situation where it is not UB if the provenance is erased by the x+0 operation, but simplifying to x would bring back the provenance and if it is bad for the operation then UB could result, so the optimization is introducing UB. (As I've mentioned, I think you need a strip_provenance intrinsic to represent such stripped values in order to safely perform the optimization.)

@RalfJung
Copy link
Member

RalfJung commented Oct 18, 2020

The dead store cannot be removed from the IR, but at some point you're going to be taking the IR to actual machine code, at which point those stores can be removed. (Since provenance doesn't actually exist on real hardware)

Sure, but that will likely still be a problem for later optimization passes on the IR -- "this piece of code does not write to memory" is a useful property.

If x is an integer then turning x+0 into x is fine, because storing the integer result would still delete any provenance information at that memory location.

As @digama0 already said -- but what if I transmute the integer to a pointer? Then transmute(x) would have provenance but transmute(x+0) would not.

Surely as soon as FFI is involved you're going to have to deal with pointers whose provenance it unknown. For example, let's say there's a C function identity(ptr) that just returns the pointer you passed in, but which the compiler has no insight into.

In current Stacked Borrows, when a reference is cast to a pointer, the relevant memory gets marked as "accessible to pointers with unknown provenance". I hope to move this change to ptr-to-int casts at some point which probably better models LLVM behavior, but as this discussion shows, LLVM behavior is unclear.

It is absolutely imperative that pointers with unknown provenance can not access memory that an &mut points to -- every aliasing-based optimization needs this. But from this it follows that if we just forget the provenance of an &mut, we will definitely have UB. Any non-trivial aliasing model therefore necessarily violates your condition that provenance can always be forgotten without introducing UB. (FWIW, the same is true for a model of C provenance that accounts for restrict.)

@mahkoh
Copy link
Author

mahkoh commented Oct 18, 2020

It is absolutely imperative that pointers with unknown provenance can not access memory that an &mut points to

I haven't read the rest of the conversation but I'm sure that this is not the case in current rust code if we apply the provenance rules from your paper. &mut -> *mut -> usize -> *mut -> &mut happens all over the place in unsafe code.

@Diggsey
Copy link

Diggsey commented Oct 18, 2020

It is absolutely imperative that pointers with unknown provenance can not access memory that an &mut points to -- every aliasing-based optimization needs this. But from this it follows that if we just forget the provenance of an &mut, we will definitely have UB.

So this is UB?

extern "C" {
    fn identity(x: &mut i32) -> &mut i32;
}

fn bad(x: &mut i32) {
    *identity(x) = 42;
}

@RalfJung
Copy link
Member

RalfJung commented Oct 18, 2020

&mut -> *mut -> usize -> *mut -> &mut happens all over the place in unsafe code.

Indeed, and that's why -- quoting from my previous message -- "when a reference is cast to a pointer, the relevant memory gets marked as 'accessible to pointers with unknown provenance'".

So this is UB?

It is UB if it forgets provenance. But then really you should not call it identity. If, on the Abstract Machine level, it truly is an identity function, then provenance is preserved.

Defining precisely how FFI works is really complicated (and I am not an expert, I know just enough to avoid the topic), so I don't think we want to go there. I also don't see how it helps the discussion. If you want to talk about linking and FFI precisely, you need a shared memory model of the two sides that get linked -- so if the linking happens while provenance still exists, both sides need to be aware that there is provenance, and define what happens to it. If the linking happens on the assembly level, there is no provenance and thus no provenance-related problems.

We have to assume that C code does not have any capabilities to affect Rust code in a way that goes beyond what Rust code can do (otherwise, sound linking is simply impossible -- every way in which the foreign code could affect the Rust Abstract Machine needs to be explicitly account for in said Abstract Machine). So any counterexample you have in mind, you should be able to write entirely in Rust. FFI is a red herring.

@Diggsey
Copy link

Diggsey commented Dec 14, 2020

Sure, I was just using FFI as an example of a black-box, something into which the compiler has no visibility. It sounds like you're saying that in these cases you need to mark the memory as "accessible to the whole program". If so, that's what I meant by "forgetting the provenance" of a pointer.

@RalfJung
Copy link
Member

It sounds like you're saying that in these cases you need to mark the memory as "accessible to the whole program".

Not really... that's just "an arbitrary sequence of Abstract Machine instructions, but we don't know which". However we know which instructions the Abstract Machine has, so we still can put some bounds of what that arbitrary code might do.

So, the compiler has to assume that any value passed to that black box is available via some global variable to everything else. But on the Abstract Machine level, there's no "forgetting the provenance" happening here.

@Diggsey
Copy link

Diggsey commented Dec 15, 2020

So, the compiler has to assume that any value passed to that black box is available via some global variable to everything else. But on the Abstract Machine level, there's no "forgetting the provenance" happening here.

How is that a meaningful distinction?

If (in Rust, and under the "integers don't have provenance but pointers do" rules) I cast a pointer to an integer and back. It would seem to me to be exactly equivalent to passing the pointer into a black-box function and receiving a new pointer back.

The black-box example must be as weak as the cast, since the black-box could by definition do anything (including cast).
The cast example must be as weak as the black-box, since integers do not have provenance, and so we know nothing about where the pointer originated.

Finally, we know that replacing x with black_box(x) must never introduce UB, because that would be an insane rule for programmers to reason about (since the programmer may know that black_box really is the identity function). Therefore, since black_box(x) results in the loss of provenance information, it is impossible for the loss of provenance information to result in UB.

In both cases, we still know that it cannot alias with some other pointers (which we know did not escape to some global variable, or be casted to an integer). This is all that I mean by saying the provenance is unknown.

It does not seem necessary to me to pick in advance and say that "all pointers have provenance and all integers do not". We can say that we have provenance information for some pointers and for some integers. We can define that (most) operations on integers return a new integer without provenance. We can definite at the IR level an operation (like black-box) which destroys that provenance information, and we can define optimizations (like the equality optimization) for both integers and pointers, as long as that optimization first wraps the inputs in black_box() - converting them to pointers or integers without provenance information.

Admittedly, this replacement will have a cost (since provenance is useful for other optimizations) but that cost can be weighed as part of the trade-off in applying the optimization. It's just the same old "what order do we apply optimizations in" problem. At least the optimizations are valid whatever the order.

@RalfJung
Copy link
Member

RalfJung commented Dec 16, 2020

How is that a meaningful distinction?

I think there is a huge distinction between what we "bake in" to the Abstract Machine (and implement in Miri), and the reasoning principles that follow from that.

If (in Rust, and under the "integers don't have provenance but pointers do" rules) I cast a pointer to an integer and back. It would seem to me to be exactly equivalent to passing the pointer into a black-box function and receiving a new pointer back.

I cannot see any way in which these two would be "equivalent" -- so, I do not understand what notion of "equivalence" you mean. The standard notion of equivalence on programs is "contextual equivalence", which basically means the two pieces of code are equivalent if you can replace one by the other without changing program behavior in any way that can be observed from the outside. Under this notion, an int-ptr-roundtrip is clearly very distinct from an arbitrary function... the function might e.g. add 3 to the pointer address, which the int-ptr-roundtrip will never do.

It does not seem necessary to me to pick in advance and say that "all pointers have provenance and all integers do not". We can say that we have provenance information for some pointers and for some integers.

No I think that's wrong. In particular, optimizations such as equality-test-based GVN (the first optimization in my blog post) rely on integers never having provenance. If there is even the possibility of a value of integer type having provenance, the optimization is wrong in the general case (the optimization only remains correct if we can prove that the integers being tested here do not have provenance, but of course that will be de-facto impossible in the general case).

One way to view this is that by adding the possibility of integers having provenance, the set of possible values of integer type is increased, so each optimization working on such values now needs to also consider the new values that were added. (This is similar to how the mere existence of LLVM's undef breaks some optimizations.)

But I also lost track of the problem we are trying to solve here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-support Category: Supporting a user to solve a concrete problem
Projects
None yet
Development

No branches or pull requests

5 participants