
[WIP] mtime+content tracking #8623

Closed. Wants to merge 41 commits.
Conversation

@gilescope (Contributor) commented Aug 16, 2020

This adds the unstable option -Zhash-tracking, which uses content hashes and file sizes, as well as mtimes, to prevent compile cascades when mtimes are messed with but the content is unchanged (fixes #6529). This paves the way towards fixing #5918.
It works even better with -Zbinary-dep-depinfo.

It depends on rust-lang/rust#75594 with the following enabling rustc changes:

  • .rmeta now carries the SVH (and the Rust version) in its uncompressed header.
  • .rlib previously contained a lib.rmeta file; this is now renamed to lib-{crate-name}-{svh}.rmeta.
  • dylibs now carry an additional SVH symbol.
  • .d dependency files now contain the expected file size and hash for each input.

This is almost zero-cost when mtimes are up to date (rustc already hashed everything it needed to, but didn't expose this to cargo). Cost is only incurred when the mtime check fails, and typically the file-size check means we can skip the more expensive hash check.

Worst case: if a file's mtime is out of date and the content differs but the size is unchanged, cargo takes slightly longer to detect this before dropping back to a full rebuild.
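To make the intended check order concrete, here is a minimal sketch of the cascade (not the actual implementation in this PR; all names are made up):

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Sketch only: the `expected_*` values would come from the dep-info rustc
/// emits; `hash_file` stands in for whatever content hash we settle on.
fn is_fresh(
    path: &Path,
    expected_mtime: SystemTime,
    expected_size: u64,
    expected_hash: u64,
    hash_file: impl Fn(&Path) -> io::Result<u64>,
) -> io::Result<bool> {
    let meta = fs::metadata(path)?;
    // 1. Cheapest: a matching mtime is trusted, exactly as today.
    if meta.modified()? == expected_mtime {
        return Ok(true);
    }
    // 2. Next: a size mismatch proves the content changed, no hashing needed.
    if meta.len() != expected_size {
        return Ok(false);
    }
    // 3. Only now pay for the content hash.
    Ok(hash_file(path)? == expected_hash)
}
```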

I appreciate that not everyone may be sold on content hashing right now, but it would be great to let them try the functionality on nightly to see whether it makes some people's lives easier in the field.

Testing:

  • Seems to work on OSX: everything stays fresh when moving the target dir, tested with findshlibs (30 crates) and rust-analyzer (200 crates).

Still todo:

  • cache hashes with the mtimes for src files
  • put svh in rmeta
  • put svh in dylib
  • put svh in .rmeta in .rlib
  • read svh in .rmeta in .rlib
  • read svh in dylib
  • read svh in .rmeta
  • support build.rs
  • integrate awesome feedback
  • write additional tests
  • release notes
  • bikeshed the name: -Zhash-tracking, -Zcontent-hash, or merge with binary-dep tracking?

Unresolved questions:

  • Will cross compilation just work?
  • Should the additional crate dependencies be put under a default feature?

@rust-highfive

r? @Eh2406

(rust_highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Aug 16, 2020
Two review threads on src/cargo/core/compiler/fingerprint.rs (outdated, resolved).
@Eh2406 (Contributor) commented Aug 17, 2020

This is a far more ambitious fix than what I had in mind for #6529. I can see advantages to this approach, but I want to make sure I understand the tradeoffs. The current mtime-based handling happens entirely in Cargo; this PR handles hashes using both Cargo and Rustc. What is the advantage of involving Rustc? And given the advantages of the new way, should we move the mtime handling to it as well?

@gilescope (Author)

It turns out that rustc hashes all the source files anyway (to put the hash in the debug info), so if we let rustc do the hashing then we get that initial hash (which I was expecting to be expensive) for free.

@bjorn3 mentioned that we should extend this so that we don't rely solely on mtime checking for binary dependencies either. This seems reasonable, though I've not yet figured out whether there's some cunning way those hashes can also be gained for 'free'. That would seem too good to be true, but rustc might happen to have that data somewhere already.

@gilescope (Author) commented Aug 20, 2020

In terms of trade-offs, so far it looks very positive. If we can get the initial hashes for free, and we only attempt to hash a file when the mtimes differ but the file size is the same, then there's a high probability that the file is identical, and this would be a significant improvement in reducing recompiles.

  • I think the realistic worst case here is that cargo does the additional work of hashing one file before marking a unit dirty, if and only if some input to that unit has a different mtime but the same size.

(I guess the worst worst case is that all files have been touched but only the last file checked was modified in a way that didn't change its size. That's a good point: we should try to dirty the unit cheaply via a file-size mismatch before doing any hash checks, as sketched below. I'm assuming that hashing the input files is always considerably cheaper than doing the compile.)
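A sketch of that ordering (illustrative only; the Input record and verdict type are made up):

```rust
/// Made-up record pairing cheap on-disk facts with recorded expectations.
struct Input {
    mtime_ok: bool,
    size_on_disk: u64,
    expected_size: u64,
}

enum Verdict {
    Fresh,
    Dirty,
    NeedsHashes, // sizes all match; hash only the mtime-suspect files
}

fn cheap_verdict(inputs: &[Input]) -> Verdict {
    // Any size mismatch dirties the unit without hashing anything.
    if inputs.iter().any(|i| i.size_on_disk != i.expected_size) {
        return Verdict::Dirty;
    }
    // All sizes match: hashing is needed only where an mtime check failed.
    if inputs.iter().any(|i| !i.mtime_ok) {
        Verdict::NeedsHashes
    } else {
        Verdict::Fresh
    }
}
```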

@bjorn3 (Member) commented Aug 20, 2020

For binary dependencies there is already the SVH, though it is not stored in a stable place. There are two ways I can think of to get a binary-dependency hash. The first is to define a stable location for the SVH; this would require cargo to include an ar archive and object file reader, though. The second is to hash the whole file while reading the metadata; this would cost performance, as we currently use mmap to avoid reading the parts of binary dependencies (and even parts of the actual metadata) that aren't needed to compile the current crate.

@gilescope (Author)

OK, updated to take advantage of rustc's precomputed SourceFile hashes. Next: time to explore bin hashing.

@gilescope (Author)

I worry that reproducing an SVH would require quite a bit of parsing, compared to md5.

Potentially we could turn the fact that cargo only references parts of files into a benefit. For the parts of the files that cargo does read, would there be things in there very likely to change if the other (unread) parts of those files changed? If so, we could store a hash plus the range(s) of which parts of the file contributed to it, i.e. leave instructions for how to re-create the hash (rough sketch below).
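Something like this, purely illustrative and not a proposed API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs::File;
use std::hash::Hasher;
use std::io::{self, Read, Seek, SeekFrom};
use std::ops::Range;

/// Illustrative only: a hash plus the byte spans that contributed to it,
/// i.e. the "instructions" for re-creating the hash later.
struct PartialHash {
    hash: u64,
    ranges: Vec<Range<u64>>,
}

fn recompute(file: &mut File, ranges: &[Range<u64>]) -> io::Result<u64> {
    let mut hasher = DefaultHasher::new();
    for r in ranges {
        // Hash only the recorded spans, skipping the unread parts.
        file.seek(SeekFrom::Start(r.start))?;
        let mut buf = vec![0u8; (r.end - r.start) as usize];
        file.read_exact(&mut buf)?;
        hasher.write(&buf);
    }
    Ok(hasher.finish())
}

fn still_matches(file: &mut File, recorded: &PartialHash) -> io::Result<bool> {
    Ok(recompute(file, &recorded.ranges)? == recorded.hash)
}
```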

@gilescope (Author) commented Aug 30, 2020

Ah, my bad: we don't need to recompute an SVH, just compare already-created SVHs. As long as we can fish the SVH out of a binary, we're good.

(I've found out that hashes are inserted when linking with /Brepro for Mach-O/Windows, but not for Linux.)

If we embed the SVH then we have a single, consistent mechanism. Comparing SVHs would be much cheaper than hashing files, and as a bonus it fixes rust-lang/rust#73917.

@bjorn3 (Member) commented Aug 30, 2020

You can use the ar crate for reading rlibs and the object crate for reading object files. These are what I use for cg_clif.
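For the rlib side, a minimal sketch with the ar crate (assuming the lib-{crate-name}-{svh}.rmeta member naming proposed in this PR, which rustc does not guarantee today) could fish the SVH out of the member name without reading the member's contents:

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Sketch: scan an .rlib's archive members for the renamed metadata member
/// and extract the SVH embedded in its filename.
fn svh_from_rlib(path: &Path) -> io::Result<Option<String>> {
    let mut archive = ar::Archive::new(File::open(path)?);
    while let Some(entry) = archive.next_entry() {
        let entry = entry?;
        let name = String::from_utf8_lossy(entry.header().identifier()).into_owned();
        // e.g. "lib-mycrate-0123456789abcdef.rmeta" (naming per this PR)
        if let Some(stem) = name
            .strip_prefix("lib-")
            .and_then(|s| s.strip_suffix(".rmeta"))
        {
            if let Some((_crate_name, svh)) = stem.rsplit_once('-') {
                return Ok(Some(svh.to_string()));
            }
        }
    }
    Ok(None)
}
```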

@bjorn3 (Member) commented Sep 5, 2020

> Optimisation: no need to figure out whether bin files are up to date if they have SVH hashes in their filenames.

SVHs are never part of the filename as far as I know.

@gilescope (Author) commented Sep 5, 2020

I might have got the wrong end of the stick, but it looks like the SVH is the hash part in quite a few of the filenames: https://github.com/rust-lang/rust/blob/02fe30971ef397bcff3460a9aaf175e0810c2c90/compiler/rustc_incremental/src/persist/fs.rs#L37. I'm thinking that if the SVH hash is in the dependency's filename, then as long as that file exists it really ought to contain the right contents. If we can rely on that, we only have to worry about the few bin files without hashes in their filenames.

EDIT: I had got the wrong end of the stick; instead, we can put the SVH inside the .rlib, in the lib.rmeta filename.

@bjorn3 (Member) commented Sep 5, 2020

The SVH is only the hash part for incremental compilation directories. In all other cases it is a Cargo-calculated hash, passed to rustc using -Cextra-filename, that is unique for each crate + compilation options pair but does not depend on the source files.

@gilescope gilescope changed the title WIP: mtime+content tracking [WIP] mtime+content tracking Sep 20, 2020
@bors (Collaborator) commented Oct 14, 2020

☔ The latest upstream changes (presumably #8778) made this pull request unmergeable. Please resolve the merge conflicts.

Note that reviewers usually do not review pull requests until merge conflicts are resolved! Once you resolve the conflicts, you should change the labels applied by bors to indicate that your PR is ready for review. Post this as a comment to change the labels:

@rustbot modify labels: +S-waiting-on-review -S-waiting-on-author

@gilescope (Author)

Ah that's better: cx.add_used_global(llglobal); -- that's got me moving again.

@gilescope (Author)

Hmm, now I have more symbols than I can shake a stick at: one for each codegen unit, for each crate. I can create it for just the first codegen unit, but there still seems to be one per crate; maybe I can emit it only for the current crate.

@bjorn3 (Member) commented Oct 25, 2020

For dylibs there is a separate codegen unit for metadata. You can add the symbol there. For executables you can add it to the main shim generation. For rlibs you can add a separate archive member.

@gilescope (Author)

Tests-wise, I think we want to run all those freshness tests with content hashing turned on.

@gilescope (Author)

A rust-analyzer build after moving the target dir (to upset the mtimes) currently comes up fresh (which is great), but takes 6-12s to figure out that everything is fresh. This is much better than the full 2m30s build that would otherwise happen. I suspect that where we do have to hash files, we should be doing it in parallel.
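A sketch of the parallel-hashing idea (rayon and sha2 are used purely as stand-ins; the PR doesn't commit to either here):

```rust
use rayon::prelude::*;
use sha2::{Digest, Sha256};
use std::fs;
use std::path::PathBuf;

/// Sketch: hash all mtime-suspect files in parallel and report the first
/// mismatch found. `expected` pairs each path with its recorded SHA-256
/// digest; an unreadable file is treated as changed.
fn first_changed(expected: &[(PathBuf, [u8; 32])]) -> Option<&PathBuf> {
    expected.par_iter().find_map_any(|(path, want)| {
        match fs::read(path) {
            Ok(bytes) => {
                let got = Sha256::digest(&bytes);
                (got.as_slice() != want.as_slice()).then_some(path)
            }
            Err(_) => Some(path), // missing/unreadable counts as changed
        }
    })
}
```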

@gilescope (Author)

@bjorn3 both the rustc and cargo PRs are back in sync now. Things are slightly twisted on the hashes front; I think reversing the slices will probably untwist it, but what's there is consistent and works for findshlibs. Now that sha256 has been added in (in master), I'm more minded to plump for 256 rather than 512, as that will tidy the code up a little.

```rust
.files
.iter()
.find(|reference| *dep_in == reference.path);
if let Some(reference) = ref_file {
```
@gilescope (Author) commented on the diff:

Bug: there's no else branch here marking the unit as stale.
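A fix would be shaped something like this sketch (StaleItem and the variant name are stand-ins for cargo's internals, not the exact types):

```rust
use std::path::{Path, PathBuf};

/// Stand-in for cargo's stale-tracking type; names here are assumed.
enum StaleItem {
    MissingReference(PathBuf),
}

struct FileRef {
    path: PathBuf,
    // plus recorded size + hash in the real thing
}

fn check_dep(dep_in: &Path, files: &[FileRef]) -> Option<StaleItem> {
    match files.iter().find(|r| r.path.as_path() == dep_in) {
        // Found: compare size/hash here; None means it looks fresh so far.
        Some(_reference) => None,
        // The missing `else`: no recorded entry must mean stale, not fresh.
        None => Some(StaleItem::MissingReference(dep_in.to_path_buf())),
    }
}
```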

@bors (Collaborator) commented Dec 7, 2020

☔ The latest upstream changes (presumably #8954) made this pull request unmergeable. Please resolve the merge conflicts.


@ehuss (Contributor) commented Jul 1, 2021

Ping @gilescope, are you still interested in moving this forward? Is there somewhere you are stuck or need help with?

@gilescope (Author) commented Jul 1, 2021 via email

Sorry, I still intend on breathing life back into this; life is just getting in the way.

@ehuss (Contributor) commented Mar 22, 2022

I'm going to close this due to inactivity. Unfortunately, at this time we don't have the capacity to review major changes like this, but it is still a feature we would love to see happen someday. If you are interested in resuming the work, please reach out to see whether we can accept help with it.

Linked issue: (Option to) Fingerprint by file contents instead of mtime (#6529).