-
-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Caching git heights #499
Caching git heights #499
Conversation
Unsure on the source of the current set of build failures, would like confirmation that these are new/ significant because I can't immediately see how my changes might have caused those exceptions (the failing tests pass locally for me). |
Fixed the build issues by adding some apparently required dependencies. Not sure why, I don't believe I changed the dependency requirements of the test project. |
I tested it on our repository and it does help somewhat. However, fundamentally it still caches on a per-project basis which helps only on incremental rebuilds. If the That said, even this type of cache helps a lot. According to the binary msbuild logs the time spent in |
Hey @filipnavara, thanks for the feedback. I hadn't considered the single |
I believe that we get the same git height for every project (ie. relative to the version.json location) in our scenario. At least it's like that for all the binaries I checked. I cannot tell whether it is intentional behavior or something specifically triggered in our particular configuration. |
After reading the source code it appears to me that the behavior with the GIT height being independent of the project directory is intentional. However, the version.json file could have been moved in the history (ie. from per-project file to top-level or vice-versa). I'd have to think a bit about it and whether it would affect the caching behavior or not. |
Just wanted to chime into this conversation - our use case is that we have a single Git repository with multiple projects, but a single MSBuild logs show that we spend at least 30 seconds in NB.GV during each build. It seems that having a single As a sidenote, we update |
It's particularly expensive because it looks up the location of version.json in each commit where it walks the history which may result in significant disk I/O. This just happens to be expensive and aside from the cache or some more fundamental concept change there's only limited room for improvement. There are ways to optimize it a bit through leveraging the GIT hash IDs of the trees (and making assumptions based on the fact they didn't change) or by using the newly introduced commit graph files to statistically pre-filter based on paths and examine only small subset of commits (likely less than 3%). |
I'd recommend the both of you take some traces with Perfview- this way you'll be able to understand where the costs in your build times are coming from. @filipnavara I think you're right about the costs for some scenarios- the I/O of reading files is expensive. I think that adding caching inside Basically this PR isn't meant to fix the world's ills- it's just one of hopefully many future PRs to improve performance. |
I am not expecting you to widen the scope of this PR. I've already done some of the performance tracing and I was just sharing the results and bottlenecks for my particular use case. That said, if the |
Let me do some testing for the top-level scenario and get back to you. I have a feeling it might be tricky to pull off however. |
A quick GitHub search seems to indicate that most repositories keep their Let me dig a bit deeper and see if I can get some numbers. I'd expect that leveraging the Git hashes of trees could also have a positive impact, let's see if it's quantifiable. @djluck I'd agree that done is better than perfect, and yes further optimizations can be done in separate PRs. |
Interesting, I can see now after reading the documentation that my organization's use is a little atypical. I feel it should be possible to place the |
@djluck If all else fails, you could always consider storing the cache in the Git object storage (using e.g. |
@qmfrederik I'm not so familiar with working directly with git object storage but wouldn't that require commiting the |
Well, the Git database (the |
@filipnavara @qmfrederik I've just pushed a change to support the single version file scenario- I'd be keen to see if this help you guys out. I think it needs a bit more testing but in a quick local test it seemed to work fine for me. |
@djluck Thanks, I ran a couple of builds and these changes reduce the time spent in the
|
I've tried to do the same thing as Frederik and here are my numbers. We are currently at height 105. My previous numbers were before a branching out happened (height ~3500, 6+ minutes).
|
Oh, wow. I should add that I was at a much lower height (~10, I think). The solution has ~25 projects. |
Btw, there seems to be problem when the tasks are run in parallel on our build machine:
|
It gets much worse with the height and the repetitive |
Thanks for testing guys, good to know it's effective for your use case
I expected this might be an issue- I think the right thing to do is to silently swallow the exception. |
Yeah, that would likely work. Alternatively there could be some retry logic (some other MSBuild tasks do that although I don't have any specific example). I can test it if you come up with something, or prototype it and test on our machines. |
I doubt that would be of any use in the failed write case. Whoever is writing to the file is going to write the same content that the failed code wanted to write, right? 😏👉 |
Caching the height associated with the commit and the location of the version.json file would lead to faulty cache hits when path filters are in use. I'm not opposed to storing this data in the .git directory, but I am a little concerned about the idea. I think I like it better than polluting the source tree with cache files (as we'd have to do if we were to store it next to the version.json file). But I think it's irrelevant anyway because given path filters, different projects can get different heights computed even from the same version.json file. |
I'm reviewing this PR now. I may push some changes to it so it can merge sooner. |
Also: regarding repeatedly parsing... I thought I already had a cache so that if the version.json blob was the same hash we'd reuse a prior parse result. If that's not there, I totally support adding it. |
After reading more about git objects (thanks for the suggestion, @qmfrederik), I think that's a better place to store the cache file than next to the version.json file. I had originally planned/expected to put the cache file in the project's intermediate output directory. @djluck After reviewing your changes, I'd like to try implementing in a different way that I think will allow more accurate and effective caching, and possibly fit in with the existing code with fewer changes -- I'm not sure. I'm going to try it today. Whether I take this PR or not, please know your efforts are appreciated and valuable. Same with those who have tested your PR for perf improvements. We'll either take this PR or a PR that is at least as good, and having those perf measurements really helped to validate that this is a useful thing to do. |
Thinking more about the git object database, I'm having second thoughts. Since it's a content addressable store, I can't store a value keyed by something else. It's only keyed by the hash of the content, and it's the content I don't know when I need the cached value, so I can't very well predict the hash to look up the content. Now, maybe we could still store it under the .git directory but 'somewhere else'. But I'll weigh that against what I think we can get while storing cache files in other places in build output directories. |
@AArnott thanks for your review and thoughts- I absolutely would not see this PR being thrown away as wasted effort and as it's my first attempt at contributing to this code base understand there are probably better ways to accomplish this PR's aims. If you want to use this PR and want to ignore the "optimize for a single version.json" use case, reverting the last commit in this PR should get us there. Let me know if there's anything I can do to help. |
I think this should be safe enough if you associate the cache location with the bottom-most version.json file (which is what this PR aims to do). In the worst case, you'll be leaving some perf on the table by not making use of a cache further up the hierarchy (if paths filters are not in use) but assuming path filters are in use, the cache should be correct. |
Thank you for your spirit, @djluck.
Yes, I think the single version.json file case is totally solved by #508 (provided a simple opt-in by the consuming repo). So I think I'd like to focus the remaining perf work on caching that does not require opt in and thus must preserve the same version height result in every case. |
@djluck Given the above discussion it looks like I may have dropped the ball. Did you close the PR because I took too long, or because you feel it's no longer worth reviewing? |
@AArnott this wasn't intentional- I originally pushed this change to |
GitHub doesn't let you change the source branch of a PR, so you'll have to open a new one. |
Summary
Adds caching of git heights. As calculating the height of repositories with large volumes of commits is expensive, caching the git heights can save time in the following circumstances:
GetBuildVersion
msbuild task- e.g. duringdotnet pack
'ing a project, theGetBuildVersion
task is invoked four timesImplementation details
When can cached results be used?
Opting out
The caching behavior is enabled by default but can be opted-out by setting the new
NerdbankGitVersioningUseHeightCache
msbuild property tofalse
Cache file
A
version.cache.json
file is created per-project with contents that look like this:Testing
Automated testing
GitHeightCache
can serialize + deserialize heights correctlyManual testing
Consumed locally-packed version of
Nerdbank.GitVersioning
in a C# project and added ~1500 commits, verifying:NerdbankGitVersioningUseHeightCache
property tofalse
bypasses the caching behavior.