Caching git heights #499

Closed

@djluck djluck commented Aug 27, 2020

Summary

Adds caching of git heights. As calculating the height of repositories with large volumes of commits is expensive, caching the git heights can save time in the following circumstances:

  • Repeated invocations of the GetBuildVersion MSBuild task: e.g. while running dotnet pack on a project, the GetBuildVersion task is invoked four times
  • Incremental versioning: when a cached height is available for an older commit, it is used to avoid recalculating the entire git height (only the new commits need to be traversed)

Implementation details

When can cached results be used?

  • The base version to calculate the height for is the same as the cached base version
  • The relative path of the cached height matches the requested relative path
  • The commit to calculate the height for is either the same OR a child of the cached commit id
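
A minimal Python sketch of the three rules above (not the PR's C# code). The `CachedHeight` record, the `parents` dictionary standing in for the commit graph, and the function names are all illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CachedHeight:
    """Hypothetical in-memory form of a cached height entry."""
    base_version: str
    height: int
    commit_id: str
    relative_project_dir: str


def is_same_or_descendant(commit: str, ancestor: str, parents: Dict[str, List[str]]) -> bool:
    """True if `ancestor` is `commit` itself or reachable from it via parent links."""
    stack, seen = [commit], set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(parents.get(current, []))
    return False


def can_use_cache(cache: Optional[CachedHeight], base_version: str,
                  relative_project_dir: str, commit: str,
                  parents: Dict[str, List[str]]) -> bool:
    """Apply the three conditions: same base version, same path, same-or-child commit."""
    if cache is None:
        return False
    return (cache.base_version == base_version
            and cache.relative_project_dir == relative_project_dir
            and is_same_or_descendant(commit, cache.commit_id, parents))


# Toy linear history c1 <- c2 <- c3, with a cache entry recorded at c2.
toy_parents = {"c3": ["c2"], "c2": ["c1"], "c1": []}
toy_cache = CachedHeight(base_version="1.0", height=2, commit_id="c2", relative_project_dir="")
```

With this toy history, the cache is usable for `c2` and `c3` but not for `c1`, which is an ancestor of the cached commit rather than a child.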

Opting out

Caching is enabled by default. To opt out, set the new NerdbankGitVersioningUseHeightCache MSBuild property to false.

Cache file

A version.cache.json file is created per-project with contents that look like this:

/*Cached commit height, created by Nerdbank.GitVersioning. Do not modify.*/{"BaseVersion":"1.0","Height":1636,"CommitId":"f30bb9e26af1dc80f5c38f43d9d6a7b4f770bb14","RelativeProjectDir":""}
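
For illustration, a file in this shape (a comment header followed by a JSON object) could be read as sketched below in Python. The `parse_cache_file` helper is hypothetical and is not how the PR reads the file; the sample string is the one shown above:

```python
import json

SAMPLE = ('/*Cached commit height, created by Nerdbank.GitVersioning. Do not modify.*/'
          '{"BaseVersion":"1.0","Height":1636,'
          '"CommitId":"f30bb9e26af1dc80f5c38f43d9d6a7b4f770bb14","RelativeProjectDir":""}')


def parse_cache_file(text: str) -> dict:
    """Strip an optional leading /* ... */ header, then parse the rest as JSON."""
    text = text.lstrip()
    if text.startswith("/*"):
        end = text.find("*/")
        if end == -1:
            raise ValueError("unterminated comment header")
        text = text[end + 2:]
    return json.loads(text)


entry = parse_cache_file(SAMPLE)
```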

Testing

Automated testing

  • Verified GitHeightCache can serialize + deserialize heights correctly
  • Verified caching has a measurable impact on performance for cases when there are many commits to traverse
  • All existing tests pass

Manual testing

Consumed a locally packed version of Nerdbank.GitVersioning in a C# project and added ~1500 commits, verifying:

  • Height caching takes effect on the second build, dramatically decreasing build time (~10s -> ~1s)
  • After adding a single commit, the cached height for the previous ~1500 commits is reused and only the new commit is traversed
  • Setting the NerdbankGitVersioningUseHeightCache property to false bypasses the caching behavior.
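
The incremental case can be sketched in Python as follows. This is a simplified model, not the PR's implementation: it assumes every commit shares the same base version and that a commit's height is one more than the largest height among its parents. The `compute_height` function and the `parents` dictionary are hypothetical:

```python
from typing import Dict, List, Optional, Tuple


def compute_height(commit: str, parents: Dict[str, List[str]],
                   cached: Optional[Tuple[str, int]] = None) -> Tuple[int, int]:
    """Return (height, commits_visited). `cached` is an optional (commit_id, height) pair."""
    memo: Dict[str, int] = {}
    if cached is not None:
        memo[cached[0]] = cached[1]  # seed with the cached result so the walk stops there
    visited = 0
    stack = [commit]  # iterative walk, so long histories do not hit the recursion limit
    while stack:
        current = stack[-1]
        if current in memo:
            stack.pop()
            continue
        pending = [p for p in parents.get(current, []) if p not in memo]
        if pending:
            stack.extend(pending)  # resolve parents first
            continue
        visited += 1
        memo[current] = 1 + max((memo[p] for p in parents.get(current, [])), default=0)
        stack.pop()
    return memo[commit], visited


# Linear history of 1501 commits: c1 <- c2 <- ... <- c1501.
history = {"c1": []}
history.update({f"c{i}": [f"c{i - 1}"] for i in range(2, 1502)})

full = compute_height("c1500", history)
incremental = compute_height("c1501", history, cached=("c1500", full[0]))
```

In this model the first call visits all 1500 commits, while the second call, seeded with the cached height, visits only the one new commit.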

@djluck djluck mentioned this pull request Aug 27, 2020
6 tasks
@djluck
Author

djluck commented Aug 27, 2020

I'm unsure of the source of the current set of build failures. I would like confirmation that these are new/significant, because I can't immediately see how my changes might have caused those exceptions (the failing tests pass locally for me).

@djluck
Author

djluck commented Aug 27, 2020

Fixed the build issues by adding some apparently required dependencies. Not sure why, I don't believe I changed the dependency requirements of the test project.

@filipnavara
Member

filipnavara commented Aug 31, 2020

I tested it on our repository and it does help somewhat. However, fundamentally it still caches on a per-project basis, which helps only on incremental rebuilds. If the version.cache.json file were bound to the version.json file (i.e. for a solution with only a top-level version.json file there would be only one version.cache.json), it would improve performance significantly for us in scenarios beyond the simple rebuild.

That said, even this type of cache helps a lot. According to the binary msbuild logs the time spent in GetBuildVersion task went down from 6 minutes 56 seconds to less than one second.

@djluck
Author

djluck commented Aug 31, 2020

Hey @filipnavara, thanks for the feedback. I hadn't considered the single version.json scenario (at my company we use a version.json per project so this is what I was focusing on optimizing). In this scenario, should all projects have the same git height? I would think that git height would be different for each project still as each project would have a differing number of changes in their child + descendant directories.

@filipnavara
Member

I believe that we get the same git height for every project (i.e. relative to the version.json location) in our scenario. At least it's like that for all the binaries I checked. I cannot tell whether it is intentional behavior or something specifically triggered by our particular configuration.

@filipnavara
Member

After reading the source code, it appears to me that the git height being independent of the project directory is intentional. However, the version.json file could have been moved in the history (i.e. from a per-project file to top-level or vice versa). I'd have to think a bit about whether that would affect the caching behavior.

@qmfrederik
Contributor

Just wanted to chime into this conversation - our use case is that we have a single Git repository with multiple projects, but a single version.json file in the src/ directory. All projects have the same git height / version number. This means that making a change in one project affects all projects. This is our intended behavior (because we release everything as a single product).

MSBuild logs show that we spend at least 30 seconds in NB.GV during each build. It seems that having a single version.cache.json file would help a lot in our use case.

As a sidenote, we update version.json every sprint, so the git height remains relatively low (couple of hundreds of commits). Do we know why calculating git height is so expensive?

@filipnavara
Member

> As a sidenote, we update version.json every sprint, so the git height remains relatively low (couple of hundreds of commits). Do we know why calculating git height is so expensive?

It's particularly expensive because, as it walks the history, it looks up the location of version.json in each commit, which may result in significant disk I/O. This just happens to be expensive, and aside from the cache or some more fundamental concept change there's only limited room for improvement. There are ways to optimize it a bit by leveraging the Git hash IDs of the trees (and making assumptions based on the fact that they didn't change) or by using the newly introduced commit graph files to statistically pre-filter based on paths and examine only a small subset of commits (likely less than 3%).

@djluck
Author

djluck commented Aug 31, 2020

I'd recommend that both of you take some traces with PerfView. That way you'll be able to understand where the costs in your build times are coming from.

@filipnavara I think you're right about the costs for some scenarios: the I/O of reading files is expensive. I think that adding caching inside VersionFile (where we cache version.json files, keyed by the file's SHA) could help here (plus your other ideas around commit graph files).
However, there are other scenarios (like when you have a version.json file per project and a version.json file at the root) where the major costs can be reading and deserializing JSON (GC costs start to stack up here too).
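
To illustrate the idea of caching parsed version.json files by SHA, here is a rough Python sketch. The `VersionFileCache` class, its `read_blob` callback, and the placeholder blob IDs are hypothetical and do not reflect the actual VersionFile code:

```python
import json
from typing import Callable, Dict


class VersionFileCache:
    """Memoize parsed version.json content by the blob's SHA."""

    def __init__(self, read_blob: Callable[[str], str]):
        self._read_blob = read_blob          # hypothetical callback: blob SHA -> file text
        self._parsed: Dict[str, dict] = {}
        self.parse_count = 0                 # counts real reads and parses, for demonstration

    def get(self, blob_sha: str) -> dict:
        if blob_sha not in self._parsed:
            self.parse_count += 1
            self._parsed[blob_sha] = json.loads(self._read_blob(blob_sha))
        return self._parsed[blob_sha]


# Toy data: 500 commits see one version.json blob, then 3 commits see an edited one.
blobs = {"aaa": '{"version": "1.0"}', "bbb": '{"version": "1.1"}'}
blob_per_commit = ["aaa"] * 500 + ["bbb"] * 3

cache = VersionFileCache(blobs.__getitem__)
versions = [cache.get(sha)["version"] for sha in blob_per_commit]
```

Walking 503 commits in this toy setup triggers only two reads and parses, because an unchanged file keeps the same blob SHA across commits.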

Basically this PR isn't meant to fix the world's ills- it's just one of hopefully many future PRs to improve performance.

@filipnavara
Member

> Basically this PR isn't meant to fix the world's ills- it's just one of hopefully many future PRs to improve performance.

I am not expecting you to widen the scope of this PR. I've already done some of the performance tracing and I was just sharing the results and bottlenecks for my particular use case. That said, if the version.cache.json file can be tied to the location of version.json it's something that would likely benefit both me and @qmfrederik within the scope of this PR.

@djluck
Author

djluck commented Aug 31, 2020

Let me do some testing for the top-level scenario and get back to you. I have a feeling it might be tricky to pull off however.

@qmfrederik
Contributor

A quick GitHub search seems to indicate that most repositories keep their version.json file in the root directory or one level deep (e.g. src/version.json), so it may make sense to optimize for this case.

Let me dig a bit deeper and see if I can get some numbers. I'd expect that leveraging the Git hashes of trees could also have a positive impact, let's see if it's quantifiable.

@djluck I'd agree that done is better than perfect, and yes further optimizations can be done in separate PRs.

@djluck
Author

djluck commented Aug 31, 2020

Interesting, I can see now after reading the documentation that my organization's use is a little atypical. I feel it should be possible to place the version.cache.json file next to the version.json file, just might take me a while to figure out how to do this cleanly.

@qmfrederik
Contributor

@djluck If all else fails, you could always consider storing the cache in the Git object storage (using e.g. sha256(commit hash + path to version.json) as the object key). That way, you can even recycle cached versions of the git commit height when switching branches. Not entirely sold on the idea yet, though.
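
As a sketch of the key derivation suggested here, in Python: the `:` separator and the path normalization are assumptions added for illustration, and `cache_key` is a hypothetical helper rather than anything in Nerdbank.GitVersioning:

```python
import hashlib


def cache_key(commit_hash: str, version_json_path: str) -> str:
    """Derive a stable lookup key from a commit hash and the version.json path."""
    # Normalize separators so the same file yields the same key on Windows and Unix.
    normalized = version_json_path.replace("\\", "/")
    return hashlib.sha256(f"{commit_hash}:{normalized}".encode("utf-8")).hexdigest()


key = cache_key("f30bb9e26af1dc80f5c38f43d9d6a7b4f770bb14", "src/version.json")
```

Because the key depends only on the commit and the path, a height cached for a commit could be found again after switching branches and returning to that commit.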

@djluck
Author

djluck commented Aug 31, 2020

@qmfrederik I'm not so familiar with working directly with git object storage, but wouldn't that require committing the version.cache.json file to the repository? That's something I'd like to avoid, and some people may want to .gitignore this file.

@qmfrederik
Contributor

Well, the Git database (the .git folder) is a key-value data store, in which you could store custom objects: https://git-scm.com/book/en/v2/Git-Internals-Git-Objects . You could cache the git commit height in that database. Since these objects are never referenced by commits, you'd never push them to a remote and I think they'd be garbage-collected by git over time. But again, I'm not sure cluttering the git database like that is the right thing to do.

@djluck
Author

djluck commented Sep 2, 2020

@filipnavara @qmfrederik I've just pushed a change to support the single version file scenario. I'd be keen to see if this helps you guys out. I think it needs a bit more testing, but in a quick local test it seemed to work fine for me.

@qmfrederik
Contributor

@djluck Thanks, I ran a couple of builds and these changes reduce the time spent in the GetBuildVersion task by a factor of 2-3:

| Nerdbank.GitVersioning | Build type | GetBuildVersion time |
|---|---|---|
| PR | Full | 4.549s |
| PR | Incremental | 4.297s |
| 3.2.31 | Full | 11.862s |
| 3.2.31 | Incremental | 8.253s |

@filipnavara
Member

I've tried to do the same thing as Frederik and here are my numbers. We are currently at height 105. My previous numbers were before a branching out happened (height ~3500, 6+ minutes).

| Nerdbank.GitVersioning | Build type | GetBuildVersion time |
|---|---|---|
| 3.2.31 | Full | 20.048 s |
| 3.2.31 | Incremental | 18.881 s |
| PR | Full | 1.007 s |
| PR | Incremental | 868 ms |

@qmfrederik
Contributor

Oh, wow. I should add that I was at a much lower height (~10, I think). The solution has ~25 projects.

@filipnavara
Member

Btw, there seems to be a problem when the tasks are run in parallel on our build machine:

C:\BuildAgent\system\dotnet\.nuget\nerdbank.gitversioning\3.3.18-alpha-g2b58c634dd\build\Nerdbank.GitVersioning.targets(72,5): error MSB4018: System.IO.IOException: The process cannot access the file 'C:\BuildAgent\work\29340cd4e4d2bf36\version.cache.json' because it is being used by another process.
C:\BuildAgent\system\dotnet\.nuget\nerdbank.gitversioning\3.3.18-alpha-g2b58c634dd\build\Nerdbank.GitVersioning.targets(72,5): error MSB4018:    at System.IO.FileStream.ValidateFileHandle(SafeFileHandle fileHandle)
C:\BuildAgent\system\dotnet\.nuget\nerdbank.gitversioning\3.3.18-alpha-g2b58c634dd\build\Nerdbank.GitVersioning.targets(72,5): error MSB4018:    at System.IO.FileStream.CreateFileOpenHandle(FileMode mode, FileShare share, FileOptions options)
C:\BuildAgent\system\dotnet\.nuget\nerdbank.gitversioning\3.3.18-alpha-g2b58c634dd\build\Nerdbank.GitVersioning.targets(72,5): error MSB4018:    at System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options)
C:\BuildAgent\system\dotnet\.nuget\nerdbank.gitversioning\3.3.18-alpha-g2b58c634dd\build\Nerdbank.GitVersioning.targets(72,5): error MSB4018:    at System.IO.StreamWriter.ValidateArgsAndOpenPath(String path, Boolean append, Encoding encoding, Int32 bufferSize)
C:\BuildAgent\system\dotnet\.nuget\nerdbank.gitversioning\3.3.18-alpha-g2b58c634dd\build\Nerdbank.GitVersioning.targets(72,5): error MSB4018:    at System.IO.StreamWriter..ctor(String path, Boolean append)
C:\BuildAgent\system\dotnet\.nuget\nerdbank.gitversioning\3.3.18-alpha-g2b58c634dd\build\Nerdbank.GitVersioning.targets(72,5): error MSB4018:    at System.IO.File.CreateText(String path)
C:\BuildAgent\system\dotnet\.nuget\nerdbank.gitversioning\3.3.18-alpha-g2b58c634dd\build\Nerdbank.GitVersioning.targets(72,5): error MSB4018:    at Nerdbank.GitVersioning.GitHeightCache.SetHeight(ObjectId commitId, Int32 height)
C:\BuildAgent\system\dotnet\.nuget\nerdbank.gitversioning\3.3.18-alpha-g2b58c634dd\build\Nerdbank.GitVersioning.targets(72,5): error MSB4018:    at Nerdbank.GitVersioning.GitExtensions.GetVersionHeight(Commit commit, String repoRelativeProjectDirectory, Version baseVersion, Boolean useHeightCaching)
C:\BuildAgent\system\dotnet\.nuget\nerdbank.gitversioning\3.3.18-alpha-g2b58c634dd\build\Nerdbank.GitVersioning.targets(72,5): error MSB4018:    at Nerdbank.GitVersioning.VersionOracle.CalculateVersionHeight(String relativeRepoProjectDirectory, Commit headCommit, VersionOptions committedVersion, VersionOptions workingVersion, Boolean useHeightCaching)
C:\BuildAgent\system\dotnet\.nuget\nerdbank.gitversioning\3.3.18-alpha-g2b58c634dd\build\Nerdbank.GitVersioning.targets(72,5): error MSB4018:    at Nerdbank.GitVersioning.VersionOracle..ctor(String projectDirectory, Repository repo, Commit head, ICloudBuild cloudBuild, Nullable`1 overrideVersionHeightOffset, String projectPathRelativeToGitRepoRoot, Boolean useHeightCaching)
C:\BuildAgent\system\dotnet\.nuget\nerdbank.gitversioning\3.3.18-alpha-g2b58c634dd\build\Nerdbank.GitVersioning.targets(72,5): error MSB4018:    at Nerdbank.GitVersioning.VersionOracle.Create(String projectDirectory, String gitRepoDirectory, ICloudBuild cloudBuild, Nullable`1 overrideBuildNumberOffset, String projectPathRelativeToGitRepoRoot, Boolean useHeightCaching)
C:\BuildAgent\system\dotnet\.nuget\nerdbank.gitversioning\3.3.18-alpha-g2b58c634dd\build\Nerdbank.GitVersioning.targets(72,5): error MSB4018:    at Nerdbank.GitVersioning.Tasks.GetBuildVersion.ExecuteInner()

@filipnavara
Member

> Oh, wow. I should add that I was at a much lower height (~10, I think). The solution has ~25 projects.

It gets much worse with the height and the repetitive version.json lookups and parsing. We have around 133 projects in the solution. Some of them may not be built in a particular configuration. More than half of them are multi-targeting more than one framework though which results in additional dispatching to inner loop MSBuild processes.

@djluck
Author

djluck commented Sep 2, 2020

Thanks for testing guys, good to know it's effective for your use case

> Btw, there seems to be problem when the tasks are run in parallel on our build machine

I expected this might be an issue- I think the right thing to do is to silently swallow the exception.
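
A best-effort write along these lines might look like the Python sketch below. It is only an illustration of swallowing the I/O error: the `try_write_cache` helper is hypothetical, and writing to a temporary file before renaming is an extra assumption that was not proposed in this thread:

```python
import json
import os
import tempfile


def try_write_cache(path: str, payload: dict) -> bool:
    """Write the cache file if possible; report failure instead of raising."""
    directory = os.path.dirname(path) or "."
    tmp_path = None
    try:
        # Write to a temporary file first so readers never see a half-written cache.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_path, path)
        return True
    except OSError:
        # Another process may hold the file, or the directory may be unusable.
        # The cache is only an optimization, so the error is swallowed.
        if tmp_path is not None and os.path.exists(tmp_path):
            try:
                os.remove(tmp_path)
            except OSError:
                pass
        return False
```

A caller would ignore the return value on the build path and simply recompute the height next time if the write did not succeed.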

@filipnavara
Member

> I expected this might be an issue- I think the right thing to do is to silently swallow the exception.

Yeah, that would likely work. Alternatively there could be some retry logic (some other MSBuild tasks do that although I don't have any specific example). I can test it if you come up with something, or prototype it and test on our machines.

@AArnott
Collaborator

AArnott commented Sep 4, 2020

> Alternatively there could be some retry logic

I doubt that would be of any use in the failed write case. Whoever is writing to the file is going to write the same content that the failed code wanted to write, right? 😏👉

@AArnott
Collaborator

AArnott commented Sep 4, 2020

Caching the height associated with the commit and the location of the version.json file would lead to faulty cache hits when path filters are in use.
Making the 'one version.json file' case faster should be a separate PR, IMO. It aligns well with other ideas in #114. Simply caching the height based on the (project directory, commit) tuple seems like the safest and best scope for this PR.

I'm not opposed to storing this data in the .git directory, but I am a little concerned about the idea. I think I like it better than polluting the source tree with cache files (as we'd have to do if we were to store it next to the version.json file). But I think it's irrelevant anyway because given path filters, different projects can get different heights computed even from the same version.json file.
But anyway, I am optimistic we can make that much faster for the "one version.json" case. Just in another PR.
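
The path filter concern can be illustrated with a deliberately simplified Python model. It assumes a linear history and treats a path filter as a list of included path prefixes; `filtered_height`, the commit records, and the file paths are all hypothetical:

```python
from typing import Dict, List, Sequence


def filtered_height(commits: Sequence[Dict[str, List[str]]],
                    include_prefixes: Sequence[str]) -> int:
    """Count the commits that changed at least one file under an included prefix."""
    return sum(
        1 for commit in commits
        if any(path.startswith(prefix)
               for path in commit["changed"]
               for prefix in include_prefixes)
    )


# Four commits since the version was last changed, all governed by one version.json.
toy_history = [
    {"changed": ["src/A/a.cs"]},
    {"changed": ["src/B/b.cs"]},
    {"changed": ["src/A/a2.cs", "src/B/b2.cs"]},
    {"changed": ["src/A/a3.cs"]},
]

height_a = filtered_height(toy_history, ["src/A/"])    # project A's filter
height_b = filtered_height(toy_history, ["src/B/"])    # project B's filter
height_all = filtered_height(toy_history, ["src/"])    # no effective filtering
```

Here the two projects end up with heights 3 and 2 at the same commit, so a cache keyed only by commit and version.json location would return the wrong value for one of them.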

@AArnott
Collaborator

AArnott commented Sep 4, 2020

I'm reviewing this PR now. I may push some changes to it so it can merge sooner.

@AArnott
Collaborator

AArnott commented Sep 4, 2020

Also: regarding repeatedly parsing... I thought I already had a cache so that if the version.json blob was the same hash we'd reuse a prior parse result. If that's not there, I totally support adding it.

@AArnott
Collaborator

AArnott commented Sep 4, 2020

After reading more about git objects (thanks for the suggestion, @qmfrederik), I think that's a better place to store the cache file than next to the version.json file. I had originally planned/expected to put the cache file in the project's intermediate output directory.
But any ordinary file has limited benefits or increased complexity when you're checking out various commits and want caching to apply across all of them. The git objects db solves this quite elegantly. Non-cached speeds will only be seen the first time you're on an uncached commit or after a git prune.

@djluck After reviewing your changes, I'd like to try implementing in a different way that I think will allow more accurate and effective caching, and possibly fit in with the existing code with fewer changes -- I'm not sure. I'm going to try it today. Whether I take this PR or not, please know your efforts are appreciated and valuable. Same with those who have tested your PR for perf improvements. We'll either take this PR or a PR that is at least as good, and having those perf measurements really helped to validate that this is a useful thing to do.

@AArnott
Collaborator

AArnott commented Sep 4, 2020

@djluck said:

> the major costs can be reading and deserializing JSON (GC costs start to stack up here too).

See #507 to fix this.

@AArnott
Collaborator

AArnott commented Sep 4, 2020

Thinking more about the git object database, I'm having second thoughts. Since it's a content addressable store, I can't store a value keyed by something else. It's only keyed by the hash of the content, and it's the content I don't know when I need the cached value, so I can't very well predict the hash to look up the content.
So using anything under the .git directory to cache this would "not belong" there. I wouldn't want to create "corrupt" objects and store them in the object db by making up keys that aren't hashes of the content.

Now, maybe we could still store it under the .git directory but 'somewhere else'. But I'll weigh that against what I think we can get while storing cache files in other places in build output directories.
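
For readers unfamiliar with the content-addressing point, this short Python sketch shows how Git derives a blob's ID from its content alone (SHA-1 over a `blob <size>\0` header plus the bytes). The `git_blob_id` name is just for illustration:

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


empty_id = git_blob_id(b"")
height_id = git_blob_id(b"1636")
```

The ID is a function of the stored bytes only, so a cached height cannot be looked up by commit and path without already knowing the value being looked up.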

@djluck
Author

djluck commented Sep 4, 2020

@AArnott thanks for your review and thoughts. I absolutely would not see this PR being thrown away as wasted effort, and as it's my first attempt at contributing to this code base I understand there are probably better ways to accomplish this PR's aims. If you want to use this PR and ignore the "optimize for a single version.json" use case, reverting the last commit in this PR should get us there. Let me know if there's anything I can do to help.

@djluck
Author

djluck commented Sep 4, 2020

> Caching the height associated with the commit and the location of the version.json file would lead to faulty cache hits when path filters are in use.

I think this should be safe enough if you associate the cache location with the bottom-most version.json file (which is what this PR aims to do). In the worst case, you'll be leaving some perf on the table by not making use of a cache further up the hierarchy (if path filters are not in use), but if path filters are in use, the cache should still be correct.

@AArnott
Collaborator

AArnott commented Sep 4, 2020

Thank you for your spirit, @djluck.

> ignore the "optimize for a single version.json" use case,

Yes, I think the single version.json file case is totally solved by #508 (provided a simple opt-in by the consuming repo). So I think I'd like to focus the remaining perf work on caching that does not require opt in and thus must preserve the same version height result in every case.
My day here is pretty much over, so I'll take a fresh look at your PR tomorrow. Thanks for the tip on rolling back just your last commit. I may start with that.

@AArnott
Collaborator

AArnott commented Sep 24, 2020

@djluck Given the above discussion it looks like I may have dropped the ball. Did you close the PR because I took too long, or because you feel it's no longer worth reviewing?

@djluck
Author

djluck commented Sep 25, 2020

@AArnott this wasn't intentional- I originally pushed this change to master and had to undo this in order to submit my other PR. I pushed this set of changes into a new branch (https://github.com/djluck/Nerdbank.GitVersioning/tree/version-caching-perf-improv), not sure how to update this MR to change the source branch however.

@AArnott
Collaborator

AArnott commented Sep 25, 2020

GitHub doesn't let you change the source branch of a PR, so you'll have to open a new one.


4 participants