Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NuGet feed non-determinism is preventing correct source build behavior #4206

Closed
mthalman opened this issue Mar 11, 2024 · 11 comments · Fixed by dotnet/installer#19114
Closed
Assignees
Labels
area-infra Source-build infrastructure and reporting

Comments

@mthalman
Copy link
Member

mthalman commented Mar 11, 2024

The changes to update to a newer SDK version for the VMR revealed issues with how NuGet is loading packages from the local feeds defined by source build. The results of builds with those changes yielded builds that would sometimes succeed and sometimes fail. The ones which failed showed a lot of poison leaks.

Let's take a look at just one of those leaks:

  <File Path="artifacts/assets/Release/dotnet-sdk-x.y.z/sdk/x.y.z/DotnetTools/dotnet-watch/x.y.z/tools/netx.y/any/BuildHost-netcore/Humanizer.dll">
    <Type>AssemblyAttribute</Type>
  </File>

This comes from Roslyn which has a dependency on Humanizer. Humanizer is originally defined in source-build-externals but it also exists in the previously source built artifacts with the same version since the version hasn't been updated in quite a while.

Here's what the source-built-modified NuGet.config file looks like:

    <add key="source-built-transport-runtime" value="/vmr/artifacts/packages/Release/NonShipping//runtime/" />
    <add key="source-built-transport-emsdk" value="/vmr/artifacts/packages/Release/NonShipping//emsdk/" />
    <add key="source-built-transport-cecil" value="/vmr/artifacts/packages/Release/NonShipping//cecil/" />
    <add key="source-built-transport-arcade" value="/vmr/artifacts/packages/Release/NonShipping//arcade/" />
    <add key="source-built-runtime" value="/vmr/artifacts/packages/Release/Shipping//runtime/" />
    <add key="source-built-source-build-externals" value="/vmr/artifacts/packages/Release/Shipping//source-build-externals/" />
    <add key="source-built-emsdk" value="/vmr/artifacts/packages/Release/Shipping//emsdk/" />
    <add key="source-built-command-line-api" value="/vmr/artifacts/packages/Release/Shipping//command-line-api/" />
    <add key="reference-packages" value="/vmr/prereqs/packages/reference/" />
    <add key="previously-source-built" value="/vmr/prereqs/packages/previously-source-built/" />
    <add key="prebuilt" value="/vmr/prereqs/packages/prebuilt/" />

Prior to bootstrapping onto this newer SDK version, the behavior of source build would always load the Humanizer package from the source-built-source-build-externals feed since that is the first feed listed containing this package. But with newer SDK version, there is likely some NuGet change which no longer is consistent wrt the ordering of these local feeds. This non-determinism has always been true and has been documented. Before the implementation seemed to be that local feeds were sequentially processed in order. That no longer seems to be the case. So in some builds we now load Humanizer from the previously-source-built feed. This leads to the poison leak.

The theory is that this can be fixed with the use of package source mappings. The tricky thing with package source mappings is that they operate at the package ID (e.g. name) level which doesn't include the version. So you can't say I want to get this specific package with this specific version from this specific feed. Instead, we'll need to define some rules of which package IDs are defined for which feeds based on the N-1 and current packages.

The good news is that we already have some infrastructure to update package source mappings. That would need to be extended to be more comprehensive. The current logic is meant to integrate with those repos which have NuGet.config files that use package source mappings. So it's necessary to also account for those.

The important part of this logic is that we need to avoid situations where a package source mapping is defined more than once such that nondeterminism would be introduced. Its not inherently bad to have more than package source mapping for a given package ID. But that's only true if you explicitly know that there is no version conflict between those two feeds.

Some rules:

  • If a package exists in the currently built package list, add a package source mapping for it that targets its source-built feed.
  • If a package exists in the N-1 package list, add a package source mapping for it that targets the previously-source-built feed. Only do this if a package source mapping doesn't already exist for current.
  • For those packages that come from SBRP, these are guaranteed to not have conflicting versions between current and N-1 so these can always be added.
  • For any of the above cases where a package source mapping is added for a package ID, remove any existing package source mapping with that package ID from the originally defined NuGet.config. In other words, a product repo may have defined a package source mapping for a package. But if we also define that package in SBRP, for example, we want SBRP to be used instead.

Here's an example build (internal only) demonstrating the poison leaks.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-infra Source-build infrastructure and reporting untriaged labels Mar 11, 2024
@MichaelSimons
Copy link
Member

One other point is that a wild card package source mapping will need to be added in online builds for the repo's existing feeds. This needs to be done for repo's that don't already use package source mappings. This is because you can't have a mixture of non-mapping feeds and mapping feeds.

@ViktorHofer
Copy link
Member

But with newer SDK version, there is likely some NuGet change which no longer is consistent wrt the ordering of these local feeds. This non-determinism has always been true and has been documented. Before the implementation seemed to be that local feeds were sequentially processed in order. That no longer seems to be the case.

cc @dotnet/nuget-team

@zivkan
Copy link
Member

zivkan commented Mar 14, 2024

Humanizer is originally defined in source-build-externals but it also exists in the previously source built artifacts with the same version since the version hasn't been updated in quite a while.

I'm curious why the same package id and same version exists in different contents with different contents? Yes, Package Source Mapping is supposed to help with this, but most customers aren't re-building packages that already exist. I get that .NET and source build need to do things that most customers won't ever need to do themselves.

But would it be easier to focus on why the package with different behaviour exists in the first place? Shouldn't source-build be re-creating packages that are 100% compatible with the original that they're building?

There's a good chance that my change caused the unintended change in behaviour in this commit: NuGet/NuGet.Client@1942c17

Another customer (sdk repo?) found another issue, which was traced to two different sources using the same URL, but different keys (in nuget.config, sources are added with <add name="key" value="url" />). Perhaps the code that previously deduplicated the sources was doing so which changed the order that nuget uses internally when running the restore.

@mthalman
Copy link
Member Author

The content in Humanizer is the same between these two packages. That's not the issue. The difference is, in one feed Humanizer comes from the previously released artifacts, and in the other feed it contains the current source of Humanizer just built. The content between these two packages is the same but because source build has strict rules to ensure patchability of the source code, we cannot allow packages that come from the previously released artifacts to show up in the output of the current build. They can only be used for referencing. If they do show up in the output, it's known as poisoning.

Interestingly enough, there is another scenario that is broken by this which does have content differences and that comes from nuget-client itself. This issue showed up in our tests: dotnet/installer#18763 (comment). There was an API update in the NuGet.Commands package that NuGet.Build.Tasks was updated to be dependent on. But during the build, we ended up loading the old version of the NuGet.Commands package from the previously released artifacts feed which had the old API and was now incompatible with NuGet.Build.Tasks. The two packages had the same version between the two feeds because of however nuget-client does its package versioning. They both had a package version of 6.0.10-preview.2.32767 even though they came from different commits.

@zivkan
Copy link
Member

zivkan commented Mar 14, 2024

there is another scenario that is broken by this which does have content differences and that comes from nuget-client itself.

As you know, the NuGet.Client repo doesn't use the Arcade SDK. I have no idea how the Arcade SDK does versioning, but with NuGet Client, we use Azure DevOps' counter feature to give us a unique number every build, even for the same commit. When building the product, we expect BuildNumber to be passed on the command line. When you don't pass a build number, it intentionally uses ushort.Max, in order to simplify debugging, especially our VS extension. It doesn't need to be the same number every commit, but it does need to be a number that's always higher than any version that we've actually inserted into Visual Studio.

We're not opposed to using the Arcade SDK (well, except for pinning to a specific .NET SDK), but it's just not high on our priority list given the high cost. You've probably had this conversation before, but I imagine that if another team that benefits from the migration funds the effort (does it and creates a PR for us to merge), I don't see any major reason why we wouldn't take it. But any changes to build artifact names will need to be cherry-picked onto all of our servicing branches, so that our release/publishing pipelines still work. If someone is seriously going to do this work, we can answer any questions about the build scripts to help out.

@MichaelSimons
Copy link
Member

Capturing in-person team discussion:

  1. Source-build scenarios will always have non-identical packages with the same identity/version from multiple feeds. e.g. for source-build externals the package will exist in the current feed and n-1 feed. In order for this to not flag issues in the poisoning infra, we must use the package from current. The proposed package source mapping solution will resolve this issue.
  2. Adding the package source mappings is not a quick fix. To mitigate for P3, a source-build NuGet patch will be carried to revert the offending change.
  3. In P4, the package source mapping solution will be implemented and the patch will be removed.

@NikolaMilosavljevic
Copy link
Member

@zivkan I'm trying to confirm that the commit you mentioned above (NuGet/NuGet.Client@1942c17) is the root cause. I tried to do a simple git revert without much luck due to other refactoring changes that were made to the same source files - too many hard-to-untangle merge conflicts.

If we do know that this is the commit, I'd like to understand which part of it. Can you help us figure this out so we can do a scoped down fix?

I've checked the binlogs from latest builds and the list of sources dumped by RestoreTask under "Feeds used:" looks correct - matches the order in NuGet.config. Restore logging (i.e. "Installed 'package' from 'feed' ...") does show incorrect (or unexpected) feeds being used.

@zivkan
Copy link
Member

zivkan commented Mar 15, 2024

It was just a guess that commit might be responsible. There's also been a lot of performance improvements in the graph resolver, and I can't rule out that any of those changes might have changed the order of local feeds used.

If it's not too difficult to re-test restore on the VMR, you can "easily" test different commits on NuGet with the following:

git checkout HEAD~1 # or a specific commit, if you prefer
dotnet build -c Release path/to/NuGet.Client/src/NuGet.Core/NuGet.Build.Tasks.Console
dotnet restore -p:NuGetRestoreTargets="path/to/NuGet.Client/artifacts/NuGet.Build.Tasks.Console/bin/Release/net7.0/NuGet.targets"

Then repeat until you find a commit that doesn't have the changed behaviour, then you know the previous one was the culprit.

@NikolaMilosavljevic
Copy link
Member

This seems to be the commit that introduced the regression: NuGet/NuGet.Client@2d388e3

Will try to patch-revert and test in full source-build.

@NikolaMilosavljevic
Copy link
Member

NikolaMilosavljevic commented Mar 15, 2024

This seems to be the commit that introduced the regression: NuGet/NuGet.Client@2d388e3

Will try to patch-revert and test in full source-build.

It's not it - it just didn't repro on first try. Continuing search.

@NikolaMilosavljevic
Copy link
Member

OK - this commit is definitely the culprit: NuGet/NuGet.Client@f54c58f

Not sure how easy it would be to revert-patch, but I'm going to try.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-infra Source-build infrastructure and reporting
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants