Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When BeforeClusterUpgrade is blocking an upgrade new MachineDeployments are being created with new version #8695

Closed
ykakarap opened this issue May 19, 2023 · 8 comments · Fixed by #8628
Labels
area/clusterclass Issues or PRs related to clusterclass kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@ykakarap
Copy link
Contributor

ykakarap commented May 19, 2023

What steps did you take and what happened?

Steps to reproduce:

  • Setup a RuntimeExtension that returns a blocking response for the BeforeClusterUpgrade hook
  • Create a new clusterclass based cluster at version v1.25.3
  • Edit the cluster and upgrade the version to v1.26.2 and add a new MachineDeployment
  • New MD is created with the version v1.26.2

What did you expect to happen?

The MachineDeployment should not be created with version v1.26.2 since the upgrade is blocked by the runtime extension.

Cluster API version

main

Kubernetes version

No response

Anything else you would like to add?

Since none of the other components of the cluster have picked up the new version (v1.26.2), specifically since the ControlPlane has not yet picked up the new version it looks unsafe to allow creating new MachineDeployments to be created with the new version.

Label(s) to be applied

/kind bug
/area clusterclass
/area topology
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. area/clusterclass Issues or PRs related to clusterclass labels May 19, 2023
@k8s-ci-robot
Copy link
Contributor

@ykakarap: The label(s) area/topology cannot be applied, because the repository doesn't have them.

In response to this:

What steps did you take and what happened?

Steps to reproduce:

  • Setup a RuntimeExtension that returns a blocking response for the BeforeClusterUpgrade hook
  • Create a new clusterclass based cluster at version v1.25.3
  • Edit the cluster and upgrade the version to v1.26.2 and add a new MachineDeployment
  • New MD is created with the version v1.26.2

What did you expect to happen?

The MachineDeployment should be created with version v1.25.3 since the upgrade is blocked by the runtime extension.

Cluster API version

main

Kubernetes version

No response

Anything else you would like to add?

Since none of the other components of the cluster have picked up the new version (v1.26.2), specifically since the ControlPlane has not yet picked up the new version it looks unsafe to allow creating new MachineDeployments to be created with the new version.

Label(s) to be applied

/kind bug
/area clusterclass
/area topology
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 19, 2023
@killianmuldoon
Copy link
Contributor

/triage accepted

Since none of the other components of the cluster have picked up the new version (v1.26.2), specifically since the ControlPlane has not yet picked up the new version it looks unsafe to allow creating new MachineDeployments to be created with the new version.

Completely agree. Would it be easiest to prevent this in the upgrade part of the topology code? i.e. to never allow an MD to get ahead of the control plane version, instead of trying to make the runtime hook call more conditional / complex.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 19, 2023
@sbueringer
Copy link
Member

sbueringer commented May 19, 2023

The MachineDeployment should be created with version v1.25.3 since the upgrade is blocked by the runtime extension.

I think this conflicts with our work on: #8656

Seems to me the right thing to do is to not create the MD until CP is upgraded.

@killianmuldoon
Copy link
Contributor

Seems to me the right thing to do is to not create the MD until CP is upgraded.

This seems like it could have unintended consequences - the runtime extension could be blocking to wait for some maintenance window or similar. I think when we implemented this hook the idea was to ensure that no other Cluster operations are blocked.

When this hook is blocking the CP can't be upgraded so new machines should be able to join etc.

@sbueringer
Copy link
Member

sbueringer commented May 19, 2023

The problem is that if we create an MD with 1.25.3 but with the new state of the Cluster topology we end up with mixed configuration. The 1.25.3 version with whatever else a user changed in Cluster topology.

It was a major goal of avoiding double rollouts to avoid those scenarios.

@sbueringer
Copy link
Member

This seems like it could have unintended consequences - the runtime extension could be blocking to wait for some maintenance window or similar. I think when we implemented this hook the idea was to ensure that no other Cluster operations are blocked.

I think a consequence of holding back "other changes" to roll them out together with the version is that if the version is blocked by a hook "other changes" are blocked as well. So we will definitely block operations like scaling up / changing config of existing MDs. That's the idea of avoiding double rollouts.

@ykakarap
Copy link
Contributor Author

ykakarap commented May 19, 2023

Would it be easiest to prevent this in the upgrade part of the topology code? i.e. to never allow an MD to get ahead of the control plane version, instead of trying to make the runtime hook call more conditional / complex.

Yes. I agree that the right way to approach this is to handle the case in the topology controller and not try to make the hook call more complex.

The MachineDeployment should be created with version v1.25.3 since the upgrade is blocked by the runtime extension.

I think this conflicts with our work on: #8656

Seems to me the right thing to do is to not create the MD until CP is upgraded.

I should have worded this better. The correct expectation is a MD should not be created with version v1.26.2.

To align with #8656 we should not create the new MachineDeployment.

@fabriziopandini
Copy link
Member

fabriziopandini commented May 22, 2023

I agree.
The topology controller is responsible for the entire cluster, and given that many users perceived as a bug double rollouts happening, #8656 describes this in detail, I think we should be consistent and prefer stability over the speed (btw this is the same approach we are using in KCP)

What is key IMO is to clearly surface what the topology controller is doing in conditions at the cluster level; then, after we gather some more feedback we can decide if and how to allow users to relax some of the checks if necessary.

We should also make it clear in the BeforeClusterUpgrade hook what are the implications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/clusterclass Issues or PRs related to clusterclass kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants