
Upgrade causes POD failures if the cluster scales during the master upgrade. #7323

Closed
rdrgmnzs opened this issue Jul 25, 2019 · 12 comments
Labels: lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments

@rdrgmnzs (Contributor) commented Jul 25, 2019

1. What kops version are you running? The command kops version will display
this information.

1.13.0-beta.2

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

v1.12 -> 1.13

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
After setting the new config to k8s 1.13 I started a cluster rotation with:
kops rolling-update cluster --yes

5. What happened after the commands executed?
The rolling update started upgrading the master nodes. However, because I use the cluster-autoscaler, new nodes started coming up with 1.13 before all the masters were upgraded. This would normally have been fine, but the change made in kubernetes/kubernetes#74529 requires the kubelet to be at the same version as or older than the API server. Because of that change, Pods scheduled onto the new hosts brought up by auto-scaling fail with Error: nil pod.spec.enableServiceLinks encountered, cannot construct envvars

Kops k8s versioning is global, so there is no way to upgrade just the masters before you upgrade the nodes if you use auto-scaling.
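
A quick way to see whether autoscaled nodes have jumped ahead of the control plane during the rolling update is to compare the API server version with the kubelet version each node reports (generic kubectl commands, nothing kops-specific):

# API server (control plane) version
kubectl version --short

# kubelet version reported by every node; any node newer than the API server
# hits the skew problem described above
kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion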

6. What did you expect to happen?
For k8s to be able to handle the upgrade with a newer version of the kubelet coming up.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

N/A

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

N/A

9. Anything else we need to know?

@Nuru commented Sep 5, 2019

I had a similar problem with a kops cluster running in AWS.

I used kops 1.13.0 to upgrade the cluster from 1.12.8 to 1.13.10 while also adding new node instance groups, so that I could later install the cluster autoscaler to manage them. Specifically, I edited the cluster manifest to change the Kubernetes version number, then used kops replace --force to install the new instance groups, then ran kops update cluster --yes and kops rolling-update cluster --yes.
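
In concrete terms the sequence was roughly the following (cluster.yaml stands in for my manifest file):

# bump kubernetesVersion in the manifest, then:
kops replace -f cluster.yaml --force    # install the new instance groups
kops update cluster --yes               # push new launch configurations / userdata
kops rolling-update cluster --yes       # replace instances, masters first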

The new IG nodes came up with 1.13.10 before the masters were updated and failed to join the cluster, causing the cluster to fail validation and halting the rolling update.

I was unable to pinpoint the exact cause of the failure, but the stuck state was that the instance running 1.13.10 reported:

KubeletNotReady  runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

We are specifying calico networking in our manifest:

  networking:
    calico: {}

but the calico pod won't initialize:

NAME               READY   STATUS    RESTARTS  AGE
calico-node-hsft6  0/1     Init:0/1  0         10h

kubectl describe pod calico-node-hsft6 shows that the calico init container, which should install the CNI config, is waiting to run.

Init Containers:
  install-cni:
    Container ID:  
    Image:         quay.io/calico/cni:v3.7.4
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
...
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 

This does not make sense to me. According to the documentation, the state PodInitializing means "The Pod has already finished executing Init Containers" but this container appears not to have been run.

I notice ContainersReady is false, but I'm not sure what to make of that. It looks to me like Docker is complaining about the CNI not being initialized, and that is preventing the containers from starting, but of course that should be expected if you are going to use a container to initialize the CNI.

I don't know enough about the startup sequence to debug this further. I welcome suggestions.
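
For anyone who wants to dig in, these are the generic checks I would start with on the stuck node (pod name taken from the output above; paths are the usual kubelet/CNI defaults, so adjust for your setup):

# what the kubelet itself is complaining about
journalctl -u kubelet --no-pager | tail -n 50

# whether any CNI config has been written yet (Calico's install-cni writes here)
ls -l /etc/cni/net.d/

# logs from the init container that is supposed to install the CNI config
kubectl -n kube-system logs calico-node-hsft6 -c install-cni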

Steps to reproduce

This is easy to reproduce, given a cluster running Kubernetes 1.12.8 using calico networking; a command-level sketch follows the steps below.

  • Edit the cluster to set the Kubernetes version to 1.13.10
  • Edit a node instance group minSize and maxSize so that the AWS autoscaler will create a new node
  • Run kops update cluster --yes

That's it. Do NOT run kops rolling-update cluster.

The new node will be created but fail to join the cluster.
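
In kops terms, the steps above are roughly the following (the IG name nodes is just the default; use whichever instance group you scale):

kops edit cluster          # set kubernetesVersion: 1.13.10
kops edit ig nodes         # raise minSize/maxSize so a new node gets created
kops update cluster --yes  # push the new userdata; do NOT rolling-update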

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 4, 2019
@rdrgmnzs (Contributor, Author) commented Dec 4, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 4, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 3, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 2, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rifelpet (Member) commented Jun 15, 2020

We just experienced this during an upgrade from 1.16 -> 1.17.

It seems like we need a way to update the userdata of the master ASGs in a separate step from the node ASGs, so that any node autoscaling would create new nodes with the old k8s version until all masters have been upgraded to the new k8s version. Something like:

kops update cluster --yes      # somehow only update master ASGs + dependencies

kops rolling-update cluster --yes --instance-group-roles=Master

kops update cluster --yes      # somehow only update node ASGs + dependencies

kops rolling-update cluster --yes

An interim workaround could be disabling the cluster-autoscaler before kops update cluster, though that could leave pods unschedulable and wouldn't help if a node were terminated due to a failing ASG health check.
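
A sketch of that workaround, assuming the autoscaler runs as the usual cluster-autoscaler Deployment in kube-system (names will differ depending on how it was installed):

kubectl -n kube-system scale deployment cluster-autoscaler --replicas=0  # pause autoscaling
kops update cluster --yes
kops rolling-update cluster --yes
kubectl -n kube-system scale deployment cluster-autoscaler --replicas=1  # resume autoscaling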

I also wonder how #8198 would affect this: with kops-controller serving artifacts to nodes, we'll need to make sure they receive the desired version of those artifacts during an upgrade.

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jun 15, 2020
@k8s-ci-robot (Contributor)

@rifelpet: Reopened this issue.

In response to this:

We just experienced this during an upgrade from 1.16 -> 1.17.

It seems like we need a way to update the userdata of the master ASGs in a separate step from the node ASGs, that way any node autoscaling would create new nodes with the old k8s version until all masters have been upgraded to the new k8s version.

kops update cluster --yes      # somehow only update master ASGs + dependencies

kops rolling-update cluster --yes --instance-group-roles=Master

kops update cluster --yes      # somehow only update node ASGs + dependencies

kops rolling-update cluster --yes

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rifelpet rifelpet changed the title Upgrade from 1.12 to 1.13+ cause POD failures if the cluster scales during the master upgrade. Upgrade causes POD failures if the cluster scales during the master upgrade. Jun 15, 2020
@olemarkus (Member)

Kops does upgrades ASG by ASG, and the master ones before the node ones. I think quite a lot of users would benefit from having a step in between the rolling updates of each ASG.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
