
Upgrade causes POD failures if the cluster scales during the master upgrade. #7323

Closed
rdrgmnzs opened this issue Jul 25, 2019 · 12 comments
Labels: lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)

Comments

@rdrgmnzs (Contributor) commented Jul 25, 2019

1. What kops version are you running? The command kops version will display
this information.

1.13.0-beta.2

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

v1.12 -> 1.13

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
After setting the new config to k8s 1.13 I started a cluster rotation with:
kops rolling-update cluster --yes

5. What happened after the commands executed?
The rolling update started upgrading the master nodes. However, because I use the cluster-autoscaler, new nodes started coming up with 1.13 before all the masters were upgraded. This would normally have been fine, but the change made in kubernetes/kubernetes#74529 requires the kubelet to be at the same version as or older than the API server. Because of that change, Pods scheduled onto the new hosts brought up by auto-scaling fail with Error: nil pod.spec.enableServiceLinks encountered, cannot construct envvars

Kops k8s versioning is global, so there is no way to upgrade just the masters before you upgrade the nodes if you use auto-scaling.
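
A quick way to see whether autoscaled nodes have jumped ahead of the control plane during the rolling update is to compare the API server version with the kubelet version each node reports (generic kubectl commands, nothing kops-specific):

# API server (control plane) version
kubectl version --short

# kubelet version reported by every node; any node newer than the API server
# hits the skew problem described above
kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion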

6. What did you expect to happen?
For k8s to be able to handle the upgrade with a newer version of the kubelet coming up.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

N/A

8. Please run the commands with most verbose logging by adding the -v 10 flag.
Paste the logs into this report, or in a gist and provide the gist link here.

N/A

9. Anything else we need to know?

@Nuru commented Sep 5, 2019

I had a similar problem with a kops cluster running in AWS.

I used kops 1.13.0 to upgrade the cluster from 1.12.8 to 1.13.10 while also adding new node instance groups, so that I could later install the cluster autoscaler to manage them. Specifically, I edited the cluster manifest to change the Kubernetes version number, then used kops replace --force to install the new instance groups, then ran kops update cluster --yes and kops rolling-update cluster --yes.
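
In concrete terms the sequence was roughly the following (cluster.yaml stands in for my manifest file):

# bump kubernetesVersion in the manifest, then:
kops replace -f cluster.yaml --force    # install the new instance groups
kops update cluster --yes               # push new launch configurations / userdata
kops rolling-update cluster --yes       # replace instances, masters first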

The new IG nodes came up with 1.13.10 before the masters were updated and failed to join the cluster, causing the cluster to fail validation and halting the rolling update.

I was unable to pinpoint the exact cause of the failure, but the stuck state was that the instance running 1.13.10 reported:

KubeletNotReady  runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

We are specifying calico networking in our manifest:

  networking:
    calico: {}

but the calico pod won't initialize:

NAME               READY   STATUS    RESTARTS  AGE
calico-node-hsft6  0/1     Init:0/1  0         10h

kubectl describe pod calico-node-hsft6 shows that the calico init container, which should install the CNI config, is waiting to run.

Init Containers:
  install-cni:
    Container ID:  
    Image:         quay.io/calico/cni:v3.7.4
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /install-cni.sh
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
...
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 

This does not make sense to me. According to the documentation, the state PodInitializing means "The Pod has already finished executing Init Containers" but this container appears not to have been run.

I notice ContainersReady is false, but I'm not sure what to make of that. It looks to me like Docker is complaining about the CNI not being initialized, and that is preventing the containers from starting, but of course that should be expected if you are going to use a container to initialize the CNI.

I don't know enough about the startup sequence to debug this further. I welcome suggestions.
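
For anyone who wants to dig in, these are the generic checks I would start with on the stuck node (pod name taken from the output above; paths are the usual kubelet/CNI defaults, so adjust for your setup):

# what the kubelet itself is complaining about
journalctl -u kubelet --no-pager | tail -n 50

# whether any CNI config has been written yet (Calico's install-cni writes here)
ls -l /etc/cni/net.d/

# logs from the init container that is supposed to install the CNI config
kubectl -n kube-system logs calico-node-hsft6 -c install-cni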

Steps to reproduce

This is easy to reproduce, given a cluster running Kubernetes 1.12.8 using calico networking; a command-level sketch follows the steps below.

  • Edit the cluster to set the Kubernetes version to 1.13.10
  • Edit a node instance group minSize and maxSize so that the AWS autoscaler will create a new node
  • Run kops update cluster --yes

That's it. Do NOT run kops rolling-update cluster.

The new node will be created but fail to join the cluster.
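
In kops terms, the steps above are roughly the following (the IG name nodes is just the default; use whichever instance group you scale):

kops edit cluster          # set kubernetesVersion: 1.13.10
kops edit ig nodes         # raise minSize/maxSize so a new node gets created
kops update cluster --yes  # push the new userdata; do NOT rolling-update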

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 4, 2019
@rdrgmnzs (Contributor, Author) commented Dec 4, 2019

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 4, 2019
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 3, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Apr 2, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rifelpet (Member) commented Jun 15, 2020

We just experienced this during an upgrade from 1.16 -> 1.17.

It seems like we need a way to update the userdata of the master ASGs in a separate step from the node ASGs, so that any node autoscaling would create new nodes with the old k8s version until all masters have been upgraded to the new k8s version. Something like:

kops update cluster --yes      # somehow only update master ASGs + dependencies

kops rolling-update cluster --yes --instance-group-roles=Master

kops update cluster --yes      # somehow only update node ASGs + dependencies

kops rolling-update cluster --yes

An interim workaround could be disabling the cluster-autoscaler before kops update cluster, though that could leave pods unschedulable and wouldn't help if a node were terminated due to a failing ASG health check.
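
A sketch of that workaround, assuming the autoscaler runs as the usual cluster-autoscaler Deployment in kube-system (names will differ depending on how it was installed):

kubectl -n kube-system scale deployment cluster-autoscaler --replicas=0  # pause autoscaling
kops update cluster --yes
kops rolling-update cluster --yes
kubectl -n kube-system scale deployment cluster-autoscaler --replicas=1  # resume autoscaling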

I also wonder how #8198 would affect this: with kops-controller serving artifacts to nodes, we'll need to make sure they receive the desired version of those artifacts during an upgrade.

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Jun 15, 2020
@k8s-ci-robot (Contributor)

@rifelpet: Reopened this issue.

In response to this:

We just experienced this during an upgrade from 1.16 -> 1.17.

It seems like we need a way to update the userdata of the master ASGs in a separate step from the node ASGs, that way any node autoscaling would create new nodes with the old k8s version until all masters have been upgraded to the new k8s version.

kops update cluster --yes      # somehow only update master ASGs + dependencies

kops rolling-update cluster --yes --instance-group-roles=Master

kops update cluster --yes      # somehow only update node ASGs + dependencies

kops rolling-update cluster --yes

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rifelpet rifelpet changed the title Upgrade from 1.12 to 1.13+ cause POD failures if the cluster scales during the master upgrade. Upgrade causes POD failures if the cluster scales during the master upgrade. Jun 15, 2020
@olemarkus (Member)

Kops does upgrades ASG by ASG, and the master ones before the node ones. I think quite a lot of users would benefit from having a step in between the rolling updates of each ASG.

@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot (Contributor)

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
