
Spot instances have status of Ready,SchedulingDisabled #560

Closed
sudo-justinwilson opened this issue Nov 12, 2020 · 3 comments

Comments


sudo-justinwilson commented Nov 12, 2020

What happened:
I'm running Kubernetes 1.17 and use an Auto Scaling group with a mix of On-Demand and Spot instances, both using the amazon-eks-node-1.14-v20201007 AMI. Some of the Spot instances have a SchedulingDisabled status, which indicates the node has been cordoned, but I am certain that nobody has done this manually:

ip-10-2-34-223.ap-southeast-2.compute.internal   Ready,SchedulingDisabled   <none>   114m    v1.14.9-eks-cc7316   10.2.34.223   <none>        Amazon Linux 2   4.14.198-152.320.amzn2.x86_64   docker://19.3.6
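
To double-check, the cordon flag can be read straight off the node object (node name copied from the output above):

# Prints "true" when the node has been cordoned (spec.unschedulable is set).
kubectl get node ip-10-2-34-223.ap-southeast-2.compute.internal -o jsonpath='{.spec.unschedulable}'

# Clears the cordon by hand; the STATUS column should return to plain Ready.
kubectl uncordon ip-10-2-34-223.ap-southeast-2.compute.internal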

What you expected to happen:
I expect the nodes to have a Ready status.

How to reproduce it (as minimally and precisely as possible):
On a Kubernetes 1.17 EKS cluster, launch a worker node using the amazon-eks-node-1.14-v20201007 AMI, with the following user-data:

#!/bin/bash
set -o xtrace

# Region and instance ID come from the EC2 instance metadata service.
export AWS_DEFAULT_REGION="$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | grep -oP '\"region\"[[:space:]]*:[[:space:]]*\"\K[^\"]+')"
iid="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

# InstanceLifecycle is "spot" for Spot instances, empty/None otherwise.
ilc="$(aws ec2 describe-instances --instance-ids "$iid" --query 'Reservations[0].Instances[0].InstanceLifecycle' --output text)"

if [ "$ilc" = "spot" ]; then
  # Spot nodes get a lifecycle label and a soft PreferNoSchedule taint.
  /etc/eks/bootstrap.sh --kubelet-extra-args '--node-labels=lifecycle=Spot --cluster-dns=169.254.20.10 --register-with-taints=spotInstance=true:PreferNoSchedule' --apiserver-endpoint '${aws_eks_cluster.eks.endpoint}' --b64-cluster-ca '${aws_eks_cluster.eks.certificate_authority[0].data}' 'eks-${var.environment}'
else
  /etc/eks/bootstrap.sh --kubelet-extra-args '--node-labels=lifecycle=OnDemand --cluster-dns=169.254.20.10' --apiserver-endpoint '${aws_eks_cluster.eks.endpoint}' --b64-cluster-ca '${aws_eks_cluster.eks.certificate_authority[0].data}' 'eks-${var.environment}'
fi
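
To rule out the user-data itself, which branch ran can be verified on the node (a rough check; /var/log/cloud-init-output.log is the Amazon Linux 2 default location, and xtrace echoes each command there):

# Confirm which bootstrap.sh invocation actually ran.
sudo grep -n 'bootstrap.sh' /var/log/cloud-init-output.log

# The kubelet should only carry the soft PreferNoSchedule taint registered above.
ps aux | grep '[k]ubelet' | grep -o 'register-with-taints=[^ ]*'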

Anything else we need to know?:

  1. The nodes are underutilised:
NAME                                             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-2-34-223.ap-southeast-2.compute.internal   116m         6%     813Mi           11%
  2. Here are the node taints:
Taints:             node.kubernetes.io/unschedulable:NoSchedule
                    spotInstance=true:PreferNoSchedule
Unschedulable:      true
  3. Here are the events:
Events:
  Type    Reason              Age   From     Message
  ----    ------              ----  ----     -------
  Normal  NodeNotSchedulable  58m   kubelet  Node ip-10-2-34-223.ap-southeast-2.compute.internal status is now: NodeNotSchedulable
  4. I use cluster-autoscaler, but its logs say nothing about cordoning this node (see the log-inspection sketch after this list):
...
I1112 07:43:48.203376       1 scale_down.go:421] Node ip-10-2-34-223.ap-southeast-2.compute.internal - cpu utilization 0.924870
I1112 07:43:48.203389       1 scale_down.go:424] Node ip-10-2-34-223.ap-southeast-2.compute.internal is not suitable for removal - cpu utilization too big (0.924870)
...
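
Note that the "cpu utilization 0.924870" in the autoscaler log is computed from pod resource requests, not live usage, which is why it disagrees with the 6% shown in the first item. To search the autoscaler logs for any cordon/drain activity on this node (the deployment name assumes the usual kube-system install; adjust to your setup):

# Any cordon, drain, or scale-down deletion activity touching the node.
kubectl -n kube-system logs deployment/cluster-autoscaler --since=24h \
  | grep -iE 'cordon|drain|ToBeDeleted|ip-10-2-34-223'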

Environment:

  • AWS Region: ap-southeast-2
  • Instance Type(s): m5.large
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.2
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.17
  • AMI Version: ami-087315adc4086bcef
  • Kernel (e.g. uname -a): Linux ip-10-2-34-223.ap-southeast-2.compute.internal 4.14.198-152.320.amzn2.x86_64 #1 SMP Wed Sep 23 23:57:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-01f93ce28477e7be1"
BUILD_TIME="Wed Oct  7 19:11:36 UTC 2020"
BUILD_KERNEL="4.14.198-152.320.amzn2.x86_64"
ARCH="x86_64"
sudo-justinwilson (Author) commented

Is this issue in the correct GitHub repo? I'd be happy to post it elsewhere if need be.

rtripat (Contributor) commented Nov 20, 2020

@sudo-justinwilson How were these nodes created? Managed node groups, or self-managed nodes with aws-node-termination-handler deployed?
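
If it's the latter, one quick way to check is to look for the handler's DaemonSet (assuming the default name and namespace from its manifests):

# Returns the DaemonSet if aws-node-termination-handler is installed.
kubectl get daemonset -n kube-system aws-node-termination-handler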

sudo-justinwilson (Author) commented

Hi @rtripat. Thanks for your reply. I've isolated this problem to cluster-autoscaler, so I will close this issue and open one in the cluster-autoscaler repository. Cheers.
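
For anyone who lands here later: cluster-autoscaler records its scale-down candidates in a status ConfigMap (cluster-autoscaler-status is its default name), which makes the cordoning visible without log-diving:

# Shows scale-down candidates and recent autoscaler activity.
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml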
