
Spot instances have status of Ready,SchedulingDisabled #560

Closed
sudo-justinwilson opened this issue Nov 12, 2020 · 3 comments

Comments


sudo-justinwilson commented Nov 12, 2020

What happened:
I'm running Kubernetes 1.17 and use an Auto Scaling group with a mix of On-Demand and Spot instances, both using the amazon-eks-node-1.14-v20201007 AMI. Some of the Spot instances have a SchedulingDisabled status, which indicates the node has been cordoned, but I am certain that nobody has done this manually:

ip-10-2-34-223.ap-southeast-2.compute.internal   Ready,SchedulingDisabled   <none>   114m    v1.14.9-eks-cc7316   10.2.34.223   <none>        Amazon Linux 2   4.14.198-152.320.amzn2.x86_64   docker://19.3.6
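
To double-check, the cordon flag can be read straight off the node object (node name copied from the output above):

# Prints "true" when the node has been cordoned (spec.unschedulable is set).
kubectl get node ip-10-2-34-223.ap-southeast-2.compute.internal -o jsonpath='{.spec.unschedulable}'

# Clears the cordon by hand; the STATUS column should return to plain Ready.
kubectl uncordon ip-10-2-34-223.ap-southeast-2.compute.internal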

What you expected to happen:
I expect the nodes to have a Ready status.

How to reproduce it (as minimally and precisely as possible):
On a Kubernetes 1.17 EKS cluster, launch a worker node using the amazon-eks-node-1.14-v20201007 AMI, with the following user-data:

#!/bin/bash
set -o xtrace

# Region and instance ID come from the EC2 instance metadata service.
export AWS_DEFAULT_REGION="$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | grep -oP '\"region\"[[:space:]]*:[[:space:]]*\"\K[^\"]+')"
iid="$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"

# InstanceLifecycle is "spot" for Spot instances, empty/None otherwise.
ilc="$(aws ec2 describe-instances --instance-ids "$iid" --query 'Reservations[0].Instances[0].InstanceLifecycle' --output text)"

if [ "$ilc" = "spot" ]; then
  # Spot nodes get a lifecycle label and a soft PreferNoSchedule taint.
  /etc/eks/bootstrap.sh --kubelet-extra-args '--node-labels=lifecycle=Spot --cluster-dns=169.254.20.10 --register-with-taints=spotInstance=true:PreferNoSchedule' --apiserver-endpoint '${aws_eks_cluster.eks.endpoint}' --b64-cluster-ca '${aws_eks_cluster.eks.certificate_authority[0].data}' 'eks-${var.environment}'
else
  /etc/eks/bootstrap.sh --kubelet-extra-args '--node-labels=lifecycle=OnDemand --cluster-dns=169.254.20.10' --apiserver-endpoint '${aws_eks_cluster.eks.endpoint}' --b64-cluster-ca '${aws_eks_cluster.eks.certificate_authority[0].data}' 'eks-${var.environment}'
fi
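
To rule out the user-data itself, which branch ran can be verified on the node (a rough check; /var/log/cloud-init-output.log is the Amazon Linux 2 default location, and xtrace echoes each command there):

# Confirm which bootstrap.sh invocation actually ran.
sudo grep -n 'bootstrap.sh' /var/log/cloud-init-output.log

# The kubelet should only carry the soft PreferNoSchedule taint registered above.
ps aux | grep '[k]ubelet' | grep -o 'register-with-taints=[^ ]*'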

Anything else we need to know?:

  1. The nodes are underutilised:
NAME                                             CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-2-34-223.ap-southeast-2.compute.internal   116m         6%     813Mi           11%
  2. Here are the node taints:
Taints:             node.kubernetes.io/unschedulable:NoSchedule
                    spotInstance=true:PreferNoSchedule
Unschedulable:      true
  3. Here are the events:
Events:
  Type    Reason              Age   From     Message
  ----    ------              ----  ----     -------
  Normal  NodeNotSchedulable  58m   kubelet  Node ip-10-2-34-223.ap-southeast-2.compute.internal status is now: NodeNotSchedulable
  4. I use cluster-autoscaler, but its logs say nothing about cordoning this node (see the log-inspection sketch after this list):
...
I1112 07:43:48.203376       1 scale_down.go:421] Node ip-10-2-34-223.ap-southeast-2.compute.internal - cpu utilization 0.924870
I1112 07:43:48.203389       1 scale_down.go:424] Node ip-10-2-34-223.ap-southeast-2.compute.internal is not suitable for removal - cpu utilization too big (0.924870)
...
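
Note that the "cpu utilization 0.924870" in the autoscaler log is computed from pod resource requests, not live usage, which is why it disagrees with the 6% shown in the first item. To search the autoscaler logs for any cordon/drain activity on this node (the deployment name assumes the usual kube-system install; adjust to your setup):

# Any cordon, drain, or scale-down deletion activity touching the node.
kubectl -n kube-system logs deployment/cluster-autoscaler --since=24h \
  | grep -iE 'cordon|drain|ToBeDeleted|ip-10-2-34-223'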

Environment:

  • AWS Region: ap-southeast-2
  • Instance Type(s): m5.large
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.2
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): 1.17
  • AMI Version: ami-087315adc4086bcef
  • Kernel (e.g. uname -a): Linux ip-10-2-34-223.ap-southeast-2.compute.internal 4.14.198-152.320.amzn2.x86_64 #1 SMP Wed Sep 23 23:57:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-01f93ce28477e7be1"
BUILD_TIME="Wed Oct  7 19:11:36 UTC 2020"
BUILD_KERNEL="4.14.198-152.320.amzn2.x86_64"
ARCH="x86_64"
sudo-justinwilson (Author) commented

Is this issue in the correct GitHub repo? I'd be happy to post it elsewhere if need be.

rtripat (Contributor) commented Nov 20, 2020

@sudo-justinwilson How were these nodes created? Managed node groups, or self-managed nodes with aws-node-termination-handler deployed?
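
If it's the latter, one quick way to check is to look for the handler's DaemonSet (assuming the default name and namespace from its manifests):

# Returns the DaemonSet if aws-node-termination-handler is installed.
kubectl get daemonset -n kube-system aws-node-termination-handler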

sudo-justinwilson (Author) commented

Hi @rtripat. Thanks for your reply. I've isolated this problem to cluster-autoscaler, so I will close this issue and open one in the cluster-autoscaler repository. Cheers.
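
For anyone who lands here later: cluster-autoscaler records its scale-down candidates in a status ConfigMap (cluster-autoscaler-status is its default name), which makes the cordoning visible without log-diving:

# Shows scale-down candidates and recent autoscaler activity.
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml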
