
SelectorSpreadPriority does not spread pods across zones #71327

Closed
Ramyak opened this issue Nov 21, 2018 · 7 comments · Fixed by #72801
Labels: help wanted, kind/bug, sig/scheduling

Comments

@Ramyak (Contributor) commented on Nov 21, 2018

What happened:
/kind bug
What you expected to happen:

  • Problem 1: Predicates filter nodes. Existing pods on the filtered-out nodes are not counted when calculating max pods per zone or per node, resulting in an imbalanced cluster.
    E.g., GeneralPredicates removes nodes that cannot fit this pod. Any pods already scheduled on such a node are not considered when counting max pods in CalculateSpreadPriorityReduce. A zone-score sketch after the reproduction steps below illustrates the effect.

  • Problem 2: When there are 2 selectors (service and replication controller), it is sufficient for a pod to match any one selector to be counted for distribution. This creates imbalance [selector match code]; see the sketch after the example selectors below.
    Pods from previous deploys match the service selector and are counted when distributing pods across zones/nodes, even though they do not match the new ReplicaSet selector. These pods will be deleted during the deploy, so after it completes the cluster is imbalanced, by zone and/or by pods per node.

service selector: {app: sd-status-staging}
rc selector: {app: sd-status-staging, pod-template-hash: 4090075901}
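To make Problem 2 concrete, here is a minimal sketch of the "match any selector" counting, using the `k8s.io/apimachinery/pkg/labels` package that the scheduler snippet quoted later in this thread also uses. The new pod-template-hash (5090075901) is taken from the example further down and is illustrative: a pod left over from the previous deploy still matches the service selector, so it is counted towards spreading even though the new ReplicaSet no longer owns it.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

func main() {
	// Selectors as in the example above; the new pod-template-hash is illustrative.
	serviceSelector := labels.SelectorFromSet(labels.Set{"app": "sd-status-staging"})
	newRCSelector := labels.SelectorFromSet(labels.Set{
		"app":               "sd-status-staging",
		"pod-template-hash": "5090075901",
	})
	selectors := []labels.Selector{serviceSelector, newRCSelector}

	// A pod created by the previous ReplicaSet; it will be deleted during the rollout.
	oldPodLabels := labels.Set{"app": "sd-status-staging", "pod-template-hash": "4090075901"}

	// "Match any selector": the old pod counts because it matches the service selector,
	// even though it does not match the new ReplicaSet selector.
	counted := false
	for _, s := range selectors {
		if s.Matches(oldPodLabels) {
			counted = true
			break
		}
	}
	fmt.Println("old pod counted for spreading:", counted) // true
}
```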

How to reproduce it (as minimally and precisely as possible):

  • Problem 1:
    Get pods scheduled on nodes with very high CPU utilization [existing CPU utilization + 1 new pod results in allocatable.MilliCPU of almost 0].
    GeneralPredicates will drop such a node after the first pod is scheduled on it.
    This leads to over-utilization of an already loaded zone.
    If there are enough nodes like this one, it creates a pile-on effect, with most pods being scheduled in an already loaded zone; see the zone-score sketch after these reproduction steps.

  • Problem 2:
    Have 2 selectors and deploy 30 pods across 3 zones. Twice.

service selector: {app: sd-status-staging}
rc selector: {app: sd-status-staging, pod-template-hash: 4090075901}

Expected distribution by zone: 10, 10, 10.
Actual distribution by zone: 10, 8, 12 (imbalanced cluster).
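Returning to Problem 1, here is a minimal sketch of the zone half of the scoring, assuming the simplified formula score = maxPriority * (maxCount - count) / maxCount (the real reduce function also blends in a per-node score with a zone weighting). The pod counts below are invented purely to illustrate the pile-on effect: if the loaded zone's busy nodes are filtered out by predicates, their pods are no longer counted, so the loaded zone looks nearly empty and receives the best score.

```go
package main

import "fmt"

// zoneScores is a simplified sketch of the zone part of SelectorSpreadPriority:
// score = maxPriority * (maxCount - count) / maxCount, where maxCount is the
// largest per-zone count among the zones considered.
func zoneScores(countsByZone map[string]int) map[string]float64 {
	const maxPriority = 10.0
	maxCount := 0
	for _, c := range countsByZone {
		if c > maxCount {
			maxCount = c
		}
	}
	scores := make(map[string]float64)
	for zone, c := range countsByZone {
		if maxCount == 0 {
			scores[zone] = maxPriority
			continue
		}
		scores[zone] = maxPriority * float64(maxCount-c) / float64(maxCount)
	}
	return scores
}

func main() {
	// Counting matching pods on every node: the loaded zone gets the lowest score.
	fmt.Println(zoneScores(map[string]int{"zone-a": 12, "zone-b": 8, "zone-c": 10}))

	// Counting only pods on nodes that survived the predicates: if zone-a's busy
	// nodes were filtered out, most of its pods disappear from the count, so the
	// already loaded zone-a now looks emptiest and gets the highest score.
	fmt.Println(zoneScores(map[string]int{"zone-a": 2, "zone-b": 8, "zone-c": 10}))
}
```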

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.8", GitCommit:"7eab6a49736cc7b01869a15f9f05dc5b49efb9fc", GitTreeState:"clean", BuildDate:"2018-09-14T15:54:20Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"linux/amd64"}
Happens on master too.

/sig scheduling

/kind bug

@k8s-ci-robot added the kind/bug, needs-sig, and sig/scheduling labels and removed the needs-sig label on Nov 21, 2018
@k82cn (Member) commented on Nov 26, 2018

re Problem 1: per node, we only need to compare the current (filtered) nodes, as they share the same maxPods; per zone, we need to account for other nodes as well, e.g. because of resource predicates.

re Problem 2: I'm a little unclear about your case :(

@Ramyak (Contributor, Author) commented on Nov 26, 2018

Problem 1:

re Problem 1: per node, we only need to compare the current (filtered) nodes, as they share the same maxPods; per zone, we need to account for other nodes as well, e.g. because of resource predicates.

Max pods per node: it does not matter whether the denominator (maxCount) is taken over all nodes or only the current nodes; a node with higher utilization still ends up with a lower score (see the sketch below).
Max pods per zone: it is critical that the denominator accounts for all nodes; otherwise we end up with a highly imbalanced cluster.
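A minimal sketch of the per-node point, assuming the simplified per-node formula score = 10 * (maxCount - count) / maxCount: changing maxCount changes the absolute scores but never the ranking, because the score is strictly decreasing in a node's own pod count. The node names and counts here are illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// ranking orders nodes by the simplified per-node score
// 10 * (maxCount - count) / maxCount, highest score first.
func ranking(counts map[string]int, maxCount int) []string {
	nodes := make([]string, 0, len(counts))
	for n := range counts {
		nodes = append(nodes, n)
	}
	score := func(n string) float64 {
		return 10 * float64(maxCount-counts[n]) / float64(maxCount)
	}
	sort.Slice(nodes, func(i, j int) bool { return score(nodes[i]) > score(nodes[j]) })
	return nodes
}

func main() {
	counts := map[string]int{"node-1": 1, "node-2": 3, "node-3": 5} // illustrative pod counts
	fmt.Println(ranking(counts, 5))  // maxCount over the filtered nodes only
	fmt.Println(ranking(counts, 20)) // maxCount over all nodes: same ordering
}
```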

Problem 2:

re Problem 2: I'm a little unclear about your case :(

The case involves a service selector, a ReplicaSet selector, and a new deployment.

Here is an example list of selectors.

service selector: {app: sd-status-staging}

Old deployment (old pods)
rc selector: {app: sd-status-staging, pod-template-hash: 4090075901}

New deployment (new pods)
rc selector: {app: sd-status-staging, pod-template-hash: 5090075901}

All pods matching any one of the selectors are counted in CalculateSpreadPriorityMap:

	for _, selector := range selectors {
		if selector.Matches(labels.Set(nodePod.ObjectMeta.Labels)) {
			count++
			break
		}
	}

So in our case, old pods are counted during scheduling, but they get killed during the deployment.
If maxSurge and maxUnavailable are high enough, you will end up with an imbalanced zone after the deployment.
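For contrast, here is a hedged sketch of an alternative counting rule, requiring a pod to match every selector before it counts, which would stop the draining pods from the old ReplicaSet from skewing the spread. This is only an illustration of the imbalance described above, not necessarily the approach taken in the fix linked to this issue; the helper name and hash values are made up.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/labels"
)

// matchesAll is a hypothetical counting rule: a pod counts only if it matches
// every selector (service AND the current ReplicaSet).
func matchesAll(selectors []labels.Selector, podLabels labels.Set) bool {
	if len(selectors) == 0 {
		return false
	}
	for _, s := range selectors {
		if !s.Matches(podLabels) {
			return false
		}
	}
	return true
}

func main() {
	service := labels.SelectorFromSet(labels.Set{"app": "sd-status-staging"})
	newRC := labels.SelectorFromSet(labels.Set{
		"app":               "sd-status-staging",
		"pod-template-hash": "5090075901",
	})
	selectors := []labels.Selector{service, newRC}

	oldPod := labels.Set{"app": "sd-status-staging", "pod-template-hash": "4090075901"}
	newPod := labels.Set{"app": "sd-status-staging", "pod-template-hash": "5090075901"}

	fmt.Println("old pod counted:", matchesAll(selectors, oldPod)) // false: no longer skews the spread
	fmt.Println("new pod counted:", matchesAll(selectors, newPod)) // true
}
```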

@Ramyak (Contributor, Author) commented on Nov 27, 2018

Gentle ping.

@Ramyak (Contributor, Author) commented on Nov 30, 2018

/assign bsalamat

@flyingcougar commented
@Ramyak
Re: Problem 1 - You can also reproduce this by using pod anti-affinity. The zone score is calculated based only on the nodes that pass the predicates; IMO the zone score should be calculated using all nodes.

@bsalamat added the help wanted label on Dec 7, 2018
@Ramyak (Contributor, Author) commented on Jan 10, 2019

I will try to separate this into two PRs. Any other suggestions welcome.

@Ramyak (Contributor, Author) commented on Jan 15, 2019

I am working on Problem 1. Looks like I will have to open a new issue.
https://stackoverflow.com/questions/21333654/how-to-re-open-an-issue-in-github
