
DaemonSet controller actively kills failed pods (to recreate them) #40330

Merged — 4 commits merged into kubernetes:master on Jan 27, 2017

Conversation

@janetkuo (Member) commented Jan 23, 2017

Ref #36482, @erictune @yujuhong @mikedanese @Kargakis @lukaszo @piosz @kubernetes/sig-apps-bugs

This also helps with DaemonSet updates.

@janetkuo janetkuo added area/workload-api/daemonset release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. labels Jan 23, 2017
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 23, 2017
@k8s-reviewable

This change is Reviewable

@k8s-github-robot k8s-github-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 23, 2017
case shouldContinueRunning && len(daemonPods) > 1:
case shouldContinueRunning:
// If a daemon pod failed, delete it
// TODO: handle the case when the daemon pods fail consistently and causes kill-recreate hot loop
Contributor

How often does the controller sync?

Contributor

Shouldn't be that often. @janetkuo we probably want to cap at a maximum # of retries and then drop daemon sets out of the queue so we won't end up hotlooping.

@janetkuo (Member, Author) commented Jan 24, 2017

How about returning errors (at the end) whenever there's a failed daemon pod? We use the rate limiter when syncHandler returns an error. This can prevent the hotloop.
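For context, the mechanism being proposed here is the controller's standard work-queue requeue path: a sync error sends the key back through a rate limiter, so repeated failures back off instead of hot-looping. A minimal sketch, assuming the usual client-go work-queue layout (the controller struct and field names below are illustrative, not the PR's actual code):

```go
package daemon

import (
	"fmt"

	"k8s.io/client-go/util/workqueue"
)

// controller stands in for the DaemonSet controller's queue-handling loop.
type controller struct {
	queue       workqueue.RateLimitingInterface
	syncHandler func(key string) error
}

// processNextWorkItem pops one DaemonSet key and syncs it. Returning an
// error from syncHandler re-enqueues the key through the rate limiter, so a
// DaemonSet whose pods keep failing is retried with growing backoff instead
// of being deleted and recreated in a tight loop.
func (c *controller) processNextWorkItem() bool {
	key, quit := c.queue.Get()
	if quit {
		return false
	}
	defer c.queue.Done(key)

	if err := c.syncHandler(key.(string)); err != nil {
		fmt.Printf("error syncing daemon set %v: %v\n", key, err)
		c.queue.AddRateLimited(key) // backoff grows with each consecutive failure
		return true
	}
	c.queue.Forget(key) // success: reset this key's backoff history
	return true
}
```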

@spxtr (Contributor) commented Jan 24, 2017

Why not add or update a unit test for the new behavior?

@0xmichalis (Contributor)

> Why not add or update a unit test for the new behavior?

There is a new extended test added

@0xmichalis (Contributor)

Just a comment about the hotloop, lgtm otherwise.

@spxtr (Contributor) commented Jan 24, 2017

> There is a new extended test added

There is an e2e test, but I would much rather see a unit test.

@janetkuo janetkuo force-pushed the kill-failed-daemon-pods branch 2 times, most recently from f599875 to 33cf0c9 Compare January 24, 2017 22:50
@k8s-github-robot k8s-github-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 24, 2017
@janetkuo (Member, Author)

Added a unit test and addressed the hotloop issue; PTAL @spxtr @Kargakis

@spxtr (Contributor) commented Jan 25, 2017

Thanks. It looks reasonable overall, but I don't have much context. I'll let @Kargakis LGTM.

@@ -547,6 +563,10 @@ func (dsc *DaemonSetsController) manage(ds *extensions.DaemonSet) error {
for err := range errCh {
errors = append(errors, err)
}
if failedPodsObserved > 0 {
Contributor

I am not sure I understand this - why do you need to return the error here? Won't the daemon set be resynced because of the deleted pod event anyway?

Contributor

Ah you want to use the ratelimiter - ok. Although for perma-failed daemon sets we probably want to stop retrying them after a while.

@janetkuo (Member, Author)

We don't support perma-failed daemon sets yet. Normally the DaemonSet controller checks whether the daemon pod can be scheduled on the node before creating it, so it's unlikely to create pods that are doomed to fail. However, there is occasionally a race where the kubelet uses its own (possibly stale) node object to admit pods and then rejects them, so the pods become Failed.

Let's deal with this in a follow up PR?

@janetkuo (Member, Author) commented Jan 25, 2017

Moved the comment to before the if statement to make it clearer
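For readers following the hunk above: observing failed pods is folded into the aggregated error that manage() returns, which is what makes the queue's rate limiter kick in. Below is a self-contained sketch of that pattern; the function name and signature are hypothetical, since in the PR the logic lives inside manage() and its per-node loop.

```go
package daemon

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	utilerrors "k8s.io/apimachinery/pkg/util/errors"
)

// failedDaemonPodsToDelete marks failed daemon pods on a node for deletion so
// they can be recreated on the next sync, and reports that any were observed
// as an error so the caller requeues the DaemonSet with rate limiting.
func failedDaemonPodsToDelete(nodeName string, daemonPods []*v1.Pod, syncErrs []error) ([]string, error) {
	var podsToDelete []string
	failedPodsObserved := 0
	for _, pod := range daemonPods {
		if pod.Status.Phase == v1.PodFailed {
			failedPodsObserved++
			podsToDelete = append(podsToDelete, pod.Name)
		}
	}

	errs := append([]error{}, syncErrs...)
	if failedPodsObserved > 0 {
		errs = append(errs, fmt.Errorf("deleted %d failed daemon pods on node %q", failedPodsObserved, nodeName))
	}
	// NewAggregate returns nil for an empty list, so the happy path stays error-free.
	return podsToDelete, utilerrors.NewAggregate(errs)
}
```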

@@ -653,6 +661,31 @@ func TestObservedGeneration(t *testing.T) {
}
}

// DaemonSet controller should kill all failed pods and recreate at most 1 failed pod.
Contributor

"at most 1 pod on every node"

@janetkuo (Member, Author)

Fixed
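Since this thread turns on what the unit test documents, here is a rough sketch of the shape such a test could take. The helpers (newTestController, addNodes, addFailedPods, expectSyncDaemonSets) are hypothetical stand-ins for whatever fixtures daemon_controller_test.go provides, not their actual names or signatures.

```go
// Hypothetical test sketch: two failed daemon pods on one node should be
// deleted, and at most one replacement pod created on that node per sync.
func TestDaemonKillFailedPods(t *testing.T) {
	manager, podControl := newTestController()
	addNodes(manager, 1) // a single schedulable node

	ds := newDaemonSet("foo")
	manager.dsStore.Add(ds)
	addFailedPods(manager, "node-0", ds, 2) // two failed daemon pods on that node

	// Expect one create (the replacement) and two deletes (the failed pods).
	expectSyncDaemonSets(t, manager, ds, podControl, 1, 2)
}
```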

@janetkuo (Member, Author)

@Kargakis ptal

@mikedanese (Member)

/approve

@janetkuo janetkuo added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 27, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

The following people have approved this PR: mikedanese

Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @fejta
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@0xmichalis (Contributor)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2017
@k8s-github-robot

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 62c8022 into kubernetes:master Jan 27, 2017
k8s-github-robot pushed a commit that referenced this pull request Feb 2, 2017
Automatic merge from submit-queue (batch tested with PRs 40556, 40720)

Emit events on 'Failed' daemon pods

Follow up #40330 @erictune @mikedanese @Kargakis @lukaszo @kubernetes/sig-apps-bugs
@jethrogb commented Jun 6, 2019

Is there a reason to do this in the controller instead of just letting kubelet do it?
