
DeletionTimeStamp not set for some evicted pods #54525

Open
patrickshan opened this issue Oct 24, 2017 · 40 comments

Labels: help wanted, kind/bug, lifecycle/frozen, needs-triage, priority/important-longterm, sig/apps, sig/node

Comments

@patrickshan (Contributor) commented Oct 24, 2017

Is this a BUG REPORT or FEATURE REQUEST?:

/kind bug

What happened:
When a node starts to evict pods under disk pressure, the DeletionTimestamp for some of the evicted pods is not set and keeps its zero value. Pods created through a Deployment seem to hit this issue, while pods created through a DaemonSet have their DeletionTimestamp set properly.

What you expected to happen:
Pods created through a Deployment should also have their DeletionTimestamp set properly.

How to reproduce it (as minimally and precisely as possible):
Write an app that watches the apiserver for pod-related events. Deploy a Debian toolbox pod on one node using a Deployment. Put that node under disk pressure, e.g. use more than 90% of the disk space, then consume more disk space from inside the toolbox pod (you can install packages that use a lot of disk space, such as gnome-core on Debian).
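For reference, a minimal sketch of such a watcher using client-go (the kubeconfig path is illustrative, and the call signatures match the pre-context client-go versions quoted later in this thread):

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // Build a client from a local kubeconfig (path is illustrative).
    config, err := clientcmd.BuildConfigFromFlags("", "/root/.kube/config")
    if err != nil {
        panic(err)
    }
    clientset := kubernetes.NewForConfigOrDie(config)

    // Watch pods in all namespaces and log phase and DeletionTimestamp on every event.
    w, err := clientset.CoreV1().Pods(metav1.NamespaceAll).Watch(metav1.ListOptions{})
    if err != nil {
        panic(err)
    }
    for event := range w.ResultChan() {
        pod, ok := event.Object.(*v1.Pod)
        if !ok {
            continue
        }
        fmt.Printf("%s %s/%s phase=%s deletionTimestamp=%v\n",
            event.Type, pod.Namespace, pod.Name, pod.Status.Phase, pod.DeletionTimestamp)
    }
}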

Anything else we need to know?:
You can find some events related to the pod, and they only show the Phase updated to "Failed" without the DeletionTimestamp being set; it still has its zero value.

Environment:

  • Kubernetes version (use kubectl version): 1.8.1
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Container Linux by CoreOS 1520.4.0
  • Kernel (e.g. uname -a): Linux ip-10-150-64-105 4.13.3-coreos #1 SMP Wed Sep 20 22:17:11 UTC 2017 x86_64 Intel(R) Xeon(R) CPU E5-2676 v3 @ 2.40GHz GenuineIntel GNU/Linux
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 24, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 24, 2017
@patrickshan (Contributor, Author)

@kubernetes/sig-api-machinery-misc

@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Oct 25, 2017
@k8s-ci-robot (Contributor)

@patrickshan: Reiterating the mentions to trigger a notification:
@kubernetes/sig-api-machinery-misc

In response to this:

@kubernetes/sig-api-machinery-misc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Oct 25, 2017
@k82cn (Member) commented Oct 25, 2017

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Oct 25, 2017
@dixudx (Member) commented Oct 25, 2017

/assign

@lavalamp (Member)

Most likely this is sig-node.

@lavalamp lavalamp removed the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label Oct 26, 2017
@smarterclayton (Contributor)

The deletion timestamp is set by the apiserver; it cannot be set by clients.

@liggitt (Member) commented Oct 26, 2017

Are they actually evicted/deleted or do they just have failed status?

@patrickshan (Contributor, Author) commented Oct 27, 2017

They are evicted and then marked with "Failed" status. For pods created through a DaemonSet, the DeletionTimeStamp gets set after the "Failed" status is set. But for pods created through a Deployment, the DeletionTimeStamp just keeps its zero value and never gets updated.

@smarterclayton (Contributor) commented Oct 27, 2017 via email

@dixudx (Member) commented Oct 27, 2017

That means that no one deleted them. The ReplicaSet controller is responsible for performing that deletion.

I reproduced this, but I found the pods were successfully evicted and deleted only from the kubelet, not the apiserver. The apiserver still kept a copy, with nil DeletionTimeStamp and ContainerStatuses. @liggitt Do you know why? Quite abnormal.

@liggitt (Member) commented Oct 27, 2017

I see the kubelet sync loop construct a pod status like what you describe if an internal module decides the pod should be evicted:

func (kl *Kubelet) generateAPIPodStatus(pod *v1.Pod, podStatus *kubecontainer.PodStatus) v1.PodStatus {
    glog.V(3).Infof("Generating status for %q", format.Pod(pod))
    // check if an internal module has requested the pod is evicted.
    for _, podSyncHandler := range kl.PodSyncHandlers {
        if result := podSyncHandler.ShouldEvict(pod); result.Evict {
            return v1.PodStatus{
                Phase:   v1.PodFailed,
                Reason:  result.Reason,
                Message: result.Message,
            }
        }
    }

The kubelet then syncs status back to the API server:

func (m *manager) syncPod(uid types.UID, status versionedPodStatus) {
    if !m.needsUpdate(uid, status) {
        glog.V(1).Infof("Status for pod %q is up-to-date; skipping", uid)
        return
    }
    // TODO: make me easier to express from client code
    pod, err := m.kubeClient.CoreV1().Pods(status.podNamespace).Get(status.podName, metav1.GetOptions{})
    if errors.IsNotFound(err) {
        glog.V(3).Infof("Pod %q (%s) does not exist on the server", status.podName, uid)
        // If the Pod is deleted the status will be cleared in
        // RemoveOrphanedStatuses, so we just ignore the update here.
        return
    }
    if err != nil {
        glog.Warningf("Failed to get status for pod %q: %v", format.PodDesc(status.podName, status.podNamespace, uid), err)
        return
    }
    translatedUID := m.podManager.TranslatePodUID(pod.UID)
    // Type convert original uid just for the purpose of comparison.
    if len(translatedUID) > 0 && translatedUID != kubetypes.ResolvedPodUID(uid) {
        glog.V(2).Infof("Pod %q was deleted and then recreated, skipping status update; old UID %q, new UID %q", format.Pod(pod), uid, translatedUID)
        m.deletePodStatus(uid)
        return
    }
    pod.Status = status.status
    // TODO: handle conflict as a retry, make that easier too.
    newPod, err := m.kubeClient.CoreV1().Pods(pod.Namespace).UpdateStatus(pod)
    if err != nil {
        glog.Warningf("Failed to update status for pod %q: %v", format.Pod(pod), err)
        return
    }
    pod = newPod
    glog.V(3).Infof("Status for pod %q updated successfully: (%d, %+v)", format.Pod(pod), status.version, status.status)
    m.apiStatusVersions[kubetypes.MirrorPodUID(pod.UID)] = status.version
    // We don't handle graceful deletion of mirror pods.
    if m.canBeDeleted(pod, status.status) {
        deleteOptions := metav1.NewDeleteOptions(0)
        // Use the pod UID as the precondition for deletion to prevent deleting a newly created pod with the same name and namespace.
        deleteOptions.Preconditions = metav1.NewUIDPreconditions(string(pod.UID))
        err = m.kubeClient.CoreV1().Pods(pod.Namespace).Delete(pod.Name, deleteOptions)
        if err != nil {
            glog.Warningf("Failed to delete status for pod %q: %v", format.Pod(pod), err)
            return
        }
        glog.V(3).Infof("Pod %q fully terminated and removed from etcd", format.Pod(pod))
        m.deletePodStatus(uid)
    }
}

But unless the pod's deletion timestamp is already set, the kubelet won't delete the pod:

func (m *manager) canBeDeleted(pod *v1.Pod, status v1.PodStatus) bool {
    if pod.DeletionTimestamp == nil || kubepod.IsMirrorPod(pod) {
        return false
    }
    return m.podDeletionSafety.PodResourcesAreReclaimed(pod, status)
}

@kubernetes/sig-node-bugs that doesn't seem like the kubelet does a complete job of evicting the pod from the API's perspective. Would you expect the kubelet to delete a pod directly in that case or to still go through posting a pod eviction (should pod disruption budget be honored in cases where the kubelet is out of resources?)

@yujuhong (Contributor)

@kubernetes/sig-node-bugs that doesn't seem like the kubelet does a complete job of evicting the pod from the API's perspective. Would you expect the kubelet to delete a pod directly in that case or to still go through posting a pod eviction (should pod disruption budget be honored in cases where the kubelet is out of resources?)

I think this is intentional.

AFAIK, the kubelet's pod eviction includes failing the pod (i.e., setting the pod status) and reclaiming the resources used by the pod on the node. There is no "deleting the pod from the apiserver" involved in the eviction. Users/controllers can check the pod status to know what happened to the pod if needed.

/cc @derekwaynecarr @dashpole

@dashpole (Contributor)

Yes, this is intentional. In order for evicted pods to be inspected after eviction, we do not remove the pod API object. Otherwise it would appear that the pod simply disappeared.
We do still stop and remove all containers, clean up cgroups, unmount volumes, etc. to ensure that we reclaim all resources that were in use by the pod.
I don't think we set the deletion timestamp for DaemonSet pods. I suspect that the DaemonSet controller deletes evicted pods.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 25, 2018
@janetkuo (Member) commented Feb 22, 2018

In order for evicted pods to be inspected after eviction, we do not remove the pod API object.

If the controller that creates the evicted pod is scaled down, it should kill those evicted pods first before killing any others, right? Most workload controllers don't do that today.

I dont think we set the deletion timestamp for daemon set pods. I suspect that the daemon set controller deletes evicted pods.

DaemonSet controller actively deletes failed pods (#40330), to ensure that DaemonSet can recover from transient errors (#36482). Evicted DaemonSet pods get killed just because they're also failed pods.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 22, 2018
@enisoc (Member) commented Feb 22, 2018

In order for evicted pods to be inspected after eviction, we do not remove the pod API object.

If the controller that creates the evicted pod is scaled down, it should kill those evicted pods first before killing any others, right? Most workload controllers don't do that today.

For something like StatefulSet, it's actually necessary to immediately delete any Pods evicted by kubelet, so the Pod name can be reused. As @janetkuo also mentioned, DaemonSet does this as well. For such controllers, you're thus not gaining anything from kubelet leaving the Pod record.

Even for something like ReplicaSet, it probably makes the most sense for the controller to delete Pods evicted by kubelet (though it doesn't do that now, see #60162) to avoid carrying along Failed Pods indefinitely.

So I would argue that in pretty much all cases, Pods with restartPolicy: Always that go to Failed should be expediently deleted by some controller, so users can't expect such Pods to stick around.

If we can agree that some controller should delete them, the only question left is which controller? I suggest that the Node controller makes the most sense: delete any Failed Pods with restartPolicy: Always that are scheduled to me. Otherwise, we effectively shift the responsibility to "all Pod/workload controllers that exist or ever will exist." Given the explosion of custom controllers that's coming thanks to CRD, I don't think it's prudent to put that responsibility on every controller author.

Otherwise it would appear that the pod simply disappeared

With the /eviction subresource and Node drains, we have already set the precedent that your Pods might simply disappear (if the eviction succeeds, the Pod is deleted from the API server) at any time, without a trace.
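To make the proposal concrete, here is a rough sketch of the kind of cleanup being suggested (not existing Kubernetes code, just an illustration): list Failed pods bound to a node and delete those with restartPolicy: Always, using the same pre-context client-go call style as the kubelet code quoted above.

import (
    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// cleanupFailedPods is a hypothetical helper, not part of any controller today.
// It deletes Failed pods on the given node whose restartPolicy is Always
// (which covers pods evicted by the kubelet under resource pressure).
func cleanupFailedPods(client kubernetes.Interface, nodeName string) error {
    pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{
        FieldSelector: "spec.nodeName=" + nodeName + ",status.phase=Failed",
    })
    if err != nil {
        return err
    }
    for i := range pods.Items {
        pod := &pods.Items[i]
        if pod.Spec.RestartPolicy != v1.RestartPolicyAlways {
            continue
        }
        // Precondition on the UID so a recreated pod with the same name is never deleted.
        opts := metav1.NewDeleteOptions(0)
        opts.Preconditions = metav1.NewUIDPreconditions(string(pod.UID))
        if err := client.CoreV1().Pods(pod.Namespace).Delete(pod.Name, opts); err != nil {
            return err
        }
    }
    return nil
}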

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 30, 2019
@liggitt liggitt added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Nov 28, 2019
@schollii commented Mar 5, 2020

@dashpole:

@dashpole (Contributor) commented Mar 5, 2020

@schollii I suppose we don't document the list of things that do not happen during evictions. The out-of-resource documentation says "it terminates all of its containers and transitions its PodPhase to Failed". It doesn't explicitly call out that it does not set the deletion timestamp.

Some googling says you can reference evicted pods with --field-selector=status.phase=Failed. You should be able to list, delete, etc. with that.

@schollii commented Mar 6, 2020

@dashpole I saw the mentions of --field-selector=status.phase=Failed, but the problem there is that it's actually the "reason" field that says "Evicted", so there could be failed pods that were not evicted. And you cannot select on status.reason; I tried. So we are left with grepping and awking the output of get pods -o wide. This needs fixing: e.g. make status.reason selectable, or add a phase called Evicted (although I doubt this is acceptable because it's not backwards compatible), or just add a command like kubectl delete pods --evicted-only. If this can be fixed by a newbie to the k8s code base, I'd be happy to do it.
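Since status.reason is not a supported field selector, a client-side filter is the practical workaround today. A minimal sketch in Go (illustrative helper name; client-go imports shown; pre-context call signatures as in the code quoted earlier in the thread):

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// deleteEvictedPods is an illustrative helper: list Failed pods server-side,
// then filter on status.reason == "Evicted" in the client before deleting.
func deleteEvictedPods(client kubernetes.Interface, namespace string) error {
    pods, err := client.CoreV1().Pods(namespace).List(metav1.ListOptions{
        FieldSelector: "status.phase=Failed",
    })
    if err != nil {
        return err
    }
    for i := range pods.Items {
        pod := pods.Items[i]
        if pod.Status.Reason != "Evicted" {
            continue
        }
        if err := client.CoreV1().Pods(pod.Namespace).Delete(pod.Name, &metav1.DeleteOptions{}); err != nil {
            return err
        }
    }
    return nil
}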

@barney-s (Contributor)

Should we be explicit in setting the deletion timestamp?

@lavalamp (Member)

grepping and awking the output of get pods -o wide

Use jq and -o json for stuff like this.

There's a "podgc" controller which deletes old pods, is it not triggering for evicted pods? How many do you accumulate? Why is it problematic?

should we be explicit in setting the deletion timestamp ?

I am not sure what the contract between kubelet / scheduler / controller is for evictions. Which entity is supposed to delete the pod? I assume they are not deleted by kubelet to give signal to scheduler/controller about the lack of fit?

@andyxning (Member)

Should Deployment check and delete Failed pods, as is done in StatefulSet and DaemonSet?

@andyxning (Member)

Or should pod GC come in and cover this for other resources besides StatefulSet and DaemonSet?

@andyxning (Member)

Just for anyone who is also interested in how failed pod deletion is done in the StatefulSet controller:

for i := range replicas {
    // delete and recreate failed pods
    if isFailed(replicas[i]) {
        ssc.recorder.Eventf(set, v1.EventTypeWarning, "RecreatingFailedPod",
            "StatefulSet %s/%s is recreating failed Pod %s",
            set.Namespace,
            set.Name,
            replicas[i].Name)
        if err := ssc.podControl.DeleteStatefulPod(set, replicas[i]); err != nil {
            return &status, err
        }

@matthyx (Contributor) commented Jun 25, 2021

/triage accepted
/help
/priority important-longterm

@k8s-ci-robot (Contributor)

@matthyx:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/triage accepted
/help
/priority important-longterm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jun 25, 2021
@bboreham (Contributor)

According to StackOverflow:

the evicted pods will hang around until the number of them reaches the terminated-pod-gc-threshold limit (it's an option of kube-controller-manager and is equal to 12500 by default)

@k8s-triage-robot

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. and removed triage/accepted Indicates an issue or PR is ready to be actively worked on. labels Feb 8, 2023
@k8s-ci-robot (Contributor)

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
