Jobs failing when a node is preempted #999

Closed
matthen opened this issue May 15, 2019 · 21 comments

@matthen

matthen commented May 15, 2019

On Google Kubernetes Engine, I am finding that TFJobs fail when a node running a worker is preempted.

I have set restartPolicy: OnFailure for the workers, evaluator and chief. The tf-operator deployment is in a node pool with nodes that cannot be preempted.
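For context, the relevant part of my TFJob spec looks roughly like this (trimmed for brevity; the image, replica count and GPU count below are placeholders, and the Chief and Evaluator replica specs set the same restartPolicy):

apiVersion: kubeflow.org/v1beta2
kind: TFJob
metadata:
  name: myjob
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 7                   # placeholder count
      restartPolicy: OnFailure      # same policy on Chief and Evaluator
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/my-project/trainer:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1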
It looks like some of the pods were restarted around the time of the preemption, but eventually the job was stopped with the following status:

  Message:               TFJob myjob has failed because 1 Worker replica(s) failed.
    Reason:                TFJobFailed
    Status:                True
    Type:                  Failed
  Replica Statuses:
    Chief:
    Evaluator:
      Active:  1
    PS:
      Active:  4
    Worker:
      Active:  6
      Failed:  1

Is there something that needs to be done to make TFJobs handle preempted nodes?

@issue-label-bot

Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.82. Please mark this comment with 👍 or 👎 to give our bot feedback!


@matthen
Author

matthen commented May 22, 2019

I'd appreciate any help with this. I saw an error on a worker that was preempted, saying the node did not have enough nvidia.com/gpu. So it sounds like the GPU was taken away from the instance, causing the worker to fail. And even though the worker has restartPolicy: OnFailure, its failure caused the whole TFJob to fail.

Note that when there is a failure in the code itself, e.g. sys.exit(1), the workers are correctly restarted.

@johnugeorge
Member

Can you provide logs of worker and controller? Is this similar to #366?

@richardsliu @gaocegege

@matthen
Author

matthen commented May 22, 2019

Any tips on how to recreate this, so I can monitor everything as it happens? If I delete the VM instance, then everything works correctly. The workers on that node stop, a new node is scheduled, and they start back up again. Could there be something different that happens in a real preemption? (Are the GPUs sometimes detached from preemptible instances while the instances themselves keep running?)

@matthen
Author

matthen commented May 22, 2019

Aha, it just happened.

On the dashboard I see worker-0 has failed with the message "Pod Node didn't have enough resource: nvidia.com/gpu, requested: 1, used: 0, capacity: 0"

Doing kubectl describe for the worker pod, I see:

Warning  OutOfnvidia.com/gpu  7m    kubelet, gke-train-model-gpu-pool-a83bc04b-nq3l  Node didn't have enough resource: nvidia.com/gpu, requested: 1, used: 0, capacity: 0
Status:             Failed
Reason:             OutOfnvidia.com/gpu
Message:            Pod Node didn't have enough resource: nvidia.com/gpu, requested: 1, used: 0, capacity: 0

The worker's logs do not have any error message, but are just cut off abruptly.

The chief has the error message that happens when one of the other workers or parameter servers goes down:

An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed

Doing kubectl describe tfjob I get:

Message:               TFJob job has failed because 1 Worker replica(s) failed.
    Reason:                TFJobFailed
    Status:                True
    Type:                  Failed

Here are the relevant logs from the tf-job-operator at the time of the failure:

{"filename":"tensorflow/controller.go:340","job":"default.job","level":"info","msg":"Reconcile TFJobs job","time":"2019-05-22T12:44:14Z","uid":"2a560be3-7c89-11e9-99d3-42010aa4004a"}
{"filename":"k8sutil/k8sutil.go:101","level":"info","msg":"Ignoring inactive pod default/job-worker-0 in state Failed, deletion time \u003cnil\u003e","time":"2019-05-22T12:44:14Z"}
{"filename":"tensorflow/status.go:57","job":"default.job","level":"info","msg":"TFJob=job, ReplicaType=PS expected=4, running=4, failed=0","time":"2019-05-22T12:44:14Z","uid":"2a560be3-7c89-11e9-99d3-42010aa4004a"}
{"filename":"tensorflow/status.go:57","job":"default.job","level":"info","msg":"TFJob=job, ReplicaType=Worker expected=12, running=11, failed=1","time":"2019-05-22T12:44:14Z","uid":"2a560be3-7c89-11e9-99d3-42010aa4004a"}
{"filename":"tensorflow/status.go:57","job":"default.job","level":"info","msg":"TFJob=job, ReplicaType=Chief expected=1, running=1, failed=0","time":"2019-05-22T12:44:14Z","uid":"2a560be3-7c89-11e9-99d3-42010aa4004a"}
{"filename":"tensorflow/status.go:57","job":"default.job","level":"info","msg":"TFJob=job, ReplicaType=Evaluator expected=1, running=1, failed=0","time":"2019-05-22T12:44:14Z","uid":"2a560be3-7c89-11e9-99d3-42010aa4004a"}
{"filename":"record/event.go:221","level":"info","msg":"Event(v1.ObjectReference{Kind:\"TFJob\", Namespace:\"default\", Name:\"job\", UID:\"2a560be3-7c89-11e9-99d3-42010aa4004a\", APIVersion:\"kubeflow.org/v1beta2\", ResourceVersion:\"3541041\", FieldPath:\"\"}): type: 'Normal' reason: 'TFJobFailed' TFJob job has failed because 1 Worker replica(s) failed.","time":"2019-05-22T12:44:14Z"}
{"filename":"tensorflow/job.go:120","level":"info","msg":"Updating tfjob: job","time":"2019-05-22T12:44:14Z"}
{"filename":"tensorflow/controller.go:284","job":"default.job","level":"info","msg":"Finished syncing tfjob \"default/job\" (16.277576ms)","time":"2019-05-22T12:44:14Z"}

Thanks for any help.

@matthen
Author

matthen commented May 23, 2019

I wonder if this was the individual GPU running out of memory and failing (I was trying to push its limits). Previously I had seen an error message about running out of memory, but I didn't see one here.

@gaocegege
Member

@matthen If there are not enough resources, the pod cannot be scheduled and may fail. But we do not show the real reason in the TFJob status, since we think it can be seen on the pods. I am not sure whether we need to surface that information in the TFJob; there are so many different problems that can cause a pod to fail.

@cheyang
Contributor

cheyang commented Jun 28, 2019

@matthen If there are not enough resources, the pod cannot be scheduled and may fail. But we do not show the real reason in the TFJob status, since we think it can be seen on the pods. I am not sure whether we need to surface that information in the TFJob; there are so many different problems that can cause a pod to fail.

The pod is killed and disappears, so I think it would be useful to keep a preempted status in the TFJob.

@gaocegege
Member

@cheyang

I am not sure how to implement it. Should we aggregate the statuses of all PS/workers into the TFJob status?

@chardch

chardch commented Dec 2, 2019

@matthen When a GKE GPU node is preempted, it is recreated. However, if the node is recreated shortly after being preempted (before the pods have been evicted, which takes ~5 min), then upon node startup the preexisting pods will still be running on the node, potentially before all system pods have finished setting up.
Thus there is a possibility that the GPU pods run on the node before the nvidia-driver-installer daemonset has installed the NVIDIA driver that makes the GPU devices available. Here is the relevant issue: kubernetes/kubernetes#64632, and the relevant PRs: kubernetes/kubernetes#64784, kubernetes/kubernetes#77699

@srinjay-paul

Has anyone found a solution/workaround? The issue persists in GKE version 1.16.0, which is supposed to include the commits from the PRs mentioned above by @chardch.

@jtfogarty

/area engprod
/priority p2

@srinjay-paul

Is there any progress regarding this issue? Without a fix, distributed training using Kubeflow and preemptible GPU nodes is impossible. Doesn't Kubeflow claim to support both?

@matthen
Author

matthen commented Feb 8, 2020

+1. We have had to switch to non-preemptible nodes for TFJobs to work.

@stale

stale bot commented Jul 31, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed Aug 8, 2020
@matthen
Author

matthen commented Sep 30, 2020

Are there any fixes for this, or can we not use Kubeflow with preemptible GPUs? @jtfogarty @jbottum @jlewi

ChanYiLin reopened this Oct 7, 2020
stale bot removed the lifecycle/stale label Oct 7, 2020
@ChanYiLin
Member

ChanYiLin commented Oct 7, 2020

I think I have fixed this issue while refactoring the whole project:
https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/status.go#L170
If a pod is restarting, the job will not be marked as failed.
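Roughly, the idea is the following. This is a simplified paraphrase for illustration only, not the actual Go code linked above, and the function and argument names here are made up:

# Simplified, hypothetical paraphrase of the controller's decision; the real
# implementation is the Go code in status.go linked above.
def job_condition_after_worker_exit(num_failed: int, restart_policy: str) -> str:
    """Decide which condition the operator records when worker pods exit."""
    if num_failed == 0:
        return "Running"
    # If the replica's restart policy restarts the pod (e.g. OnFailure or
    # ExitCode), a failed worker only moves the job to Restarting.
    if restart_policy in ("OnFailure", "ExitCode"):
        return "Restarting"
    # Only when the pod will not be restarted is the whole job marked Failed.
    return "Failed"

print(job_condition_after_worker_exit(1, "OnFailure"))  # Restarting
print(job_condition_after_worker_exit(1, "Never"))      # Failed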

@gaocegege @Jeffwan Is there a way to let people use the latest tf-operator image?

@matthen
Author

matthen commented Oct 7, 2020

(I ended up switching to regular k8s services + jobs, and adding logic in the workers + parameter servers themselves to make sure they exit successfully at the end of training)
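For anyone curious, the worker-side wrapper is roughly this. It is a simplified sketch, and train() is just a placeholder for the actual training loop:

import sys
import traceback

def train() -> None:
    # placeholder for the actual distributed training loop
    pass

def main() -> int:
    try:
        train()
    except Exception:
        traceback.print_exc()
        return 1   # non-zero exit: the k8s Job restarts the pod
    return 0       # clean exit: the Job counts this worker as completed

if __name__ == "__main__":
    sys.exit(main())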

@Jeffwan
Member

Jeffwan commented Oct 8, 2020

@ChanYiLin Release infra and test-infra are currently blocked. We have to wait for a while or do a manual release.

@stale

stale bot commented Jan 10, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed Jan 17, 2021
@liubing0427

Is there any progress on this issue?
