Jobs failing when a node is preempted #999

Closed
matthen opened this issue May 15, 2019 · 21 comments

@matthen

matthen commented May 15, 2019

On Google Kubernetes Engine, I am finding that TFJobs fail when a node running a worker is preempted.

I have set restartPolicy: OnFailure for the workers, evaluator and chief. The tf-operator deployment is in a node pool with nodes that cannot be preempted.
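For context, the relevant part of my TFJob spec looks roughly like this (trimmed for brevity; the image, replica count and GPU count below are placeholders, and the Chief and Evaluator replica specs set the same restartPolicy):

apiVersion: kubeflow.org/v1beta2
kind: TFJob
metadata:
  name: myjob
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 7                   # placeholder count
      restartPolicy: OnFailure      # same policy on Chief and Evaluator
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/my-project/trainer:latest   # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1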
It looks like some of the pods were restarted around the time of the preemption, but eventually the job was stopped with the following status:

  Message:               TFJob myjob has failed because 1 Worker replica(s) failed.
    Reason:                TFJobFailed
    Status:                True
    Type:                  Failed
  Replica Statuses:
    Chief:
    Evaluator:
      Active:  1
    PS:
      Active:  4
    Worker:
      Active:  6
      Failed:  1

Is there something that needs to be done to make TFJobs handle preempted nodes?

@issue-label-bot

Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.82. Please mark this comment with 👍 or 👎 to give our bot feedback!


@matthen
Author

matthen commented May 22, 2019

I'd appreciate any help with this. I saw an error on a worker that was preempted, saying the node did not have enough nvidia.com/gpu. So it sounds like the GPU was taken away from the instance, causing the worker to fail. And even though the worker has restartPolicy: OnFailure, its failure caused the whole TFJob to fail.

Note that when there is a failure in the code itself, e.g. sys.exit(1), the workers are correctly restarted.

@johnugeorge
Member

Can you provide logs of worker and controller? Is this similar to #366?

@richardsliu @gaocegege

@matthen
Author

matthen commented May 22, 2019

Any tips on how to recreate this, so I can monitor everything as it happens? If I delete the VM instance, then everything works correctly. The workers on that node stop, a new node is scheduled, and they start back up again. Could there be something different that happens in a real preemption? (Are the GPUs sometimes detached from preemptible instances while the instances themselves keep running?)

@matthen
Author

matthen commented May 22, 2019

Aha, it just happened.

On the dashboard I see worker-0 has failed with the message "Pod Node didn't have enough resource: nvidia.com/gpu, requested: 1, used: 0, capacity: 0"

Doing kubectl describe for the worker pod, I see:

Warning  OutOfnvidia.com/gpu  7m    kubelet, gke-train-model-gpu-pool-a83bc04b-nq3l  Node didn't have enough resource: nvidia.com/gpu, requested: 1, used: 0, capacity: 0
Status:             Failed
Reason:             OutOfnvidia.com/gpu
Message:            Pod Node didn't have enough resource: nvidia.com/gpu, requested: 1, used: 0, capacity: 0

The worker's logs do not have any error message, but are just cut off abruptly.

The chief has the error message that happens when one of the other workers or parameter servers goes down:

An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed

Doing kubectl describe tfjob I get:

Message:               TFJob job has failed because 1 Worker replica(s) failed.
    Reason:                TFJobFailed
    Status:                True
    Type:                  Failed

Here are the relevant logs from the tf-job-operator at the time of the failure:

{"filename":"tensorflow/controller.go:340","job":"default.job","level":"info","msg":"Reconcile TFJobs job","time":"2019-05-22T12:44:14Z","uid":"2a560be3-7c89-11e9-99d3-42010aa4004a"}
{"filename":"k8sutil/k8sutil.go:101","level":"info","msg":"Ignoring inactive pod default/job-worker-0 in state Failed, deletion time \u003cnil\u003e","time":"2019-05-22T12:44:14Z"}
{"filename":"tensorflow/status.go:57","job":"default.job","level":"info","msg":"TFJob=job, ReplicaType=PS expected=4, running=4, failed=0","time":"2019-05-22T12:44:14Z","uid":"2a560be3-7c89-11e9-99d3-42010aa4004a"}
{"filename":"tensorflow/status.go:57","job":"default.job","level":"info","msg":"TFJob=job, ReplicaType=Worker expected=12, running=11, failed=1","time":"2019-05-22T12:44:14Z","uid":"2a560be3-7c89-11e9-99d3-42010aa4004a"}
{"filename":"tensorflow/status.go:57","job":"default.job","level":"info","msg":"TFJob=job, ReplicaType=Chief expected=1, running=1, failed=0","time":"2019-05-22T12:44:14Z","uid":"2a560be3-7c89-11e9-99d3-42010aa4004a"}
{"filename":"tensorflow/status.go:57","job":"default.job","level":"info","msg":"TFJob=job, ReplicaType=Evaluator expected=1, running=1, failed=0","time":"2019-05-22T12:44:14Z","uid":"2a560be3-7c89-11e9-99d3-42010aa4004a"}
{"filename":"record/event.go:221","level":"info","msg":"Event(v1.ObjectReference{Kind:\"TFJob\", Namespace:\"default\", Name:\"job\", UID:\"2a560be3-7c89-11e9-99d3-42010aa4004a\", APIVersion:\"kubeflow.org/v1beta2\", ResourceVersion:\"3541041\", FieldPath:\"\"}): type: 'Normal' reason: 'TFJobFailed' TFJob job has failed because 1 Worker replica(s) failed.","time":"2019-05-22T12:44:14Z"}
{"filename":"tensorflow/job.go:120","level":"info","msg":"Updating tfjob: job","time":"2019-05-22T12:44:14Z"}
{"filename":"tensorflow/controller.go:284","job":"default.job","level":"info","msg":"Finished syncing tfjob \"default/job\" (16.277576ms)","time":"2019-05-22T12:44:14Z"}

Thanks for any help.

@matthen
Author

matthen commented May 23, 2019

I wonder if this was the individual GPU running out of memory and failing (I was trying to push its limits). Previously I had seen an error message about running out of memory, but I didn't see one here.

@gaocegege
Member

@matthen If there are not enough resources, the pod cannot be scheduled and may fail. But we do not show the real reason in the TFJob status, since we think it can be seen on the pods. I am not sure whether we need to surface that information in the TFJob; there are so many different problems that can cause a pod to fail.

@cheyang
Contributor

cheyang commented Jun 28, 2019

@matthen If there are not enough resources, the pod cannot be scheduled and may fail. But we do not show the real reason in the TFJob status, since we think it can be seen on the pods. I am not sure whether we need to surface that information in the TFJob; there are so many different problems that can cause a pod to fail.

The pod is killed and disappears, so I think it would be useful to keep a preempted status in the TFJob.

@gaocegege
Member

@cheyang

I am not sure how to implement it. Should we aggregate the statuses of all PS/workers into the TFJob status?

@chardch

chardch commented Dec 2, 2019

@matthen When a GKE GPU node is preempted, it is recreated. However, if the node is recreated shortly after being preempted (before the pods have been evicted, which takes ~5 min), then upon node startup the preexisting pods will still be running on the node, potentially before all system pods have finished setting up.
Thus there is a possibility that the GPU pods run on the node before the nvidia-driver-installer daemonset has installed the NVIDIA driver that makes the GPU devices available. Here is the relevant issue: kubernetes/kubernetes#64632, and the relevant PRs: kubernetes/kubernetes#64784, kubernetes/kubernetes#77699

@srinjay-paul

Has anyone found a solution/workaround? The issue persists in GKE version 1.16.0, which is supposed to include the commits from the PRs mentioned above by @chardch.

@jtfogarty

/area engprod
/priority p2

@srinjay-paul

Is there any progress regarding this issue? Without a fix, distributed training using Kubeflow and preemptible GPU nodes is impossible. Doesn't Kubeflow claim to support both?

@matthen
Author

matthen commented Feb 8, 2020

+1. We have had to switch to non-preemptible nodes for TFJobs to work.

@stale

stale bot commented Jul 31, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed Aug 8, 2020
@matthen
Author

matthen commented Sep 30, 2020

Are there any fixes for this, or can we not use Kubeflow with preemptible GPUs? @jtfogarty @jbottum @jlewi

ChanYiLin reopened this Oct 7, 2020
stale bot removed the lifecycle/stale label Oct 7, 2020
@ChanYiLin
Member

ChanYiLin commented Oct 7, 2020

I think I have fixed this issue while refactoring the whole project:
https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/status.go#L170
If a pod is restarting, the job will not be marked as failed.
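Roughly, the idea is the following. This is a simplified paraphrase for illustration only, not the actual Go code linked above, and the function and argument names here are made up:

# Simplified, hypothetical paraphrase of the controller's decision; the real
# implementation is the Go code in status.go linked above.
def job_condition_after_worker_exit(num_failed: int, restart_policy: str) -> str:
    """Decide which condition the operator records when worker pods exit."""
    if num_failed == 0:
        return "Running"
    # If the replica's restart policy restarts the pod (e.g. OnFailure or
    # ExitCode), a failed worker only moves the job to Restarting.
    if restart_policy in ("OnFailure", "ExitCode"):
        return "Restarting"
    # Only when the pod will not be restarted is the whole job marked Failed.
    return "Failed"

print(job_condition_after_worker_exit(1, "OnFailure"))  # Restarting
print(job_condition_after_worker_exit(1, "Never"))      # Failed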

@gaocegege @Jeffwan Is there a way to let people use the latest tf-operator image?

@matthen
Author

matthen commented Oct 7, 2020

(I ended up switching to regular k8s services + jobs, and adding logic in the workers + parameter servers themselves to make sure they exit successfully at the end of training)
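For anyone curious, the worker-side wrapper is roughly this. It is a simplified sketch, and train() is just a placeholder for the actual training loop:

import sys
import traceback

def train() -> None:
    # placeholder for the actual distributed training loop
    pass

def main() -> int:
    try:
        train()
    except Exception:
        traceback.print_exc()
        return 1   # non-zero exit: the k8s Job restarts the pod
    return 0       # clean exit: the Job counts this worker as completed

if __name__ == "__main__":
    sys.exit(main())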

@Jeffwan
Member

Jeffwan commented Oct 8, 2020

@ChanYiLin Release infra and test-infra are currently blocked. We have to wait for a while or do a manual release.

@stale

stale bot commented Jan 10, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed Jan 17, 2021
@liubing0427

Is there any progress on this issue?
