
GCPBATCH preemptible and missing output file leads to retry with persistent filesystem. #7489

Closed
kachulis opened this issue Aug 8, 2024 · 1 comment

Comments

@kachulis
Collaborator

kachulis commented Aug 8, 2024

I am seeing some rather strange behavior when running cromwell w/ GCPBATCH backend.

Essentially, what I'm seeing is that, with the GCPBATCH backend, when a task has preemptible > 0 and specifies an output file that does not exist, then after failing to delocalize that output file, Cromwell reruns the task with the filesystem persisting from the first attempt. As an example, see the attached test_simple.wdl. The task in this WDL does the following:

  1. print "first ls"
  2. ls
  3. create file this_file_should_only_exist_for_second_ls.{current_date_and_time}
  4. print "second ls"
  5. ls

It also specifies an output file, output_file, which is never created, and sets preemptible: 1.
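The steps above correspond to a shell command block roughly like the following. This is a reconstruction from the description, not the actual contents of test_simple.wdl (which is in the attached zip); the timestamp format is an assumption.

```shell
# Rough sketch of the task's command block, reconstructed from the steps above.
echo "first ls"
ls
# The file name embeds the current date and time, so each attempt creates a
# distinct file; a fresh VM should never see one from a previous attempt.
touch "this_file_should_only_exist_for_second_ls.$(date +%Y-%m-%d_%H-%M-%S)"
echo "second ls"
ls
```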

My expectation is that when the task runs, the first ls will print the current directory contents (which will not include anything matching this_file_should_only_exist_for_second_ls.*), the second ls will show the same directory contents plus one this_file_should_only_exist_for_second_ls.{current_date_and_time} file, and the workflow will then fail due to the nonexistent output file. This is indeed the behavior with the PAPIv2 backend, based on the stdout in the execution bucket.

However, if I run with the GCPBATCH backend, the stdout in the execution bucket at first shows the expected behavior, but the workflow keeps running and the stdout soon changes. In the updated stdout, the first ls includes a single file matching this_file_should_only_exist_for_second_ls.* as well as an unexpected rc file, along with two (instead of one) tmp.* files. The second ls then includes a second file matching this_file_should_only_exist_for_second_ls.*, along with the other unexpected files. My interpretation is that the task is being reattempted on the same filesystem, without any cleanup between runs.

I have also found that if the task is not run with preemptible, or if there aren't any missing output files, then the behavior is as expected (even if the task still fails due to a non-zero exit code). There are a few different combinations of settings I tried in test.wdl.

I also see the following stack trace in cromwell, though it's unclear whether it is related:

[2024-08-08 15:29:20,58] [info] WorkflowManagerActor: Workflow 5892e197-e4c8-4d75-a648-e906d5ec80e6 failed (during ExecutingWorkflowState): java.lang.RuntimeException: Task test.test_task_output_does_not_exist:NA:1 failed for unknown reason: Failed
	at cromwell.backend.standard.StandardAsyncExecutionActor.handleExecutionFailure(StandardAsyncExecutionActor.scala:1170)
	at cromwell.backend.standard.StandardAsyncExecutionActor.handleExecutionFailure$(StandardAsyncExecutionActor.scala:1169)
	at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.handleExecutionFailure(GcpBatchAsyncBackendJobExecutionActor.scala:123)
	at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$handleExecutionResult$9(StandardAsyncExecutionActor.scala:1435)
	at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:470)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

example wdls test_simple.wdl and test.wdl are packaged here: wdls.zip

@dspeck1
Collaborator

dspeck1 commented Aug 20, 2024

Hi @kachulis - this behavior is expected with GCP Batch, as the persistent disk gets remounted to the replacement VM. There is no way to specify a new disk on the Google Cloud side, so if needed, please add logic to your workflow to clear out leftover files from a previous attempt.
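As a rough illustration of that suggestion, a task's command block could start with a guard like the following. This is a hypothetical sketch, assuming the leftover files match the names from the report above; adapt the patterns to whatever your task actually produces.

```shell
# Hypothetical cleanup guard for the top of a task's command block:
# remove leftovers from a previous preempted attempt on the remounted disk.
# (rc is written by the backend at the end of a run, so removing a stale
# copy at the start is safe in this sketch.)
rm -f this_file_should_only_exist_for_second_ls.* rc
rm -rf tmp.*
```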

@dspeck1 dspeck1 closed this as completed Sep 12, 2024