
GCPBATCH preemptible and missing output file leads to retry with persistent filesystem. #7489

Closed
kachulis opened this issue Aug 8, 2024 · 1 comment

Comments

@kachulis
Collaborator

kachulis commented Aug 8, 2024

I am seeing some rather strange behavior when running cromwell w/ GCPBATCH backend.

Essentially, what I'm seeing is that, with the GCPBATCH backend, when a task has preemptible > 0 and specifies an output file that does not exist, then after failing to delocalize that output file, Cromwell reruns the task with the filesystem persisting from the first attempt. As an example, see the attached test_simple.wdl. The task in this WDL does the following:

  1. print "first ls"
  2. ls
  3. create file this_file_should_only_exist_for_second_ls.{current_date_and_time}
  4. print "second ls"
  5. ls

It also specifies an output file, output_file, which is never created, and sets preemptible: 1.
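The steps above correspond to a shell command block roughly like the following. This is a reconstruction from the description, not the actual contents of test_simple.wdl (which is in the attached zip); the timestamp format is an assumption.

```shell
# Rough sketch of the task's command block, reconstructed from the steps above.
echo "first ls"
ls
# The file name embeds the current date and time, so each attempt creates a
# distinct file; a fresh VM should never see one from a previous attempt.
touch "this_file_should_only_exist_for_second_ls.$(date +%Y-%m-%d_%H-%M-%S)"
echo "second ls"
ls
```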

My expectation is that when the task runs, the first ls will print the current directory contents (which will not include anything matching this_file_should_only_exist_for_second_ls.*), the second ls will show the same directory contents plus one this_file_should_only_exist_for_second_ls.{current_date_and_time} file, and the workflow will then fail due to the nonexistent output file. This is indeed the behavior with the PAPIv2 backend, based on the stdout in the execution bucket.

However, if I run with the GCPBATCH backend, the stdout in the execution bucket at first shows the expected behavior, but the workflow keeps running and the stdout soon changes. In the updated stdout, the first ls includes a single file matching this_file_should_only_exist_for_second_ls.* as well as an unexpected rc file, along with two (instead of one) tmp.* files. The second ls then includes a second file matching this_file_should_only_exist_for_second_ls.*, along with the other unexpected files. My interpretation is that the task is being reattempted on the same filesystem, without any cleanup between runs.

I have also found that if the task is not run with preemptible, or if there aren't any missing output files, then the behavior is as expected (even if the task still fails due to a non-zero exit code). There are a few different combinations of settings I tried in test.wdl.

I also see the following stack trace in cromwell, though it's unclear whether it is related:

[2024-08-08 15:29:20,58] [info] WorkflowManagerActor: Workflow 5892e197-e4c8-4d75-a648-e906d5ec80e6 failed (during ExecutingWorkflowState): java.lang.RuntimeException: Task test.test_task_output_does_not_exist:NA:1 failed for unknown reason: Failed
	at cromwell.backend.standard.StandardAsyncExecutionActor.handleExecutionFailure(StandardAsyncExecutionActor.scala:1170)
	at cromwell.backend.standard.StandardAsyncExecutionActor.handleExecutionFailure$(StandardAsyncExecutionActor.scala:1169)
	at cromwell.backend.google.batch.actors.GcpBatchAsyncBackendJobExecutionActor.handleExecutionFailure(GcpBatchAsyncBackendJobExecutionActor.scala:123)
	at cromwell.backend.standard.StandardAsyncExecutionActor.$anonfun$handleExecutionResult$9(StandardAsyncExecutionActor.scala:1435)
	at scala.concurrent.impl.Promise$Transformation.run(Promise.scala:470)
	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

example wdls test_simple.wdl and test.wdl are packaged here: wdls.zip

@dspeck1
Collaborator

dspeck1 commented Aug 20, 2024

Hi @kachulis - this behavior is expected with GCP Batch, as the persistent disk gets remounted to the replacement VM. There is no way to specify a new disk on the Google Cloud side, so if needed, please add logic to your workflow to clear out leftover files from a previous attempt.
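As a rough illustration of that suggestion, a task's command block could start with a guard like the following. This is a hypothetical sketch, assuming the leftover files match the names from the report above; adapt the patterns to whatever your task actually produces.

```shell
# Hypothetical cleanup guard for the top of a task's command block:
# remove leftovers from a previous preempted attempt on the remounted disk.
# (rc is written by the backend at the end of a run, so removing a stale
# copy at the start is safe in this sketch.)
rm -f this_file_should_only_exist_for_second_ls.* rc
rm -rf tmp.*
```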

@dspeck1 dspeck1 closed this as completed Sep 12, 2024