Memory Retry not working, Cromwell 87 #7451

Open
GregoryDougherty opened this issue Jun 14, 2024 · 1 comment
@GregoryDougherty

We cannot get memory retry to work, and we have not found a complete example anywhere showing it working, including what should go in the .conf file. If such an example exists, please point us to it.

Command:
nohup java -Dconfig.file=My.conf -jar cromwell-87-5448b85-SNAP-pre-edits.jar run ~/MemoryRetryTest.wdl 2>&1 > nohup.out
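For what it is worth, the Cromwell documentation describes memory_retry_multiplier as a workflow option rather than a configuration key, so one variant we plan to try is passing an options file on the command line. A minimal sketch, where the file name memory_retry_options.json is just our own illustration:

    nohup java -Dconfig.file=My.conf -jar cromwell-87-5448b85-SNAP-pre-edits.jar \
        run ~/MemoryRetryTest.wdl --options ~/memory_retry_options.json 2>&1 > nohup.out

memory_retry_options.json:

    {
      "memory_retry_multiplier": 4.0
    }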

MemoryRetryTest.wdl:
workflow MemoryRetryTest {
  String message = "Killed"

  call TestOutOfMemoryRetry {}
  call TestBadCommandRetry {}
}

task TestOutOfMemoryRetry {
  command <<<
    free -h
    df -h
    cat /proc/cpuinfo

    echo "Killed" >&2
    tail /dev/zero
  >>>

  runtime {
    cpu: "1"
    memory: "1 GB"
    maxRetries: 4
    continueOnReturnCode: 0
  }
}

task TestBadCommandRetry {
  command <<<
    free -h
    df -h
    cat /proc/cpuinfo

    echo "Killed" >&2
    bedtools intersect nothing with nothing
  >>>

  runtime {
    cpu: "1"
    memory: "1 GB"
    maxRetries: 4
    continueOnReturnCode: 0
  }
}

My.conf:

include required(classpath("application"))

system {
  memory-retry-error-keys = ["OutOfMemory", "Killed", "Error:"]
}

backend {
  default = PAPIv2

  providers {
    PAPIv2 {
      actor-factory = "cromwell.backend.google.pipelines.v2beta.PipelinesApiLifecycleActorFactory"

      system {
        memory-retry-error-keys = ["OutOfMemory", "Killed", "Error:"]
      }

      config {
        project = "$my_project"
        root = "$my_bucket"
        name-for-call-caching-purposes: PAPI
        slow-job-warning-time: 24 hours
        genomics-api-queries-per-100-seconds = 1000
        maximum-polling-interval = 600

        # Setup GCP to give more memory with each retry
        system {
          memory-retry-error-keys = ["OutOfMemory", "Killed", "Error:"]
        }
        system.memory-retry-error-keys = ["OutOfMemory", "Killed", "Error:"]
        memory_retry_multiplier = 4

        # Number of workers to assign to PAPI requests
        request-workers = 3

        virtual-private-cloud {
          network-label-key = "network-key"
          network-name = "network-name"
          subnetwork-name = "subnetwork-name"
          auth = "auth"
        }

        pipeline-timeout = 7 days

        genomics {
          auth = "auth"
          compute-service-account = "$my_account"
          endpoint-url = "https://lifesciences.googleapis.com/"
          location = "us-central1"
          restrict-metadata-access = false
          localization-attempts = 3
          parallel-composite-upload-threshold = "150M"
        }

        filesystems {
          gcs {
            auth = "auth"
            project = "$my_project"
            caching {
              duplication-strategy = "copy"
            }
          }
        }

        system {
          memory-retry-error-keys = ["OutOfMemory", "Killed", "Error:"]
        }

        runtime {
          cpuPlatform: "Intel Cascade Lake"
        }

        default-runtime-attributes {
          cpu: 1
          failOnStderr: false
          continueOnReturnCode: 0
          memory: "2048 MB"
          bootDiskSizeGb: 10
          disks: "local-disk 375 SSD"
          noAddress: true
          preemptible: 1
          maxRetries: 3
          system.memory-retry-error-keys = ["OutOfMemory", "Killed", "Error:"]
          memory_retry_multiplier = 4
          zones: ["us-central1-a", "us-central1-b"]
        }

        include "papi_v2_reference_image_manifest.conf"
      }
    }
  }
}
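For comparison, here is a stripped-down sketch of where we believe these settings are supposed to live, based on our reading of the Cromwell "Retry with More Memory" docs: the error keys only in the top-level system block, and the multiplier supplied per workflow through the options file shown above rather than anywhere in the .conf. This is untested on our side, so treat it as an assumption:

    include required(classpath("application"))

    system {
      # Substrings matched against a task's stderr to decide whether to retry with more memory
      memory-retry-error-keys = ["OutOfMemory", "Killed"]
    }

    backend {
      default = PAPIv2
      providers {
        PAPIv2 {
          actor-factory = "cromwell.backend.google.pipelines.v2beta.PipelinesApiLifecycleActorFactory"
          config {
            project = "$my_project"
            root = "$my_bucket"
            # ... remaining PAPI settings unchanged, with no memory retry keys in here ...
          }
        }
      }
    }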

gsutil ls gs://cromwell-executions/MemoryRetryTest/d54a5a39-4d3b-4ac7-9bb1-97043d761b56/call-TestOutOfMemoryRetry
TestOutOfMemoryRetry.log
gcs_delocalization.sh
gcs_localization.sh
gcs_transfer.sh
rc
script
stderr
stdout
pipelines-logs

stderr:
Killed
/cromwell_root/script: line 32: 17 Killed tail /dev/zero

rc:
137
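For context, exit code 137 is 128 + 9, i.e. the process received SIGKILL, which matches the "Killed" line in stderr above and is exactly the out-of-memory situation we expected memory retry to pick up on. Quick shell check of that arithmetic (illustrative only):

    echo $((137 - 128))   # prints 9
    kill -l 9             # prints KILL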

@sicotteh

I would like to add my support to Greg's question.

The memory_retry_multiplier config option would be an extremely useful feature for genomic workflows with varying data sizes.

If it is working on GCP, could you please document its use better? Or let us know if it is an abandoned feature. Or, even better, send us working examples. :)

Thanks for all the work you do.
