Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry with double memory [BA-5933] #5180

Merged
merged 24 commits into from
Sep 27, 2019
Merged

Retry with double memory [BA-5933] #5180

merged 24 commits into from
Sep 27, 2019

Conversation

salonishah11
Copy link
Contributor

@salonishah11 salonishah11 commented Sep 16, 2019

This PR adds a config option for user defined retries. With memory-retry the user can specify an array of strings which when encountered in the stderr file by Cromwell, allows the task to be retried with a factor also mentioned in the config.

JIRA issue link.

Copy link
Contributor

@kshakir kshakir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent PoC. I'm happy to 👍 if the PoC needs to go out asap, but in general I think the concept of "DoubleMemory" could be replaced with a more generic "MoreMemory".

CHANGELOG.md Outdated Show resolved Hide resolved

jobKey.memoryDoubleMultiplier match {
case 1 => runtimeAttributes
case multiplier: Int => doubleRuntimeMemory(multiplier)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToL: I haven't been following the ticket too closely, so this is mostly thinking for future PRs. A hardcoded factor of 2.0 is fine for now, but I could see someone wanting eventually wanting a multiplier of 1.1 or even 10.0.

@@ -26,7 +26,7 @@ import scala.util.Try
/**
* For uniquely identifying a job which has been or will be sent to the backend.
*/
case class BackendJobDescriptorKey(call: CommandCallNode, index: Option[Int], attempt: Int) extends CallKey {
case class BackendJobDescriptorKey(call: CommandCallNode, index: Option[Int], attempt: Int, memoryDoubleMultiplier: Int = 1) extends CallKey {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps call the field memoryMultiplier? While here, I'm also a fan of Double multipliers vs. Int.

@@ -32,6 +32,7 @@ trait JobPaths {

def workflowPaths: WorkflowPaths
def returnCodeFilename: String = "rc"
def doubleMemoryRetryRCFilename: String = "double_memory_retry_rc"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming we remove the hardcoding of the multiplier, how about val memoryRetryRcFilename = "memory_retry_rc"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building on that idea, memory_retry_rc_filename would make it more clear it's a filename we're looking at and not a code.

@@ -99,8 +99,8 @@ trait AsyncBackendJobExecutionActor { this: Actor with ActorLogging with SlowJob
case Finish(FailedNonRetryableExecutionHandle(throwable, returnCode)) =>
completionPromise.success(JobFailedNonRetryableResponse(jobDescriptor.key, throwable, returnCode))
context.stop(self)
case Finish(FailedRetryableExecutionHandle(throwable, returnCode)) =>
completionPromise.success(JobFailedRetryableResponse(jobDescriptor.key, throwable, returnCode))
case Finish(FailedRetryableExecutionHandle(throwable, returnCode, retryWithDoubleMemory)) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

retryWithMoreMemory?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps another producty question for @ruchim. The JIRA ticket title says "more" but the description says "double".

pushFailedCallMetadata(jobKey, returnCode, reason, retryableFailure = true)

val newJobKey = jobKey.copy(attempt = jobKey.attempt + 1)
val currentMemoryMultiplier = jobKey.memoryDoubleMultiplier
val newMemoryMultiplier = if (retryWithDoubleMemory) currentMemoryMultiplier * 2 else currentMemoryMultiplier
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This magic number 2.0 could be another knob config/option.

include "papi_provider_config.inc.conf"
genomics.compute-service-account = "centaur@broad-dsde-cromwell-dev.iam.gserviceaccount.com"
filesystems.http {}
retry-with-double-memory = ["OutOfMemoryError"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToL: Future PR..?

retry-memory {
  error-messages = ["OutOfMemoryError"]
  multiplier = 2.0
}

@@ -447,7 +447,8 @@ class PipelinesApiAsyncBackendJobExecutionActor(override val standardParams: Sta
privateDockerKeyAndEncryptedToken = dockerKeyAndToken,
womOutputRuntimeExtractor = jobDescriptor.workflowDescriptor.outputRuntimeExtractor,
adjustedSizeDisks = adjustedSizeDisks,
virtualPrivateCloudConfiguration = jesAttributes.virtualPrivateCloudConfiguration
virtualPrivateCloudConfiguration = jesAttributes.virtualPrivateCloudConfiguration,
retryWithDoubleMemoryKeys = jesAttributes.retryWithDoubleMemoryKeys
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pro Tip: If you leave a trailing comma here and on the lines below, you can pay-it-forward so future devs won't get extra git blame for just adding a , to each list.

https://docs.scala-lang.org/sips/completed/trailing-commas.html

@@ -141,6 +141,11 @@ object ActionCommands {
|fi""".stripMargin
}

def checkIfStderrContainsRetryKeys(retryLookupKeys: List[String]): String = {
val lookupKeysAsString = retryLookupKeys.mkString("|")
s"grep -E -q '$lookupKeysAsString' /cromwell_root/stderr ; echo $$? > /cromwell_root/double_memory_retry_rc"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToL: Nice... nothing Papi here except rapid-prototyping.

@salonishah11
Copy link
Contributor Author

@kshakir I had thought of making the multiplier 2 as a config option. But since this ticket had evolved to be a PoC, I kept it constant at 2. If we decide to not rush this PR, than I am all in for 'MoreMemory' as compared to 'DoubleMemory'.

Copy link
Contributor

@mcovarr mcovarr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comments for now with a couple of Ruchi tags for clarification 😉

@@ -32,6 +32,7 @@ trait JobPaths {

def workflowPaths: WorkflowPaths
def returnCodeFilename: String = "rc"
def doubleMemoryRetryRCFilename: String = "double_memory_retry_rc"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building on that idea, memory_retry_rc_filename would make it more clear it's a filename we're looking at and not a code.

@@ -72,6 +73,7 @@ trait JobPaths {
lazy val script = callExecutionRoot.resolve(scriptFilename)
lazy val dockerCid = callExecutionRoot.resolve(dockerCidFilename)
lazy val returnCode = callExecutionRoot.resolve(returnCodeFilename)
lazy val doubleMemoryRetryRC = callExecutionRoot.resolve(doubleMemoryRetryRCFilename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume if Khalid's comments above regarding "more" vs "double" are acted upon then this should change too.

@@ -51,4 +54,42 @@ class ActionBuilderSpec extends FlatSpec with Matchers with TableDrivenPropertyC
}
}

def grepForRetryKeysCommand(lookupString: String) =
s"grep -E -q '$lookupString' /cromwell_root/stderr ; echo $$? > /cromwell_root/double_memory_retry_rc"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this should be in one place in the production code with visibility such that the test code can see it

CHANGELOG.md Outdated Show resolved Hide resolved
@@ -99,8 +99,8 @@ trait AsyncBackendJobExecutionActor { this: Actor with ActorLogging with SlowJob
case Finish(FailedNonRetryableExecutionHandle(throwable, returnCode)) =>
completionPromise.success(JobFailedNonRetryableResponse(jobDescriptor.key, throwable, returnCode))
context.stop(self)
case Finish(FailedRetryableExecutionHandle(throwable, returnCode)) =>
completionPromise.success(JobFailedRetryableResponse(jobDescriptor.key, throwable, returnCode))
case Finish(FailedRetryableExecutionHandle(throwable, returnCode, retryWithDoubleMemory)) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps another producty question for @ruchim. The JIRA ticket title says "more" but the description says "double".

/*
`CheckingForRetryWithDoubleMemory` action in Papiv2 exits with code 0 if the stderr file contains keys
mentioned in `retry-with-double-memory` config.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is generic code, not specific to Papiv2.

mockJobKeyWithMemoryMultiplier4.node returns call
mockJobKeyWithMemoryMultiplier4.index returns None
mockJobKeyWithMemoryMultiplier4.attempt returns 3
mockJobKeyWithMemoryMultiplier4.memoryDoubleMultiplier returns 4
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm a "double multiplier" with different values is a bit confusing

@salonishah11
Copy link
Contributor Author

@kshakir @mcovarr I have changed the multiplier factor to be configurable and made things more generic instead of ...doubleMemory.... It's now ready for re-review!

@@ -49,13 +50,16 @@ object PipelinesApiConfigurationAttributes {

final case class VirtualPrivateCloudConfiguration(name: String, subnetwork: Option[String], auth: GoogleAuthMode)
final case class BatchRequestTimeoutConfiguration(readTimeoutMillis: Option[Int Refined Positive], connectTimeoutMillis: Option[Int Refined Positive])
final case class MemoryRetryConfiguration(errorKeys: List[String], multiplier: Double)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be propagated as multiplier: Double Refined Positive, such that the multiplier type is always positive throughout the code?

Copy link
Contributor Author

@salonishah11 salonishah11 Sep 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I somehow missed this comment. I will take a look. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the multiplier at this point, which is when Cromwell is started and Papi backend is initialized, is validated that it is a positive double, but not used. It is used in the StandardAsyncExecutionActor when a task has failed and the retryWithMoreMemory condition is true. I am not revalidating again because if it was not a valid double, Cromwell would have thrown an error when it was started and validated in the Papi backend. Does that make sense? I can remove the multiplier field from MemoryRetryConfiguration altogether to make it easy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can also revalidate that the multiplier is a positive double again at StandardAsyncExecutionActor, but if at this point it is not a valid double, we can switch the multiplier value to default (which is 2) or capture the error and fail the task without retrying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After talking and coding f2f, the code has been updated to use Double Refined Positive type for multiplier all the way through.

Copy link
Contributor

@kshakir kshakir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry if I missed the reply to the open question above? If you point me at the reply and hit ♻️ and I'll re-review.

docs/backends/Google.md Outdated Show resolved Hide resolved
@@ -141,6 +143,8 @@ class PipelinesApiAsyncBackendJobExecutionActor(override val standardParams: Sta
case _ => false
}

override lazy val memoryRetryFactor: Option[Double Refined Positive] = initializationData.papiConfiguration.papiAttributes.memoryRetryConfiguration.map(_.multiplier)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TOL hmm really this should be > 1.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In strict sense, yes. But I am not sure if there is any harm to have a memory factor > 0 but less than < 1. Something to discuss during tech talk?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with a memoryRetryFactor being GreaterEqual[] vs. Positive.

Also ToL for tech talk: I'm curious if this would be easier to follow, though it's about as verbose when used.

type MemoryRetryFactor = GreaterEqual[W.`1.0`.T]

Or is convention to alias the refined type?

type MemoryRetryFactor = Refined[Double, GreaterEqual[W.`1.0`.T]]

Or some mix of the two?

}
}
```
this tells Cromwell to retry the task with 1.1x memory when it sees either `OutOfMemoryError` or `Killed` in the `stderr` file. If the task has
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing to be careful about is Killed might also mean preempted. So it may be better to check explicitly that the call wasn't preempted, and increase memory only when it wasn't.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is certainly a good point. Since this PR has been developed as a POC, we are trying to keep it simple. But I will note this down for when we start adding more features to it. Thank you!

@salonishah11 salonishah11 merged commit e956df7 into develop Sep 27, 2019
@salonishah11 salonishah11 deleted the ss_double_memory branch September 27, 2019 14:07
salonishah11 added a commit that referenced this pull request Sep 27, 2019
@dinvlad
Copy link
Contributor

dinvlad commented Nov 10, 2019

Unfortunately, this does not yet work in practice. I opened https://broadworkbench.atlassian.net/browse/BA-6112 re the cause.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants