User-defined retries #1991

Open
kshakir opened this issue Feb 16, 2017 · 35 comments

Comments

@kshakir
Contributor

kshakir commented Feb 16, 2017

Briefest of discussions with Jose. NOTE: all naming is still up in the air.

Enable a runtime attribute such as retryOnStderrPattern that populates a counter value named retryAttempt/retry/retryCount/retry_count/etc. This would enable tasks such as:

task mytask {
  command {
     mycommand.sh
  }
  runtime {
    retryOnStderrPattern = "(OutOfMemoryError|disk quota exceeded)"
    memory = (6 * retryAttempt) + "GB"
    disk = "local-disk " + (100 * retryAttempt) + " SSD"
    docker = "myrepo/myimage"
  }
}

When the stderr contains the specified regular expression pattern, the job should be retried with the counter incremented.

Not discussed, as far as I know: how to limit the number of attempts. Another runtime attribute, a backend config value, both, or something else?
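As a sketch of the runtime-attribute option (purely hypothetical; maxRetries below is a placeholder name, just like retryOnStderrPattern and retryAttempt above, and none of these exist today):

task mytask {
  command {
     mycommand.sh
  }
  runtime {
    # Hypothetical attributes: retry when stderr matches the pattern,
    # but give up after 3 retries.
    retryOnStderrPattern = "(OutOfMemoryError|disk quota exceeded)"
    maxRetries = 3
    memory = (6 * retryAttempt) + "GB"
    docker = "myrepo/myimage"
  }
}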

@katevoss

Adding @vdauwera's comment about adding error codes to GATK from DSDE-docs #1742:

We may be able to put in error codes for things like this in GATK4. Should ask David Roazen or Louis Bergelson.

@kcibul
Contributor

kcibul commented Feb 16, 2017 via email

@katevoss katevoss added this to the Q1 - Cromwell User Joy milestone Feb 21, 2017
@katevoss katevoss added the Retry label Mar 13, 2017
@katevoss katevoss changed the title Enable retries based on a standard error pattern User-defined retries Mar 21, 2017
@katevoss

This is de-prioritized for the time being.

@katevoss katevoss removed this from the Q1 - Cromwell User Joy milestone Mar 23, 2017
@droazen

droazen commented May 24, 2017

Could I propose that this be re-prioritized? It would help us deal with transient GCS hiccups in production (e.g., connections suddenly getting closed). Individual tools in the GATK and Picard can't possibly catch every exception across every library involved, so an execution-framework-level retry at the job level would help enormously.

@lbergelson
Member

This would be extremely useful for us. We're currently having to deal with several problems that would be helped by an automated retry ability.

The first problem is what David said: we have tools that can sometimes fail due to GCS issues, and being able to restart when that happens would be useful. We're working on making our code more robust to that, but it's difficult to fix the problem completely. Having to restart a workflow with tens of thousands of jobs because 2 failed is pretty annoying.

The second problem is out of memory issues. We have thousands of jobs, and most will run with a small amount of memory, but some of them will need more. It's difficult to predict ahead of time which shards will need more since it's a function of the data rather than of the file size. Having a way to automatically retry these shards with increased memory would be really valuable since it would let us provision for the average shard rather than the worst case.
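Concretely, using the hypothetical retryAttempt counter from the proposal at the top (none of these attributes exist yet; this is only a sketch of the desired behavior), a shard could be provisioned for the average case and escalated only when it actually runs out of memory:

task scatter_shard {
  command {
    run_shard.sh
  }
  runtime {
    # Hypothetical: retry only on an actual OOM, starting at 4 GB and
    # adding 4 GB per retry (assuming retryAttempt starts at 0).
    retryOnStderrPattern = "OutOfMemoryError"
    memory = (4 + 4 * retryAttempt) + "GB"
    docker = "myrepo/myimage"
  }
}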

@LeeTL1220

@katevoss This is actually really important, not just for @droazen and @lbergelson ... This issue has cost the Broad $$$ and analysts a lot of time. Not just the people on this issue. And putting retry code into the GATK (or any task for that matter) is a bit arduous and actually a more expensive solution, especially when some random code path is missed.

Also, retry on memory should do a lot for us to be able to reduce costs.

@kcibul
Contributor

kcibul commented Jul 20, 2017

+1 on this feature (or one like it) -- it's really helpful for writing robust and cheap workflows

@geoffjentry
Contributor

FWIW this has been discussed as a key feature for a WDL push next quarter

@gsaksena

gsaksena commented Aug 29, 2017

Also, note that Google PDs can be expanded on the fly in seconds, even while the VM is still running under load. I've done this manually on non-FC VMs via the script below. Using this approach combined with a disk-space monitoring process (and a size cap!) would allow the job to pass the first time, avoiding a retry. And if it also ran while the algorithm was executing, not just during data download, this could eradicate both disk-space errors and disk over-provisioning.

https://github.com/broadinstitute/firecloud_developer_toolkit/blob/master/gce/expand_disk.sh

Unfortunately I don't know of a way to hot-swap RAM into the VM.

@katevoss

From #1449, make sure "custom retries can increase memory, disk, AND bootDiskInGb".

@katevoss

katevoss commented Sep 7, 2017

From #1729, this is also important to @ktibbett and the Emerald Empire.

@katevoss

katevoss commented Sep 7, 2017

As a workflow runner, I want Cromwell to automatically retry my workflow with increased memory or disk, or on a specific error code, etc., so that I can get my workflow to complete without having to manually intervene.

  • Effort: ? @geoffjentry
  • Risk: Medium
    • if users are unaware that they have retries configured in ways that make later attempts much more expensive (e.g. doubling memory on the second or third run), they could end up paying for a much more expensive VM when a smaller one would do (see the sketch after this list)
  • Business value: Large
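One way to contain that risk, still using the same hypothetical attributes sketched earlier in this thread, is to cap the escalated value inside the expression, so a runaway retry can never request more than a known ceiling:

task bounded_retry {
  command {
    run_shard.sh
  }
  runtime {
    # Hypothetical: add 8 GB per retry, but never request more than 32 GB.
    retryOnStderrPattern = "OutOfMemoryError"
    memory = (if 8 + 8 * retryAttempt > 32 then 32 else 8 + 8 * retryAttempt) + "GB"
    docker = "myrepo/myimage"
  }
}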

@droazen

droazen commented Nov 2, 2017

Just wanted to check in to inquire about the status of this ticket. Has it been added to any milestones yet?

Although we've packed Google's GCS library full of retries, at considerable expense of developer time and effort, there are still blips in production where something like a host name lookup will randomly fail, as @jsotobroad can attest. Since it's difficult/impossible to bake in retries for every network operation in every library involved, as discussed above, this ticket is our best hope of dealing with these annoying little blips once and for all.

@katevoss

@droazen I wish I had better news for you but unfortunately this has not made it into our priorities. In early December we will have more flexibility to take on additional features, and I will note this one as a possible item.

@geoffjentry do you have an estimate for the effort it would take? Could it be a User Driven Development project if a team member is so motivated?

@LeeTL1220

LeeTL1220 commented Nov 14, 2017 via email

@snovod

snovod commented Nov 14, 2017

Just wanted to add that this causes Ops LOTS of pain. If this could be addressed it would be a huge help.

@geoffjentry
Contributor

If the implementation of this involves WDL it's no longer an "us" thing. There are other possible ways to implement this, but to date I've only seen/heard WDL-based ones (and it seems like the right path to do it that way)

@droazen

droazen commented Nov 14, 2017

@katevoss Ah, that's too bad. As @snovod and @LeeTL1220 mention above, this is a massive pain point for both Ops and Methods right now, and wastes a lot of our time, so if it could somehow get prioritized in the near future we'd be grateful!

@katevoss

In that case, I shall bring this to the attention of Commodore WDL ( @cjllanwarne )!

@geoffjentry
Contributor

I'd encourage interested parties (e.g. @droazen) to directly interface w/ OpenWDL

@eitanbanks

eitanbanks commented Nov 14, 2017

What does that mean? How do we get you guys to prioritize work on it?
(This is directed at Jeff's comment that we should interface elsewhere)

@geoffjentry
Contributor

@eitanbanks To modify the WDL requires a change to the WDL spec. There's a process that's not directly controlled by the Cromwell team (or the Broad): https://github.com/openwdl/wdl/blob/master/GOVERNANCE.md#rfc-process

IOW on our side of the fence we can try to work on getting it in but there's no guarantee of it ever happening even if we want it to (note: that's a worst case, I don't expect that'd really happen)

@geoffjentry
Contributor

The other relevant bit I should mention is that there's a separate WDL team starting next quarter (currently just @cjllanwarne but hopefully by then it'll be 2 ppl) - we've been talking about managing it in a manner similar to the field engineering team but that's still kind of up in the air at the moment.

@eitanbanks

Right, understood about the WDL spec being open. But it's not helpful for us to have to advocate to an external committee to get features added in. The hope is that this team (or the new WDL team) can represent us in those discussions. Our goal is to try to convince Kate that this is important enough to prioritize.

@geoffjentry
Contributor

Sure, but the second point I made is that that's also likely not how prioritization will work for the WDL team

@eitanbanks

So bribery, like with Field Eng?

@LeeTL1220

LeeTL1220 commented Nov 14, 2017 via email

@droazen

droazen commented Nov 14, 2017

Methods and Ops could put together a bribe package if necessary...

@snovod

snovod commented Nov 14, 2017

We should stick to our respective roles. You put it together and we will ensure it gets delivered.

@awacs

awacs commented Nov 22, 2017

This would be extremely helpful.

@LeeTL1220

Ping! We still would like this. Is it on the roadmap?

@guma44

guma44 commented Sep 27, 2019

Hey, is there any news on this ticket? It is really desirable and it has been hanging for two years now.

@gemmalam

@guma44 we have moved to Jira and this ticket is in review at the moment on our new board. Please check out this ticket https://broadworkbench.atlassian.net/browse/BA-5933 or pull request #5180 for updates.

@aednichols
Collaborator

The PR that merged addresses some, but not all, of the scope of this issue. It specifically targets running out of memory. I believe it may lay a helpful foundation for future work, however.

To set some expectations @guma44 we get many more issues than we have time to work on, and our institutional sponsors (who pay the rent) get first dibs on selecting the most important ones.

We are always happy to consider PRs if you feel like contributing yourself.

@guma44

guma44 commented Sep 30, 2019

@gemmalam, @aednichols Thanks for the reply. I see the PR as very promising for our setup. It would probably solve our problems for now. Concerning contribution, I would do this if I knew more Java/Scala, but I have not been programming in those languages for years.
