User-defined retries #1991

Open
kshakir opened this issue Feb 16, 2017 · 35 comments

Comments

@kshakir
Contributor

kshakir commented Feb 16, 2017

Briefest of discussions with Jose. NOTE: all naming is still up in the air.

Enable a runtime attribute such as retryOnStderrPattern that populates a counter value named retryAttempt/retry/retryCount/retry_count/etc. This would enable tasks such as:

task mytask {
  command {
     mycommand.sh
  }
  runtime {
    retryOnStderrPattern = "(OutOfMemoryError|disk quota exceeded)"
    memory = (6 * retryAttempt) + "GB"
    disk = "local-disk " + (100 * retryAttempt) + " SSD"
    docker = "myrepo/myimage"
  }
}

When the stderr contains the specified regular expression pattern, the job should be retried with the counter incremented.

Not discussed, as far as I know: how to limit the number of attempts. Another runtime attribute, a backend config value, both, or something else?
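As a sketch of the runtime-attribute option (purely hypothetical; maxRetries below is a placeholder name, just like retryOnStderrPattern and retryAttempt above, and none of these exist today):

task mytask {
  command {
     mycommand.sh
  }
  runtime {
    # Hypothetical attributes: retry when stderr matches the pattern,
    # but give up after 3 retries.
    retryOnStderrPattern = "(OutOfMemoryError|disk quota exceeded)"
    maxRetries = 3
    memory = (6 * retryAttempt) + "GB"
    docker = "myrepo/myimage"
  }
}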

@katevoss

Adding @vdauwera's comment about adding error codes to GATK from DSDE-docs #1742:

We may be able to put in error codes for things like this in GATK4. Should ask David Roazen or Louis Bergelson.

@kcibul
Contributor

kcibul commented Feb 16, 2017 via email

@katevoss katevoss added this to the Q1 - Cromwell User Joy milestone Feb 21, 2017
@katevoss katevoss added the Retry label Mar 13, 2017
@katevoss katevoss changed the title Enable retries based on a standard error pattern User-defined retries Mar 21, 2017
@katevoss

This is de-prioritized for the time being.

@katevoss katevoss removed this from the Q1 - Cromwell User Joy milestone Mar 23, 2017
@droazen

droazen commented May 24, 2017

Could I propose that this be re-prioritized? It would help us deal with transient GCS hiccups in production (e.g., connections suddenly getting closed). Individual tools in the GATK and Picard can't possibly catch every exception across every library involved, so an execution-framework-level retry at the job level would help enormously.

@lbergelson
Member

This would be extremely useful for us. We're currently having to deal with several problems that would be helped by an automated retry ability.

The first problem is what David said: we have tools that can sometimes fail due to GCS issues, and being able to restart when that happens would be useful. We're working on making our code more robust to that, but it's difficult to fix the problem completely. Having to restart a workflow with tens of thousands of jobs because 2 failed is pretty annoying.

The second problem is out of memory issues. We have thousands of jobs, and most will run with a small amount of memory, but some of them will need more. It's difficult to predict ahead of time which shards will need more since it's a function of the data rather than of the file size. Having a way to automatically retry these shards with increased memory would be really valuable since it would let us provision for the average shard rather than the worst case.
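Concretely, using the hypothetical retryAttempt counter from the proposal at the top (none of these attributes exist yet; this is only a sketch of the desired behavior), a shard could be provisioned for the average case and escalated only when it actually runs out of memory:

task scatter_shard {
  command {
    run_shard.sh
  }
  runtime {
    # Hypothetical: retry only on an actual OOM, starting at 4 GB and
    # adding 4 GB per retry (assuming retryAttempt starts at 0).
    retryOnStderrPattern = "OutOfMemoryError"
    memory = (4 + 4 * retryAttempt) + "GB"
    docker = "myrepo/myimage"
  }
}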

@LeeTL1220

@katevoss This is actually really important, not just for @droazen and @lbergelson ... This issue has cost the Broad $$$ and analysts a lot of time. Not just the people on this issue. And putting retry code into the GATK (or any task for that matter) is a bit arduous and actually a more expensive solution, especially when some random code path is missed.

Also, retry on memory should do a lot for us to be able to reduce costs.

@kcibul
Contributor

kcibul commented Jul 20, 2017

+1 on this feature (or one like it) -- it's really helpful for writing robust and cheap workflows

@geoffjentry
Contributor

FWIW this has been discussed as a key feature for a WDL push next quarter

@gsaksena

gsaksena commented Aug 29, 2017

Also, note that Google PDs can be expanded on the fly in seconds, even while the VM is still running under load. I've done this manually on non-FC VMs via the script below. Using this approach combined with a disk-space monitoring process (and a size cap!) would allow the job to pass the first time, avoiding a retry. And if it also ran while the algorithm was executing, not just during data download, this could eradicate both disk-space errors and disk over-provisioning.

https://github.com/broadinstitute/firecloud_developer_toolkit/blob/master/gce/expand_disk.sh

Unfortunately I don't know of a way to hot-swap RAM into the VM.

@katevoss

From #1449, make sure "custom retries can increase memory, disk, AND bootDiskInGb".

@katevoss

katevoss commented Sep 7, 2017

From #1729, this is also important to @ktibbett and the Emerald Empire.

@katevoss

katevoss commented Sep 7, 2017

As a workflow runner, I want Cromwell to automatically retry my workflow with increased memory or disk, or on a specific error code, etc., so that I can get my workflow to complete without having to manually intervene.

  • Effort: ? @geoffjentry
  • Risk: Medium
    • if users are unaware that they have retries configured in ways that make later attempts much more expensive (e.g. doubling memory on the second or third run), they could end up paying for a much more expensive VM when a smaller one would do (see the sketch after this list)
  • Business value: Large
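One way to contain that risk, still using the same hypothetical attributes sketched earlier in this thread, is to cap the escalated value inside the expression, so a runaway retry can never request more than a known ceiling:

task bounded_retry {
  command {
    run_shard.sh
  }
  runtime {
    # Hypothetical: add 8 GB per retry, but never request more than 32 GB.
    retryOnStderrPattern = "OutOfMemoryError"
    memory = (if 8 + 8 * retryAttempt > 32 then 32 else 8 + 8 * retryAttempt) + "GB"
    docker = "myrepo/myimage"
  }
}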

@droazen

droazen commented Nov 2, 2017

Just wanted to check in to inquire about the status of this ticket. Has it been added to any milestones yet?

Although we've packed Google's GCS library full of retries, at considerable expense of developer time and effort, there are still blips in production where something like a host name lookup will randomly fail, as @jsotobroad can attest. Since it's difficult/impossible to bake in retries for every network operation in every library involved, as discussed above, this ticket is our best hope of dealing with these annoying little blips once and for all.

@katevoss

@droazen I wish I had better news for you but unfortunately this has not made it into our priorities. In early December we will have more flexibility to take on additional features, and I will note this one as a possible item.

@geoffjentry do you have an estimate for the effort it would take? Could it be a User Driven Development project if a team member is so motivated?

@LeeTL1220

LeeTL1220 commented Nov 14, 2017 via email

@snovod

snovod commented Nov 14, 2017

Just wanted to add that this causes Ops LOTS of pain. If this could be addressed it would be a huge help.

@geoffjentry
Contributor

If the implementation of this involves WDL it's no longer an "us" thing. There are other possible ways to implement this, but to date I've only seen/heard WDL-based ones (and it seems like the right path to do it that way)

@droazen

droazen commented Nov 14, 2017

@katevoss Ah, that's too bad. As @snovod and @LeeTL1220 mention above, this is a massive pain point for both Ops and Methods right now, and wastes a lot of our time, so if it could somehow get prioritized in the near future we'd be grateful!

@katevoss

In that case, I shall bring this to the attention of Commodore WDL ( @cjllanwarne )!

@geoffjentry
Contributor

I'd encourage interested parties (e.g. @droazen) to directly interface w/ OpenWDL

@eitanbanks

eitanbanks commented Nov 14, 2017

What does that mean? How do we get you guys to prioritize work on it?
(This is directed at Jeff's comment that we should interface elsewhere)

@geoffjentry
Contributor

@eitanbanks To modify the WDL requires a change to the WDL spec. There's a process that's not directly controlled by the Cromwell team (or the Broad): https://github.com/openwdl/wdl/blob/master/GOVERNANCE.md#rfc-process

IOW on our side of the fence we can try to work on getting it in but there's no guarantee of it ever happening even if we want it to (note: that's a worst case, I don't expect that'd really happen)

@geoffjentry
Contributor

The other relevant bit I should mention is that there's a separate WDL team starting next quarter (currently just @cjllanwarne but hopefully by then it'll be 2 ppl) - we've been talking about managing it in a manner similar to the field engineering team but that's still kind of up in the air at the moment.

@eitanbanks

Right, understood about the WDL spec being open. But it's not helpful for us to have to advocate to an external committee to get features added in. The hope is that this team (or the new WDL team) can represent us in those discussions. Our goal is to try to convince Kate that this is important enough to prioritize.

@geoffjentry
Contributor

Sure, but the second point I made is that that's also likely not how prioritization will work for the WDL team

@eitanbanks

So bribery, like with Field Eng?

@LeeTL1220

LeeTL1220 commented Nov 14, 2017 via email

@droazen

droazen commented Nov 14, 2017

Methods and Ops could put together a bribe package if necessary...

@snovod

snovod commented Nov 14, 2017

We should stick to our respective roles. You put it together and we will ensure it gets delivered.

@awacs

awacs commented Nov 22, 2017

This would be extremely helpful.

@LeeTL1220

Ping! We still would like this. Is it on the roadmap?

@guma44

guma44 commented Sep 27, 2019

Hey, is there any news on this ticket? It is really desirable and it has been hanging for two years now.

@gemmalam

@guma44 we have moved to Jira and this ticket is in review at the moment on our new board. Please check out this ticket https://broadworkbench.atlassian.net/browse/BA-5933 or pull request #5180 for updates.

@aednichols
Collaborator

The PR that merged addresses some, but not all, of the scope of this issue. It specifically targets running out of memory. I believe it may lay a helpful foundation for future work, however.

To set some expectations @guma44 we get many more issues than we have time to work on, and our institutional sponsors (who pay the rent) get first dibs on selecting the most important ones.

We are always happy to consider PRs if you feel like contributing yourself.

@guma44

guma44 commented Sep 30, 2019

@gemmalam, @aednichols Thanks for the reply. I see the PR as very promising for our setup. It would probably solve our problems for now. Concerning contribution, I would do this if I knew more Java/Scala, but I have not been programming in those languages for years.
