-
Notifications
You must be signed in to change notification settings - Fork 357
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
User-defined retries #1991
Comments
Adding @vdauwera's comment about adding error codes to GATK from DSDE-docs #1742:
|
Just a though --what if we added a "retryOn" attribute that took a boolean expression. Then we add a function like grep(pattern, var/file, mode) <see R for example function> which would probably be similar work and much more reusable. Then Jose could write something like
retryOn = grep("(foo|bar", $stderr, "any")
But could also check any other element. Syntax is lousy typing from tiny keyboard
… On Feb 16, 2017, at 11:23 AM, Kate Voss ***@***.***> wrote:
Adding @vdauwera's comment about adding error codes to GATK from DSDE-docs #1742:
We may be able to put in error codes for things like this in GATK4. Should ask David Roazen or Louis Bergelson.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
This is de-prioritized for the time being. |
Could I propose that this be re-prioritized? It would help us deal with transient GCS hiccups in production (eg., connections suddenly getting closed, etc.). Individual tools in the GATK and Picard can't possibly catch every exception across every library involved, so an execution-framework-level retry at the job level would help enormously. |
This would be extremely useful for us. We're currently having to deal with several problems that would be helped by an automated retry ability. The first problems is what David said, we have tools that can fail sometimes due to GCS issues and being able to restart when that happens would be useful. We're working on making our code more robust to that, but it's difficult to completely fix the problem. Having to restart a workflow with 10s of thousands of jobs because 2 failed is pretty annoying. The second problem is out of memory issues. We have thousands of jobs, and most will run with a small amount of memory, but some of them will need more. It's difficult to predict ahead of time which shards will need more since it's a function of the data rather than of the file size. Having a way to automatically retry these shards with increased memory would be really valuable since it would let us provision for the average shard rather than the worst case. |
@katevoss This is actually really important, not just for @droazen and @lbergelson ... This issue has cost the Broad $$$ and analysts a lot of time. Not just the people on this issue. And putting retry code into the GATK (or any task for that matter) is bit arduous and actually a more expensive solution, especially when some random code path is missed. Also, retry on memory should do a lot for us to be able to reduce costs. |
+1 on this feature (or one like it) -- it's really helpful for writing robust and cheap workflows |
FWIW this has been discussed as a key feature for a WDL push next quarter |
Also, note that Google PD's can be expanded on the fly in seconds, even while the VM is still running under load. I've done this manually on non-FC VMs via the script below. Using this approach combined with a disk space monitoring process (and a size cap!) would allow the job to pass the first time, avoiding a retry. And... if it was also during the algorithm, not just data download, this could eradicate both disk space errors and disk over-provisioning. https://github.com/broadinstitute/firecloud_developer_toolkit/blob/master/gce/expand_disk.sh Unfortunately I don't know of a way to hot-swap RAM into the VM. |
From #1449, make sure "custom retries can increase |
As a workflow runner, I want Cromwell to automatically retry my workflow with increased memory/disk/on a specific error code, etc, so that I can get my workflow to complete without having to manually intervene.
|
Just wanted to check in to inquire about the status of this ticket. Has it been added to any milestones yet? Although we've packed Google's GCS library full of retries, at considerable expense of developer time and effort, there are still blips in production where something like a host name lookup will randomly fail, as @jsotobroad can attest. Since it's difficult/impossible to bake in retries for every network operation in every library involved, as discussed above, this ticket is our best hope of dealing with these annoying little blips once and for all. |
@droazen I wish I had better news for you but unfortunately this has not made it into our priorities. In early December we will have more flexibility to take on additional features, and I will note this one as a possible item. @geoffjentry do you have an estimate for the effort it would take? Could it be a User Driven Development project if a team member is so motivated? |
This issue causes us a lot of pain and lost time.
…On Tue, Nov 14, 2017 at 2:51 PM, Kate Voss ***@***.***> wrote:
@droazen <https://github.com/droazen> I wish I had better news for you
but unfortunately this has not made it into our priorities. In early
December we will have more flexibility to take on additional features, and
I will note this one as a possible item.
@geoffjentry <https://github.com/geoffjentry> do you have an estimate for
the effort it would take? Could it be a User Driven Development project if
a team member is so motivated?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#1991 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ACDXkzk8mnnWtEcY07IcNkF7jyrCW-ecks5s2e88gaJpZM4MCdgW>
.
--
Lee Lichtenstein
Broad Institute
75 Ames Street, Room 8011A
Cambridge, MA 02142
617 714 8632
|
Just wanted to add this causes Ops LOTs of pain. If this could be addressed it would be a huge help. |
If the implementation of this involves WDL it's no longer an "us" thing. There are other possible ways to implement this, but to date I've only seen/heard WDL-based ones (and it seems like the right path to do it that way) |
@katevoss Ah, that's too bad. As @snovod and @LeeTL1220 mention above, this is a massive pain point for both Ops and Methods right now, and wastes a lot of our time, so if it could somehow get prioritized in the near future we'd be grateful! |
In that case, I shall bring this to the attention of Commodore WDL ( @cjllanwarne )! |
I'd encourage interested parties (e.g. @drozen) to directly interface w/ OpenWDL |
What does that mean? How do we get you guys to prioritize work on it? |
@eitanbanks To modify the WDL requires a change to the WDL spec. There's a process that's not directly controlled by the Cromwell team (or the Broad): https://github.com/openwdl/wdl/blob/master/GOVERNANCE.md#rfc-process IOW on our side of the fence we can try to work on getting it in but there's no guarantee of it ever happening even if we want it to (note: that's a worst case, I don't expect that'd really happen) |
The other relavent bit I should mention is that there's a separate WDL team starting next quarter (currently just @cjllanwarne but hopefully by then it'll be 2 ppl) - we've been talking about managing it in a manner similar to the field engineering team but that's still kind of up in the air at the moment. |
Right, understood about the WDL spec being open. But it's not helpful for us to have to advocate to an external committee to get features added in. The hope is that this team (or the new WDL team) can represent us in those discussions. Our goal is to try to convince Kate that this is important enough to prioritize. |
Sure, but the second point I made is that that's also likely not how prioritization will work for the WDL team |
So bribery, like with Field Eng? |
Seriously: What is the prioritization process here?
Not seriously: Do I need to go to an ATM?
…On Tue, Nov 14, 2017 at 4:30 PM, Eric Banks ***@***.***> wrote:
So bribery, like with Field Eng?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1991 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ACDXk2zeQ6zcnmqLNr8NBIKE8ZPJSFI3ks5s2gaTgaJpZM4MCdgW>
.
--
Lee Lichtenstein
Broad Institute
75 Ames Street, Room 8011A
Cambridge, MA 02142
617 714 8632
|
Methods and Ops could put together a bribe package if necessary... |
We should stick to our respected roles. You put it together and we will ensure it gets delivered. |
This would be extremely helpful. |
Ping! We still would like this. Is it on the roadmap? |
Hey, is there any info on this ticket? It is really desirable and it is hanging for two years now. |
@guma44 we have moved to Jira and this ticket is in review at the moment on our new board. Please check out this ticket https://broadworkbench.atlassian.net/browse/BA-5933 or this Pull Request for updates #5180 |
The PR that merged addresses some, but not all of the scope of this issue. It specifically targets running out of memory. I believe it may lay a helpful foundation for future work, however. To set some expectations @guma44 we get many more issues than we have time to work on, and our institutional sponsors (who pay the rent) get first dibs on selecting the most important ones. We are always happy to consider PRs if you feel like contributing yourself. |
@gemmalam, @aednichols Thanks for reply. I see the PR very promising for our setup. It would probably solve our problems for now. Concerning, contribution I would do this if I knew more Java/Scala but I have not been programming in those languages for years. |
Briefest of discussions with Jose. NOTE: All naming up in the air.
Enable a runtime attribute such as
retryOnStderrPattern
that populates a valueretryAttempt
/retry
/retryCount
/retry_count
/etc. This will enable tasks such as:When the stderr contains the specified regular expression pattern, the job should be retried with the counter incremented.
Not discussed afaik, how to limit the number of attempts: another runtime attribute, a backend config value, both, other?
The text was updated successfully, but these errors were encountered: