Slurm timeouts that happen during MiniWDL container runs are not detected as timeouts #5073

Open
adamnovak opened this issue Aug 21, 2024 · 0 comments

adamnovak commented Aug 21, 2024

Cecilia reported a workflow run where the jobs were printing a message from MiniWDL saying that they saw signal 2 (SIGINT), and then failing.

I think they are getting the timeout signal from Slurm, but MiniWDL's signal handler is replacing the one we install (or the default Python one), so we never see the timeout signal that we expect to make the worker fail the job in a way we recognize as a timeout. As a result, we don't get any of the useful user-facing timeout logging, and the user thinks the job is actually failing and not just timing out.
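
To illustrate the suspected mechanism, here is a minimal sketch using only Python's standard `signal` module. It does not use Toil or MiniWDL code: `outer_handler` and `inner_handler` are hypothetical stand-ins for the handler the worker installs and the handler a library installs while it runs a task.

```python
import signal

calls = []

def outer_handler(signum, frame):
    # Hypothetical stand-in for the handler the worker installs to notice a timeout.
    calls.append("outer")

def inner_handler(signum, frame):
    # Hypothetical stand-in for a handler a library installs while it runs a task.
    calls.append("inner")

def deliver_with_inner_handler():
    """Send SIGINT to this process while the inner handler is installed."""
    # signal.signal() replaces the current handler and returns the old one.
    previous = signal.signal(signal.SIGINT, inner_handler)
    try:
        signal.raise_signal(signal.SIGINT)
    finally:
        # Put back whatever was installed before.
        signal.signal(signal.SIGINT, previous)

# Install the outer handler first, as the worker would.
original = signal.signal(signal.SIGINT, outer_handler)
try:
    deliver_with_inner_handler()
finally:
    signal.signal(signal.SIGINT, original)

print(calls)  # the outer handler is never called for this delivery
```

Because a process has only one handler per signal, whoever installed a handler last is the only one notified unless that handler explicitly calls the previous one.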

We either need to hack MiniWDL's signal handlers, or detect when MiniWDL is raising its WDL.runtime.error.Terminated exception and treat it as a timeout (at least under Slurm).
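
The second option could look roughly like the sketch below. This is an illustration under stated assumptions, not Toil's actual code: `Terminated` is a local stand-in for WDL.runtime.error.Terminated so the example runs without MiniWDL installed, `JobTimedOut` and `run_task` are hypothetical names, and checking the `SLURM_JOB_ID` environment variable is just one assumed way to tell that we are inside a Slurm job.

```python
import os

class Terminated(Exception):
    """Local stand-in for WDL.runtime.error.Terminated (assumption for this sketch)."""

class JobTimedOut(Exception):
    """Hypothetical exception that the worker would report as a timeout."""

def run_task(task, environ=None):
    """Run task(), translating a termination into a timeout when under Slurm."""
    environ = os.environ if environ is None else environ
    try:
        return task()
    except Terminated as e:
        if "SLURM_JOB_ID" in environ:
            # Assumption: under Slurm, a signal-driven termination most likely
            # means the job hit its time limit.
            raise JobTimedOut("task terminated by signal; treating as a Slurm timeout") from e
        # Outside Slurm, leave the original exception alone.
        raise
```

A caller would wrap the MiniWDL task invocation in `run_task` and route `JobTimedOut` into the existing timeout logging. Note that this treats every termination under Slurm as a timeout, which would also catch a user-initiated cancel.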

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1637
