Fix querying slurm jobs after they're gone
When a slurm job is cancelled outside of Law's control while Law is not
actively polling it, Law still thinks it is running. On the first poll
afterwards, it normally finds out that the job has failed and restarts it.

However, the slurm scheduler forgets about old jobs after some time. Trying to
query the status of such a job returns an error. Handle this by returning a
FAILED status for all jobs when squeue fails with "Invalid job id specified".
lmoureaux committed Aug 19, 2022
1 parent 5018a76 commit 470073e
Showing 1 changed file with 5 additions and 2 deletions.
7 changes: 5 additions & 2 deletions law/contrib/slurm/job.py
@@ -144,8 +144,11 @@ def query(self, job_id, partition=None, silent=False):

         # handle errors
         if code != 0:
-            if silent:
-                return None
+            if "Invalid job id specified" in err:
+                return {id: self.job_status_dict(job_id=id, status=self.FAILED, error="slurm doesn't know about this job")
+                        for id in job_id}
+            elif silent:
+                return None # FIXME: doing this seems to break BaseRemoteWorkflowProxy.poll
             else:
                 raise Exception("queue query of slurm job(s) '{}' failed with code {}:"
                     "\n{}".format(job_id, code, err))
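The error handling in this change can be tried outside of Law with a minimal standalone sketch. The names `handle_query_error`, `job_status_dict` and `FAILED` below are hypothetical stand-ins written for illustration, not Law's actual API, and the sketch assumes the error text contains the string "Invalid job id specified" as described in the commit message. Unlike the diff, it also normalises a single job id to a list, on the assumption that `job_id` may be a plain string, in which case iterating over it directly would yield one entry per character.

```python
# Hypothetical stand-in for the status constant used by the job manager.
FAILED = "failed"


def job_status_dict(job_id=None, status=None, code=None, error=None):
    """Build a status record for one job (illustrative structure only)."""
    return {"job_id": job_id, "status": status, "code": code, "error": error}


def handle_query_error(job_ids, code, err, silent=False):
    """Decide what to return when a queue query exits with a non-zero code.

    - scheduler no longer knows the job(s): report every job as FAILED
    - any other error with silent=True:     return None
    - any other error otherwise:            raise
    """
    # Accept a single id as well as a list or tuple of ids.
    if isinstance(job_ids, str):
        job_ids = [job_ids]

    if "Invalid job id specified" in err:
        return {
            jid: job_status_dict(
                job_id=jid,
                status=FAILED,
                error="slurm doesn't know about this job",
            )
            for jid in job_ids
        }
    if silent:
        return None
    raise RuntimeError(
        "queue query of slurm job(s) '{}' failed with code {}:\n{}".format(
            job_ids, code, err
        )
    )


if __name__ == "__main__":
    result = handle_query_error(
        ["123", "456"], 1, "slurm_load_jobs error: Invalid job id specified"
    )
    for jid, status in result.items():
        print(jid, status["status"], status["error"])
```

Checking for the forgotten-job case before the `silent` branch is what lets a silent poll still learn that the jobs are gone instead of receiving `None`.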