Restore non-AiiDA process task from checkpoint #242

Closed
superstar54 opened this issue Aug 19, 2024 · 2 comments · Fixed by #245

@superstar54
Member

superstar54 commented Aug 19, 2024

If the daemon stops while a non-AiiDA process task is running, the task hangs after the daemon is restarted.

Non-AiiDA process tasks include:

  • tasks decorated with @task

  • while tasks

    task["execution_count"] -= 1
  • if tasks

@superstar54 superstar54 added the bug label Aug 19, 2024
@superstar54 superstar54 self-assigned this Aug 19, 2024
@superstar54 superstar54 changed the title from "Restore normal function from checkpoint" to "Restore non-AiiDA process task from checkpoint" Aug 19, 2024
@superstar54
Member Author

superstar54 commented Aug 19, 2024

To reproduce: submit a WorkGraph containing a while loop, then stop and restart the daemon while it is running. The engine raises this error:

2024-08-19 20:12:44 [169487 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: Continue workgraph.
2024-08-19 20:12:44 [169488 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: tasks ready to run: while3
2024-08-19 20:12:44 [169489 | REPORT]: [118860|WorkGraphEngine|run_tasks]: Run task: while3, type: WHILE
2024-08-19 20:12:44 [169490 | REPORT]: [118860|WorkGraphEngine|run_tasks]: While Task while3: Condition not fullilled, task finished. Skip all its children.
2024-08-19 20:12:46 [169491 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: Continue workgraph.
2024-08-19 20:12:46 [169492 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: tasks ready to run: add12
2024-08-19 20:12:46 [169493 | REPORT]: [118860|WorkGraphEngine|run_tasks]: Run task: add12, type: CALCFUNCTION
2024-08-19 20:12:47 [169494 | REPORT]: [118860|WorkGraphEngine|update_task_state]: Task: add12 finished.
2024-08-19 20:12:47 [169495 | REPORT]: [118860|WorkGraphEngine|update_while_task_state]: Wihle Task while1: this iteration finished. Try to reset for the next iteration.
2024-08-19 20:12:48 [169496 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: Continue workgraph.
2024-08-19 20:12:48 [169497 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: tasks ready to run: compare1
2024-08-19 20:12:48 [169498 | REPORT]: [118860|WorkGraphEngine|run_tasks]: Run task: compare1, type: CALCFUNCTION
2024-08-19 20:13:17 [169499 | REPORT]: [118860|WorkGraphEngine|continue_workgraph]: Continue workgraph.
2024-08-19 20:13:18 [169500 | REPORT]: [118860|WorkGraphEngine|on_except]: Traceback (most recent call last):
  File "/home/xing/miniconda3/envs/aiida/lib/python3.11/site-packages/plumpy/process_states.py", line 228, in execute
    result = self.run_fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xing/repos/superstar54/aiida-workgraph/aiida_workgraph/engine/workgraph.py", line 308, in _do_step
    self.continue_workgraph()
  File "/home/xing/repos/superstar54/aiida-workgraph/aiida_workgraph/engine/workgraph.py", line 645, in continue_workgraph
    if ready and self.task_should_run(name):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xing/repos/superstar54/aiida-workgraph/aiida_workgraph/engine/workgraph.py", line 913, in task_should_run
    index = [i for i, item in enumerate(name_and_uuids) if item[1] == uuid][0]
            ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
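
The last frame of the traceback indexes `[0]` into a list comprehension that can come back empty, which is what raises the `IndexError`. The snippet below is only a minimal sketch of that pattern and of a defensive lookup. The names `name_and_uuids` and `uuid` come from the traceback; the helper `find_index` and the sample data are hypothetical, and this is not necessarily how the linked pull request fixes the issue.

```python
from typing import Optional, Sequence, Tuple


def find_index(name_and_uuids: Sequence[Tuple[str, str]], uuid: str) -> Optional[int]:
    """Return the position of the first entry whose second element is `uuid`.

    Returns None instead of raising when no entry matches.
    (Hypothetical helper, not part of aiida-workgraph.)
    """
    return next(
        (i for i, item in enumerate(name_and_uuids) if item[1] == uuid),
        None,
    )


# Hypothetical data: (task name, uuid) pairs.
entries = [("while1", "uuid-a"), ("add12", "uuid-b")]

# The pattern from the traceback fails when the uuid is not in the list.
try:
    index = [i for i, item in enumerate(entries) if item[1] == "uuid-missing"][0]
except IndexError as exc:
    original_error = str(exc)

found = find_index(entries, "uuid-b")
missing = find_index(entries, "uuid-missing")
print(found, missing, original_error)
```

With a lookup like this, the caller can decide explicitly what a missing entry means after a restore, rather than letting the whole WorkGraph process except.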

@superstar54 superstar54 linked a pull request Aug 20, 2024 that will close this issue
@superstar54
Member Author

Some more detail on checkpointing and restoring. The WorkGraph engine cannot resume from the exact point where it was interrupted; instead, it restores from the last checkpoint.

Case 1: A while task is running and its execution_count has already been increased by one when the daemon stops. In the checkpoint, however, the while task is not yet running and its execution_count has not been increased. When the daemon restarts, the while task is ready to run and runs again. Thus, we don't need to modify the execution_count.
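
Case 1 can be illustrated with a toy model. This is only a sketch under simplifying assumptions: `ToyEngine`, its methods, and the state names are hypothetical stand-ins, not the aiida-workgraph or plumpy API, and the checkpoint is modeled as a deep copy taken before the while task starts.

```python
import copy


class ToyEngine:
    """Hypothetical stand-in for an engine that restores from a checkpoint."""

    def __init__(self):
        self.tasks = {"while1": {"state": "PLANNED", "execution_count": 0}}
        self._checkpoint = copy.deepcopy(self.tasks)

    def save_checkpoint(self):
        # Persist a snapshot of the current task states.
        self._checkpoint = copy.deepcopy(self.tasks)

    def run_while(self, name):
        # Running the while task bumps its counter in memory only.
        task = self.tasks[name]
        task["state"] = "RUNNING"
        task["execution_count"] += 1

    def restore(self):
        # A daemon restart discards in-memory changes made after the checkpoint.
        self.tasks = copy.deepcopy(self._checkpoint)


engine = ToyEngine()
engine.save_checkpoint()

engine.run_while("while1")  # the daemon stops right after this step
in_memory = copy.deepcopy(engine.tasks["while1"])

engine.restore()            # the daemon restarts from the checkpoint
restored = copy.deepcopy(engine.tasks["while1"])

engine.run_while("while1")  # the task is ready again and simply re-runs
rerun = engine.tasks["while1"]
print(in_memory, restored, rerun)
```

In this model the increment made before the stop is lost together with the rest of the in-memory state, so re-running the task yields the same count as an uninterrupted run and no manual correction is needed.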
