[NeMo-UX] Support save_last="link"
#10548
base: main
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
            logging.info(f'Scheduled async checkpoint save for {filepath}')
        else:
            finalize_fn()

    def _save_last_checkpoint(self, trainer: "pl.Trainer", monitor_candidates: Dict[str, torch.Tensor]) -> None:
I'm wondering if there is some way to avoid overriding the whole method. This is always risky, since we lose touch with upstream.
How is our flow different from the one in PTL, such that we have to add the saved_current_step logic and cannot rely on self.last_model_path?
Is it because PTL links to any available last checkpoint (not necessarily from the last iteration)?
Yeah, I made two changes:
- Made sure to add a symlink only when the current step was actually saved. As you suggested, PTL always links to the last checkpoint saved, which might not correspond to the latest step.
- Added these lines, which fix the last_model_path saved to the *-last checkpoint state dict when using symlinks.
I'll think about whether we can make these fixes without overwriting the entire _save_last_checkpoint method.
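To make the first change concrete, here is a minimal sketch of the guard. It is not the code from this PR: the function name update_last_link and the directory names are hypothetical, and only the saved_current_step flag comes from the discussion above. The link is refreshed only when a checkpoint was actually written for the current step, so a checkpoint from an earlier step is never relabeled as -last.

```python
import os


def update_last_link(ckpt_path: str, link_path: str, saved_current_step: bool) -> bool:
    """Point link_path at ckpt_path only if ckpt_path was saved at the current step.

    Hypothetical helper for illustration; returns True when the link was updated.
    """
    if not saved_current_step:
        # Nothing was saved at this step, so leave any existing link untouched.
        return False
    if os.path.lexists(link_path):
        # Assumes the previous -last entry is a symlink, not a real directory.
        os.remove(link_path)
    # A relative target keeps the link valid if the whole directory is moved.
    target = os.path.relpath(ckpt_path, os.path.dirname(link_path))
    os.symlink(target, link_path)
    return True
```

Calling it with saved_current_step=False is a no-op, which is the behavior the reply describes as missing from the upstream flow.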
Thanks. Maybe overwriting _save_last_checkpoint is inevitable, in which case the current version is OK.
Running some final tests now, but I think I was able to avoid overwriting _save_last_checkpoint. Please let me know if you have any concerns with the current approach.
Great, thanks, this is great.
Do you know how last_model_path is used during restart? I'm wondering if the loaded state dict will be valid if, e.g., a failure happens between the regular and "last" ckpt save.
I think last_model_path is only used when removing the previous -last model, to ensure we only retain a single -last checkpoint. If a failure happens between the regular and last checkpoint saves, I don't think the state dict will be valid, but I also don't think this is a concern, because we'd end up restoring from the previously saved -last checkpoint, which does have the correct state dict.
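The restart argument can be illustrated with a small simulation. This is a sketch under assumptions, not NeMo's restore logic: the directory layout, the save_step and resume_target helpers, and the simulated crash are all hypothetical. It shows that if the process dies after the regular checkpoint is written but before the -last link is refreshed, the link still resolves to the previous complete checkpoint.

```python
import os


def save_step(root: str, step: int, crash_before_link: bool = False) -> None:
    """Write a regular checkpoint directory, then repoint the -last symlink at it."""
    ckpt = os.path.join(root, f"step={step}")
    os.makedirs(ckpt)
    if crash_before_link:
        # Simulate a failure between the regular save and the -last update.
        raise RuntimeError("simulated failure before updating -last")
    link = os.path.join(root, "model-last")
    if os.path.lexists(link):
        os.remove(link)
    # Relative link: the target lives in the same directory as the link.
    os.symlink(os.path.basename(ckpt), link)


def resume_target(root: str) -> str:
    """Return the name of the checkpoint directory the -last link resolves to."""
    return os.path.basename(os.path.realpath(os.path.join(root, "model-last")))
```

In this toy setup, a crash during step 20 leaves step=20 on disk but resume_target still reports the step 10 checkpoint, matching the reasoning in the reply above.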
…o ashors/symlink-last-ckpt
filepath = ckpt_to_dir(filepath)
linkpath = ckpt_to_dir(linkpath)
super()._link_checkpoint(trainer, filepath, linkpath)
Is there a way to avoid overriding PTL's _link_checkpoint method? We want to avoid overriding PTL's private methods so that our code stays stable.
I can think about it, but it might be challenging to support linking with async checkpointing without overwriting this method. Also, the addition of saved_current_step is needed to fix a bug that seems to exist in PTL's link implementation, in which the -last checkpoint gets linked to the most recently saved checkpoint, even if that corresponds to a different step.
I see, thanks @ashors1! Then maybe we should file an issue with PTL and ask them to fix this bug? That would save us from overriding private methods.
Yeah, it would be great if PTL could fix that issue! But we'd still have to figure out how to handle async checkpointing, and I do think that would require us to overwrite either _link_checkpoint or _save_last_checkpoint (where _link_checkpoint is invoked).
Agree that for handling async save we have to override _link_checkpoint anyway.
But since we call super()._link_checkpoint, I don't think there is too much risk connected with that.
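The pattern agreed on here, overriding the method but delegating the actual linking to the parent, might look roughly like the sketch below. Everything in it is an assumption for illustration: BaseCheckpointer is a stand-in for the upstream callback rather than PTL's ModelCheckpoint, and the pending list and finalize_pending method are hypothetical ways to model "run this once the async save completes".

```python
import os
from typing import Callable, List


class BaseCheckpointer:
    """Stand-in for the upstream callback (hypothetical, not PTL's class)."""

    def _link_checkpoint(self, filepath: str, linkpath: str) -> None:
        if os.path.lexists(linkpath):
            os.remove(linkpath)  # assumes the old entry is a symlink
        os.symlink(os.path.relpath(filepath, os.path.dirname(linkpath)), linkpath)


class AsyncAwareCheckpointer(BaseCheckpointer):
    def __init__(self, async_save: bool) -> None:
        self.async_save = async_save
        # Callbacks to run once the pending asynchronous saves have finished.
        self.pending: List[Callable[[], None]] = []

    def _link_checkpoint(self, filepath: str, linkpath: str) -> None:
        # Bind the parent implementation first; zero-argument super() does not
        # work inside the nested function below.
        parent_link = super()._link_checkpoint

        def finalize_fn() -> None:
            parent_link(filepath, linkpath)

        if self.async_save:
            # Defer linking so the link never points at a half-written checkpoint.
            self.pending.append(finalize_fn)
        else:
            finalize_fn()

    def finalize_pending(self) -> None:
        """Run deferred link callbacks in the order they were scheduled."""
        while self.pending:
            self.pending.pop(0)()
```

Because the override only decides when to link and leaves how to link to the parent, a fix to the upstream implementation would be picked up without further changes.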
@@ -0,0 +1,143 @@
import os
from dataclasses import dataclass
[Code scanning / CodeQL notice (test): Unused import]
import pytest
import pytorch_lightning as pl
import torch
from megatron.core import ModelParallelConfig
[Code scanning / CodeQL notice (test): Unused import]
from pytorch_lightning.utilities.types import EVAL_DATALOADERS, TRAIN_DATALOADERS

import nemo.lightning as nl
from nemo.collections import llm
[Code scanning / CodeQL notice (test): Unused import]
model = ExampleModel()

data = RandomDataset(32, 64)
save_top_k = 3
[Code scanning / CodeQL notice (test): Unused local variable]
use_datetime_version=False,
)

strategy = nl.MegatronStrategy(ckpt_async_save=True, replace_progress_bar=False)
[Code scanning / CodeQL notice (test): Unused local variable]
What does this PR do?
Adds support for creating a symlink for -last checkpoints. The implementation is compatible with both synchronous and asynchronous checkpointing.
Collection: llm
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.