
[NeMo-UX] Support save_last="link" #10548

Status: Open. ashors1 wants to merge 22 commits into main.
Conversation

ashors1 (Collaborator) commented Sep 20, 2024

What does this PR do?

Adds support for creating a symlink for -last checkpoints. The implementation is compatible with both synchronous and asynchronous checkpointing.

Collection: llm

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • A minimal usage sketch is shown below (illustrative only).
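A minimal usage sketch (not taken from the PR itself; nl.ModelCheckpoint, nl.Trainer, and the exact argument names are assumptions based on the nemo.lightning API used in the test file below, and only save_last="link" is the new behavior):

# Illustrative only: argument names other than save_last="link" are assumptions.
import nemo.lightning as nl

checkpoint = nl.ModelCheckpoint(
    save_last="link",       # write the <name>-last checkpoint as a symlink instead of a full copy
    save_top_k=3,
    every_n_train_steps=100,
)

trainer = nl.Trainer(
    max_steps=1000,
    strategy=nl.MegatronStrategy(ckpt_async_save=True),  # linking is also compatible with async saves
    callbacks=[checkpoint],
)
# trainer.fit(model, datamodule) as usual; the -last checkpoint then appears on disk as a symlink.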

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and re-add the label.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Review thread on nemo/lightning/pytorch/callbacks/model_checkpoint.py (outdated, resolved):
            logging.info(f'Scheduled async checkpoint save for {filepath}')
        else:
            finalize_fn()

    def _save_last_checkpoint(self, trainer: "pl.Trainer", monitor_candidates: Dict[str, torch.Tensor]) -> None:
Collaborator:

I'm wondering if there is some way to avoid overriding the whole method. This is always risky, since we lose touch with the upstream.

How is our flow different from the one in PTL, such that we need to add the saved_current_step logic and also not rely on self.last_model_path?
Is it because PTL links to any available last checkpoint (not necessarily one from the last iteration)?

ashors1 (Collaborator, Author):

Yeah, I made two changes:

  1. Made sure to add a symlink only when the current step was actually saved (sketched below). As you suggested, PTL always links to the last checkpoint saved, which might not correspond to the latest step.
  2. Added these lines, which fix the last_model_path value saved to the *-last checkpoint's state dict when using symlinks.

I'll think about whether we can make these fixes without overwriting the entire _save_last_checkpoint method.
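To illustrate the first change, a standalone sketch of that guard (hypothetical helper and names, not the PR's diff): the symlink is only created when the checkpoint it would point to was written for the current step; otherwise the caller falls back to writing a real -last checkpoint.

# Illustrative helper; not taken from the PR.
from pathlib import Path

def maybe_link_last(last_saved_path: str, last_saved_step: int, current_step: int, linkpath: str) -> bool:
    """Symlink linkpath -> last_saved_path only if that checkpoint is from current_step."""
    if last_saved_step != current_step:
        # Caller should write a real -last checkpoint instead of linking to a stale one.
        return False
    link = Path(linkpath)
    if link.is_symlink() or link.is_file():
        link.unlink()
    link.symlink_to(Path(last_saved_path).resolve())
    return True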

Collaborator:

Thanks. Maybe overwriting _save_last_checkpoint is inevitable, in which case the current version is OK.

ashors1 (Collaborator, Author):

Running some final tests now, but I think I was able to avoid overwriting _save_last_checkpoint. Please let me know if you have any concerns with the current approach

mikolajblaz (Collaborator) commented Sep 24, 2024:

Great, thanks!

Do you know how last_model_path is used during restart? I'm wondering whether the loaded state dict will be valid if, for example, a failure happens between the regular and "last" checkpoint saves.

ashors1 (Collaborator, Author):

I think last_model_path is only used when removing the previous -last checkpoint, to ensure we retain only a single -last checkpoint. If a failure happens between the regular and last checkpoint saves, I don't think the state dict will be valid, but I also don't think this is a concern: we'd end up restoring from the previously saved -last checkpoint, which does have the correct state dict.

Review thread on nemo/lightning/pytorch/callbacks/model_checkpoint.py (outdated, resolved)
ashors1 and others added 7 commits September 23, 2024 09:19
        # NeMo distributed checkpoints are directories, so link the directory paths
        filepath = ckpt_to_dir(filepath)
        linkpath = ckpt_to_dir(linkpath)
        super()._link_checkpoint(trainer, filepath, linkpath)

Collaborator:

Is there a way to avoid overriding PTL's _link_checkpoint method? We want to avoid overriding PTL's private methods so that the code stays stable.

ashors1 (Collaborator, Author):

I can think about it, but it might be challenging to support linking with async checkpointing without overwriting this method. Also, the addition of saved_current_step is needed to fix a bug that seems to exist in PTL's link implementation, in which the -last checkpoint gets linked to the most recently saved checkpoint even if that corresponds to a different step.

athitten (Collaborator) commented Sep 23, 2024:

I see, thanks @ashors1! Then maybe we should file an issue with PTL and ask them to fix this bug? That way we could avoid overriding private methods.

ashors1 (Collaborator, Author):

Yeah, it would be great if PTL could fix that issue! But we'd still have to figure out how to handle async checkpointing, and I do think that would require us to overwrite either _link_checkpoint or _save_last_checkpoint (where _link_checkpoint is invoked).

Collaborator:

Agreed that for handling async save we have to override _link_checkpoint anyway.
But since we call super()._link_checkpoint, I don't think there is much risk in that.
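For context, a rough sketch of the override pattern being discussed (an illustrative subclass of PTL's ModelCheckpoint, not NeMo's class; the deferral hook and the path handling are approximations): both paths are resolved to checkpoint directories, and with async saving the actual link is deferred until the save is finalized.

# Rough sketch only; in NeMo the paths go through ckpt_to_dir and the deferred callback
# is queued with the async checkpointing machinery rather than a plain list.
from pathlib import Path
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint as PTLModelCheckpoint

class LinkingModelCheckpoint(PTLModelCheckpoint):
    def __init__(self, *args, async_save: bool = False, **kwargs):
        super().__init__(*args, **kwargs)
        self.async_save = async_save
        self._deferred = []  # link callbacks to run when an async save is finalized

    def _link_checkpoint(self, trainer: "pl.Trainer", filepath: str, linkpath: str) -> None:
        # Distributed checkpoints are directories; approximate ckpt_to_dir by stripping ".ckpt".
        filepath = str(Path(filepath).with_suffix(""))
        linkpath = str(Path(linkpath).with_suffix(""))
        do_link = lambda: super(LinkingModelCheckpoint, self)._link_checkpoint(trainer, filepath, linkpath)
        if self.async_save:
            # Defer the link until the checkpoint actually exists on disk.
            self._deferred.append(do_link)
        else:
            do_link()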

ashors1 and others added 3 commits September 23, 2024 22:36
@@ -0,0 +1,143 @@
import os
from dataclasses import dataclass

CodeQL (code scanning) notice: Import of 'dataclass' is not used.
import pytest
import pytorch_lightning as pl
import torch
from megatron.core import ModelParallelConfig

CodeQL (code scanning) notice: Import of 'ModelParallelConfig' is not used.
from pytorch_lightning.utilities.types import EVAL_DATALOADERS, TRAIN_DATALOADERS

import nemo.lightning as nl
from nemo.collections import llm

CodeQL (code scanning) notice: Import of 'llm' is not used.
model = ExampleModel()

data = RandomDataset(32, 64)
save_top_k = 3

CodeQL (code scanning) notice: Local variable save_top_k is not used.
use_datetime_version=False,
)

strategy = nl.MegatronStrategy(ckpt_async_save=True, replace_progress_bar=False)

CodeQL (code scanning) notice: Local variable strategy is not used.