Refine the error message when auto_scale_lr is not set correctly #1181

Merged (5 commits, Jun 6, 2023)
19 changes: 12 additions & 7 deletions mmengine/runner/runner.py
```diff
@@ -1975,16 +1975,21 @@ def resume(self,
         if (previous_gpu_ids is not None and len(previous_gpu_ids) > 0
                 and len(previous_gpu_ids) != self._world_size):
             # TODO, should we modify the iteration?
-            self.logger.info(
-                'Number of GPU used for current experiment is not '
-                'consistent with resuming from checkpoint')
             if (self.auto_scale_lr is None
                     or not self.auto_scale_lr.get('enable', False)):
                 raise RuntimeError(
-                    'Cannot automatically rescale lr in resuming. Please '
-                    'make sure the number of GPU is consistent with the '
-                    'previous training state resuming from the checkpoint '
-                    'or set `enable` in `auto_scale_lr to False.')
+                    'Number of GPUs used for current experiment is not '
+                    'consistent with the checkpoint being resumed from. '
+                    'This will result in poor performance due to the '
+                    'learning rate. You must set the '
+                    '`auto_scale_lr` parameter for Runner and make '
+                    '`auto_scale_lr["enable"]=True`.')
+            else:
+                self.logger.info(
+                    'Number of GPUs used for current experiment is not '
+                    'consistent with resuming from checkpoint, but the '
+                    'learning rate will be adjusted according to the '
+                    f'setting in auto_scale_lr={self.auto_scale_lr}')
 
             # resume random seed
             resumed_seed = checkpoint['meta'].get('seed', None)
```
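To illustrate the behavior this diff introduces, here is a minimal standalone sketch of the resume-time GPU-count check. The helper function name is hypothetical; `previous_gpu_ids`, `world_size`, and `auto_scale_lr` mirror the corresponding `Runner` attributes in `mmengine/runner/runner.py`, and the messages are condensed versions of the ones in the diff.

```python
def check_resume_gpu_consistency(previous_gpu_ids, world_size, auto_scale_lr):
    """Hypothetical sketch of the check added by this PR.

    Returns a log message when resuming is safe; raises RuntimeError when
    the GPU count changed and automatic lr rescaling is not enabled.
    """
    # GPU count unchanged (or unknown): nothing to rescale.
    if not previous_gpu_ids or len(previous_gpu_ids) == world_size:
        return 'GPU count matches the checkpoint; nothing to do.'

    # GPU count changed: only safe if auto_scale_lr is enabled.
    if auto_scale_lr is None or not auto_scale_lr.get('enable', False):
        raise RuntimeError(
            'Number of GPUs used for current experiment is not '
            'consistent with the checkpoint being resumed from. '
            'Set the `auto_scale_lr` parameter for Runner and make '
            '`auto_scale_lr["enable"]=True`.')

    return ('GPU count changed; the learning rate will be adjusted '
            f'according to auto_scale_lr={auto_scale_lr}')
```

This mirrors the PR's intent: the hard failure fires only when resuming would silently train with a mis-scaled learning rate, while an enabled `auto_scale_lr` downgrades the situation to an informational log.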