[Enhancement] Support deterministic training in benchmark (#1356)
* support deterministic training in benchmark

* add kill-on-bad-exit to benchmark
LeoXing1996 committed Oct 31, 2022
1 parent 6d83fb6 commit 3375fff
Showing 2 changed files with 21 additions and 2 deletions.
10 changes: 9 additions & 1 deletion .dev_scripts/README.md
@@ -211,7 +211,15 @@ python .dev_scripts/train_benchmark.py mm_lol \

Specifically, `--rerun-fail` and `--rerun-cancel` can be used together to rerun both failed and canceled jobs.

## 8. Automatically check links
## 8. `deterministic` training

Setting `torch.backends.cudnn.deterministic = True` and `torch.backends.cudnn.benchmark = False` removes nondeterministic operations from PyTorch training. You can add the `--deterministic` flag when starting your benchmark training to remove the influence of such randomness.

```shell
python .dev_scripts/train_benchmark.py mm_lol --job-name xzn --models pix2pix --cpus-per-job 16 --run --deterministic
```
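For reference, the settings mentioned above correspond roughly to the following PyTorch calls. This is a minimal sketch of a standard deterministic setup, not necessarily the exact code path the benchmark script or training framework uses.

```python
# Sketch of a typical PyTorch deterministic setup (an assumption, not the
# exact code path used by the benchmark script).
import os
import torch

# Required for deterministic cuBLAS kernels on CUDA >= 10.2; the benchmark
# script exports the same variable in the generated job script (see the diff below).
os.environ['CUBLAS_WORKSPACE_CONFIG'] = ':4096:8'

torch.backends.cudnn.deterministic = True  # always select deterministic cuDNN algorithms
torch.backends.cudnn.benchmark = False     # disable cuDNN auto-tuning, which is nondeterministic
```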

## 9. Automatically check links

Use the following script to check whether the links in the documentation are valid:

13 changes: 12 additions & 1 deletion .dev_scripts/train_benchmark.py
@@ -117,6 +117,10 @@ def parse_args():
'--work-dir',
default='work_dirs/benchmark_train',
help='the dir to save metric')
parser.add_argument(
'--deterministic',
action='store_true',
help='whether to set `deterministic` during training')
parser.add_argument(
'--run', action='store_true', help='run script directly')
parser.add_argument(
@@ -239,10 +243,14 @@ def create_train_job_batch(commands, model_info, args, port, script_name):
job_script += (f'#SBATCH --gres=gpu:{n_gpus}\n'
f'#SBATCH --ntasks-per-node={min(n_gpus, 8)}\n'
f'#SBATCH --ntasks={n_gpus}\n'
f'#SBATCH --cpus-per-task={args.cpus_per_job}\n\n')
f'#SBATCH --cpus-per-task={args.cpus_per_job}\n'
f'#SBATCH --kill-on-bad-exit=1\n\n')
else:
job_script += '\n\n' + 'export CUDA_VISIBLE_DEVICES=-1\n'

if args.deterministic:
job_script += 'export CUBLAS_WORKSPACE_CONFIG=:4096:8\n'

job_script += (f'export MASTER_PORT={port}\n'
f'{runner} -u {script_name} {config} '
f'--work-dir={work_dir} '
@@ -254,6 +262,9 @@ def create_train_job_batch(commands, model_info, args, port, script_name):
if args.amp:
job_script += ' --amp '

if args.deterministic:
job_script += ' --cfg-options randomness.deterministic=True'

job_script += '\n'

with open(work_dir / 'job.sh', 'w') as f:
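For context, the `--cfg-options randomness.deterministic=True` override appended above targets the `randomness` field of an MMEngine-style config. Below is a hedged sketch of the equivalent config entry; the seed value is a placeholder, not taken from the repository.

```python
# Sketch of the config field that `--cfg-options randomness.deterministic=True`
# overrides; the seed value here is a hypothetical placeholder.
randomness = dict(seed=2022, deterministic=True)
```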
