
Error when speeding up training #81

Closed
moeinheidari7829 opened this issue May 27, 2024 · 2 comments

Labels
enhancement New feature or request

Comments

@moeinheidari7829

Dear authors, thank you for the great work and open-source code.

I am training the model on my own dataset. However, when I set the following arguments to True (to speed up training), I get the error below; when I set them to False, the model trains without errors:

SPEED-UP OPTIONS:

use_compile: True
mixed_precision: True
enable_xformers_memory_efficient_attention: True
gradient_checkpointing: True

Error:

Traceback (most recent call last):
File "/scratch/st-ilker-1/moein/code/Latte/train.py", line 285, in
main(OmegaConf.load(args.config))
File "/scratch/st-ilker-1/moein/code/Latte/train.py", line 162, in main
update_ema(ema, model.module) #, decay=0) # Ensure EMA is initialized with synced weights
File "/project/st-ilker-1/moein/moein-envs/latte-env/lib/python3.8/site-packages/torch/utils/contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/scratch/st-ilker-1/moein/code/Latte/utils.py", line 200, in update_ema
ema_params[name].mul_(decay).add_(param.data, alpha=1 - decay)
KeyError: '_orig_mod.pos_embed'
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47691 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47692 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47694 closing signal SIGTERM
Traceback (most recent call last):
sys.exit(main())
return f(*args, **kwargs)
return launch_agent(self._config, self._entrypoint, list(args))
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Has anyone had the same experience?

@moeinheidari7829
Author

Update:

The option causing the problem is use_compile.
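
For reference, torch.compile wraps the model in an OptimizedModule and prefixes every parameter name with _orig_mod., which matches the '_orig_mod.pos_embed' key in the traceback above. A minimal sketch with a toy nn.Linear (not the Latte model) that only illustrates the renaming:

import torch
import torch.nn as nn

# Toy module; the point is the parameter renaming, not the model itself.
model = nn.Linear(4, 4)
compiled = torch.compile(model)

print([name for name, _ in model.named_parameters()])
# ['weight', 'bias']
print([name for name, _ in compiled.named_parameters()])
# ['_orig_mod.weight', '_orig_mod.bias']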

@maxin-cn
Collaborator

maxin-cn commented May 27, 2024

Thank you for your interest. The Latte code exposes a few APIs that can accelerate training, but I have not verified their correctness. torch.compile wraps the model in a different PyTorch module class and prefixes its parameter names with _orig_mod. (hence the KeyError: '_orig_mod.pos_embed'), so the EMA copy and the compiled model no longer match. If you solve these problems, we welcome your PR~
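
One possible workaround, sketched under the assumption that update_ema in utils.py looks up EMA parameters by name (as the traceback suggests): strip the _orig_mod. prefix before the lookup, or pass the unwrapped module (model.module._orig_mod under DDP) into update_ema. This is an illustration of the idea, not a tested patch against the Latte code:

import torch

@torch.no_grad()
def update_ema(ema_model, model, decay=0.9999):
    # Sketch only: assumes ema_model is the plain (uncompiled) copy, so its
    # parameter names lack the '_orig_mod.' prefix that torch.compile adds.
    ema_params = dict(ema_model.named_parameters())
    for name, param in model.named_parameters():
        # Drop the prefix added by torch.compile so the names line up again.
        plain_name = name.replace("_orig_mod.", "")
        ema_params[plain_name].mul_(decay).add_(param.data, alpha=1 - decay)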

maxin-cn added the enhancement (New feature or request) label on May 27, 2024