
Error when speeding up training #81

Closed
moeinheidari7829 opened this issue May 27, 2024 · 2 comments

Labels
enhancement New feature or request

Comments

@moeinheidari7829

Dear authors, thank you for the great work and open-source code.

I am training the model on my own dataset. However, when I set the following arguments to True (to speed up training), I get the error below; when I set them to False, the model trains without errors:

SPEED-UP OPTIONS:

use_compile: True
mixed_precision: True
enable_xformers_memory_efficient_attention: True
gradient_checkpointing: True

Error:

Traceback (most recent call last):
File "/scratch/st-ilker-1/moein/code/Latte/train.py", line 285, in
main(OmegaConf.load(args.config))
File "/scratch/st-ilker-1/moein/code/Latte/train.py", line 162, in main
update_ema(ema, model.module) #, decay=0) # Ensure EMA is initialized with synced weights
File "/project/st-ilker-1/moein/moein-envs/latte-env/lib/python3.8/site-packages/torch/utils/contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/scratch/st-ilker-1/moein/code/Latte/utils.py", line 200, in update_ema
ema_params[name].mul_(decay).add_(param.data, alpha=1 - decay)
KeyError: '_orig_mod.pos_embed'
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47691 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47692 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 47694 closing signal SIGTERM
Traceback (most recent call last):
sys.exit(main())
return f(*args, **kwargs)
return launch_agent(self._config, self._entrypoint, list(args))
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Has anyone had the same experience?

@moeinheidari7829
Author

Update:

The option causing the problem is use_compile.
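
For reference, torch.compile wraps the model in an OptimizedModule and prefixes every parameter name with _orig_mod., which matches the '_orig_mod.pos_embed' key in the traceback above. A minimal sketch with a toy nn.Linear (not the Latte model) that only illustrates the renaming:

import torch
import torch.nn as nn

# Toy module; the point is the parameter renaming, not the model itself.
model = nn.Linear(4, 4)
compiled = torch.compile(model)

print([name for name, _ in model.named_parameters()])
# ['weight', 'bias']
print([name for name, _ in compiled.named_parameters()])
# ['_orig_mod.weight', '_orig_mod.bias']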

@maxin-cn
Collaborator

maxin-cn commented May 27, 2024

Thank you for your interest. The Latte code exposes a few APIs that can accelerate training, but I have not verified their correctness. torch.compile wraps the model in a different PyTorch module class and prefixes its parameter names with _orig_mod. (hence the KeyError: '_orig_mod.pos_embed'), so the EMA copy and the compiled model no longer match. If you solve these problems, we welcome your PR~
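
One possible workaround, sketched under the assumption that update_ema in utils.py looks up EMA parameters by name (as the traceback suggests): strip the _orig_mod. prefix before the lookup, or pass the unwrapped module (model.module._orig_mod under DDP) into update_ema. This is an illustration of the idea, not a tested patch against the Latte code:

import torch

@torch.no_grad()
def update_ema(ema_model, model, decay=0.9999):
    # Sketch only: assumes ema_model is the plain (uncompiled) copy, so its
    # parameter names lack the '_orig_mod.' prefix that torch.compile adds.
    ema_params = dict(ema_model.named_parameters())
    for name, param in model.named_parameters():
        # Drop the prefix added by torch.compile so the names line up again.
        plain_name = name.replace("_orig_mod.", "")
        ema_params[plain_name].mul_(decay).add_(param.data, alpha=1 - decay)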

maxin-cn added the enhancement (New feature or request) label on May 27, 2024