
Script run_mlm_no_trainer.py error #15081

Closed
2 tasks done
cyk1337 opened this issue Jan 9, 2022 · 5 comments

cyk1337 commented Jan 9, 2022

Environment info

  • transformers version: 4.14.0.dev0
  • Platform: Linux-3.10.0_3-0-0-12-x86_64-with-centos-6.3-Final
  • Python version: 3.7.11
  • PyTorch version (GPU?): 1.7.1 (True)
  • Tensorflow version (GPU?): 2.7.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.3.6 (cpu)
  • Jax version: 0.2.26
  • JaxLib version: 0.1.75
  • Using GPU in script?: Y
  • Using distributed or parallel set-up in script?: Y

Who can help

@patrickvonplaten @LysandreJik

Information

Model I am using: roberta-base

The problem arises when using: the official example script run_mlm_no_trainer.py.

The task I am working on is:

  • an official pre-training task: running the MLM pre-training script.

To reproduce

Steps to reproduce the behavior:

Following the official instructions, run python run_mlm_no_trainer.py:

python run_mlm_no_trainer.py \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --model_name_or_path roberta-base \
    --output_dir /tmp/test-mlm

Expected behavior

Training should run to completion. Instead, the following error is raised:

[INFO|trainer.py:1204] 2022-01-09 20:51:14,185 >> ***** Running training *****
[INFO|trainer.py:1205] 2022-01-09 20:51:14,185 >>   Num examples = 4650
[INFO|trainer.py:1206] 2022-01-09 20:51:14,185 >>   Num Epochs = 3
[INFO|trainer.py:1207] 2022-01-09 20:51:14,185 >>   Instantaneous batch size per device = 8
[INFO|trainer.py:1208] 2022-01-09 20:51:14,186 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1209] 2022-01-09 20:51:14,186 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:1210] 2022-01-09 20:51:14,186 >>   Total optimization steps = 219
  0%|                                                                                                   | 0/219 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/xxx/.vscode-server/extensions/ms-python.python-2021.1.502429796/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/home/xxx/.vscode-server/extensions/ms-python.python-2021.1.502429796/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/home/xxx/.vscode-server/extensions/ms-python.python-2021.1.502429796/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/xxx/transformers/examples/pytorch/demo/run_mlm.py", line 556, in <module>
    main()
  File "/home/xxx/transformers/examples/pytorch/demo/run_mlm.py", line 505, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/xxx/transformers/src/transformers/trainer.py", line 1325, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/xxx/transformers/src/transformers/trainer.py", line 1884, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/xxx/transformers/src/transformers/trainer.py", line 1916, in compute_loss
    outputs = model(**inputs)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 161, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 171, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
    output.reraise()
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xxx/transformers/src/transformers/models/roberta/modeling_roberta.py", line 1108, in forward
    return_dict=return_dict,
  File "/home/xxx/anaconda3/envs/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xxx/transformers/src/transformers/models/roberta/modeling_roberta.py", line 819, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (1024) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [8, 1024].  Tensor sizes: [1, 514]
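
For context, a minimal sketch (not part of the original report) that reproduces the same failure: roberta-base registers a token_type_ids buffer sized to its 514 position embeddings, so expanding it to a 1024-token batch triggers the identical RuntimeError.

import torch

# Hypothetical stand-in for the buffer built in modeling_roberta.py:
# shape (1, 514) because roberta-base has max_position_embeddings = 514.
buffered_token_type_ids = torch.zeros(1, 514, dtype=torch.long)

batch_size, seq_length = 8, 1024  # the sizes reported in the traceback above
# Raises: RuntimeError: The expanded size of the tensor (1024) must match
# the existing size (514) at non-singleton dimension 1.
buffered_token_type_ids.expand(batch_size, seq_length)
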
LysandreJik (Member) commented:

cc @sgugger

sgugger (Collaborator) commented Jan 10, 2022

Which command are you running exactly? The logs you posted use distributed training, whereas the command you gave us (which runs successfully on my side) launches the script with plain python.

cyk1337 (Author) commented Jan 11, 2022

I just reran it on another machine but got the same issue.

The exact command is:

$ python run_mlm_no_trainer.py --model_name_or_path=./roberta-base --dataset_name=wikitext --dataset_config_name=wikitext-2-raw-v1 --output_dir=./test_mlm_out

where the ./roberta-base directory contains:

 $ ls roberta-base/
config.json  merges.txt  pytorch_model.bin  vocab.json

The output was:

01/11/2022 11:59:36 - INFO - __main__ - ***** Running training *****
01/11/2022 11:59:36 - INFO - __main__ -   Num examples = 2390
01/11/2022 11:59:36 - INFO - __main__ -   Num Epochs = 3
01/11/2022 11:59:36 - INFO - __main__ -   Instantaneous batch size per device = 8
01/11/2022 11:59:36 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 8
01/11/2022 11:59:36 - INFO - __main__ -   Gradient Accumulation steps = 1
01/11/2022 11:59:36 - INFO - __main__ -   Total optimization steps = 897
  0%|                                                                                                                                                                                         | 0/897 [00:00<?, ?it/s]Traceback (most recent call last):
  File "run_mlm_no_trainer.py", line 566, in <module>
    main()
  File "run_mlm_no_trainer.py", line 513, in main
    outputs = model(**batch)
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 1106, in forward
    return_dict=return_dict,
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/root/xx/workspace/env_run/accelerate_test/torch1.7/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 817, in forward
    buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
RuntimeError: The expanded size of the tensor (1024) must match the existing size (514) at non-singleton dimension 1.  Target sizes: [8, 1024].  Tensor sizes: [1, 514]
  0%|                                                                                                                                                                                         | 0/897 [00:00<?, ?it/s]

Possible Solution

The reported issue is due to a mismatch in the last dimension between the target size (1024) and the existing tensor size (514) of token_type_ids. I suspect this is caused by leaving --max_seq_length unspecified. With the additional argument --max_seq_length=512, it works. Is that correct?
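
For reference, the command with the extra argument that is reported to work (the same flags as above, plus --max_seq_length):

python run_mlm_no_trainer.py \
    --model_name_or_path=./roberta-base \
    --dataset_name=wikitext \
    --dataset_config_name=wikitext-2-raw-v1 \
    --max_seq_length=512 \
    --output_dir=./test_mlm_out
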

sgugger (Collaborator) commented Jan 11, 2022

I have no idea what the content of your roberta-base folder is, but your addition is probably correct. It works with the official checkpoint, where the model specifies a max length that the script then uses; maybe that is the part missing in your local checkpoint.
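
A quick way to check that hypothesis (a hedged sketch; it assumes the local folder is missing tokenizer_config.json and therefore the saved maximum length, so the tokenizer falls back to a very large default and the script ends up using the 1024 sequence length visible in the traceback, which exceeds roberta-base's 514 position embeddings):

from transformers import AutoTokenizer

# Local folder from the report (assumed to lack tokenizer_config.json).
tok_local = AutoTokenizer.from_pretrained("./roberta-base")
print(tok_local.model_max_length)  # huge sentinel value if no max length was saved

# Official checkpoint from the Hub.
tok_hub = AutoTokenizer.from_pretrained("roberta-base")
print(tok_hub.model_max_length)  # 512, which the example script then uses
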

cyk1337 (Author) commented Jan 11, 2022

Yeah, you are correct. The checkpoint that the official script downloads works. There might be something mismatched in my cached roberta-base folder (it was manually downloaded from AWS and is probably not the newest version). Thank you for pointing this out.
