AttributeError: 'ScalingTensor' object has no attribute 'view' #180

Open
LSC527 opened this issue Jul 25, 2024 · 3 comments


LSC527 commented Jul 25, 2024

What's the issue, what's expected?:
An error is raised when using MS-AMP for LLM supervised fine-tuning (SFT).
MS-AMP DeepSpeed config (opt levels "O1", "O2", and "O3" were all tried, with the same result):

"msamp": {
    "enabled": true,
    "opt_level": "O1|O2|O3",
    "use_te": false
}

How to reproduce it?:
Follow the DeepSpeed-Chat setup, then make two small code modifications to enable MS-AMP in DeepSpeed-Chat/training/step1_supervised_finetuning/main.py:

line 20, modify: import deepspeed -> from msamp import deepspeed

line 230, add:

ds_config["msamp"] = {
    "enabled": True,
    "opt_level": "O1|O2|O3",
    "use_te": False
}
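
For context, the two modifications land in main.py roughly like this (a sketch; the surrounding variable names are assumptions based on the stock DeepSpeed-Chat script):

from msamp import deepspeed  # was: import deepspeed (line-20 change)

# ... build ds_config as the original script does, then extend it:
ds_config["msamp"] = {
    "enabled": True,
    "opt_level": "O1",  # "O2" and "O3" reproduce the same error
    "use_te": False,
}

# MS-AMP's deepspeed wrapper is initialized like vanilla DeepSpeed.
model, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
    lr_scheduler=lr_scheduler,
)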

Log message or snapshot?:

Traceback (most recent call last):
  File "/home/work/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 400, in <module>
    main()
  File "/home/work/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 369, in main
    model.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/engine.py", line 405, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 951, in backward
    super().backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2040, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 491, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/msamp/nn/functional.py", line 123, in backward
    ctx.weight.backward_grad_update(wgrad)
  File "/usr/local/lib/python3.10/dist-packages/msamp/common/tensor/tensor.py", line 130, in backward_grad_update
    self._backward_post_hooks(grad)
  File "/usr/local/lib/python3.10/dist-packages/msamp/common/tensor/hook.py", line 47, in __call__
    hook(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1581, in _call_impl
    hook_result = hook(self, args, result)
  File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 386, in reduce_partition_and_remove_grads
    self.fp8_reduce_ready_partitions_and_remove_grads(param, i)
  File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 595, in fp8_reduce_ready_partitions_and_remove_grads
    self.fp8_reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 412, in fp8_reduce_independent_p_g_buckets_and_remove_grads
    self.fp8_reduce_ipg_grads()
  File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 541, in fp8_reduce_ipg_grads
    self.fp8_average_tensor(self.fp8_extra_large_param_to_reduce.grad.view(-1))
AttributeError: 'ScalingTensor' object has no attribute 'view'

Additional information:
Environment: ghcr.io/azure/msamp:v0.4.0-cuda12.2
GPUs: 8 × H100

wkcn (Contributor) commented Jul 26, 2024

@LSC527
Thank you for pointing out the bug!

A temporary workaround is to increase reduce_bucket_size under zero_optimization in the DeepSpeed config. This keeps every gradient inside a reduction bucket, so the failing path in the traceback (fp8_reduce_ipg_grads calling .view(-1) on an extra-large parameter's ScalingTensor gradient) is never taken:

"zero_optimization": {
    "reduce_bucket_size": 5e8,
},
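
For the DeepSpeed-Chat reproduction above, the equivalent change can be applied directly to the ds_config dict (a sketch, assuming a zero_optimization section may or may not already exist in the config):

# Raise the ZeRO reduce bucket size so that no single parameter's gradient
# is treated as "extra large" and hits the missing ScalingTensor.view() path.
zero_cfg = ds_config.setdefault("zero_optimization", {})
zero_cfg["reduce_bucket_size"] = 5e8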

LSC527 (Author) commented Jul 26, 2024

@wkcn Thanks. With an increased reduce_bucket_size it now runs, but shows no throughput gain compared to FP16.

wkcn (Contributor) commented Jul 29, 2024

@LSC527
FP8 accelerates training significantly when the model is relatively large (> 6B parameters). MS-AMP reduces memory usage, which enables a larger batch size, and it can be combined with TransformerEngine to improve FP8 training speed.
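
For reference, combining MS-AMP with TransformerEngine in the reproduction above is a config change in the msamp block (a sketch; treating "O2" as a reasonable level to pair with TE is an assumption, and the actual speedup depends on the model's layer types):

ds_config["msamp"] = {
    "enabled": True,
    "opt_level": "O2",  # assumption: one of the levels already tried above
    "use_te": True,     # hand FP8 GEMMs to TransformerEngine
}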
