AttributeError: 'ScalingTensor' object has no attribute 'view' #180

Open
LSC527 opened this issue Jul 25, 2024 · 3 comments


LSC527 commented Jul 25, 2024

What's the issue, what's expected?:
An error is raised when using MS-AMP for LLM supervised fine-tuning (SFT).
MS-AMP DeepSpeed config (opt levels "O1", "O2", and "O3" were all tried, with the same result):

"msamp": {
    "enabled": true,
    "opt_level": "O1|O2|O3",
    "use_te": false
}

How to reproduce it?:
Follow the DeepSpeed-Chat setup, then make two small code modifications to enable MS-AMP in DeepSpeed-Chat/training/step1_supervised_finetuning/main.py:

line 20, modify: import deepspeed -> from msamp import deepspeed

line 230, add:

ds_config["msamp"] = {
    "enabled": True,
    "opt_level": "O1|O2|O3",
    "use_te": False
}
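
For context, the two modifications land in main.py roughly like this (a sketch; the surrounding variable names are assumptions based on the stock DeepSpeed-Chat script):

from msamp import deepspeed  # was: import deepspeed (line-20 change)

# ... build ds_config as the original script does, then extend it:
ds_config["msamp"] = {
    "enabled": True,
    "opt_level": "O1",  # "O2" and "O3" reproduce the same error
    "use_te": False,
}

# MS-AMP's deepspeed wrapper is initialized like vanilla DeepSpeed.
model, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config=ds_config,
    lr_scheduler=lr_scheduler,
)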

Log message or snapshot?:

Traceback (most recent call last):
  File "/home/work/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 400, in <module>
    main()
  File "/home/work/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 369, in main
    model.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/engine.py", line 405, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 951, in backward
    super().backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2040, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 491, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/msamp/nn/functional.py", line 123, in backward
    ctx.weight.backward_grad_update(wgrad)
  File "/usr/local/lib/python3.10/dist-packages/msamp/common/tensor/tensor.py", line 130, in backward_grad_update
    self._backward_post_hooks(grad)
  File "/usr/local/lib/python3.10/dist-packages/msamp/common/tensor/hook.py", line 47, in __call__
    hook(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1581, in _call_impl
    hook_result = hook(self, args, result)
  File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 386, in reduce_partition_and_remove_grads
    self.fp8_reduce_ready_partitions_and_remove_grads(param, i)
  File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 595, in fp8_reduce_ready_partitions_and_remove_grads
    self.fp8_reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 412, in fp8_reduce_independent_p_g_buckets_and_remove_grads
    self.fp8_reduce_ipg_grads()
  File "/usr/local/lib/python3.10/dist-packages/msamp/deepspeed/runtime/zero/fp8_stage_1_and_2.py", line 541, in fp8_reduce_ipg_grads
    self.fp8_average_tensor(self.fp8_extra_large_param_to_reduce.grad.view(-1))
AttributeError: 'ScalingTensor' object has no attribute 'view'

Additional information:
Environment: ghcr.io/azure/msamp:v0.4.0-cuda12.2
GPUs: 8 × H100

wkcn (Contributor) commented Jul 26, 2024

@LSC527
Thank you for pointing out the bug!

A temporary workaround is to increase reduce_bucket_size under zero_optimization in the DeepSpeed config. This keeps every gradient inside a reduction bucket, so the failing path in the traceback (fp8_reduce_ipg_grads calling .view(-1) on an extra-large parameter's ScalingTensor gradient) is never taken:

"zero_optimization": {
    "reduce_bucket_size": 5e8,
},
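
For the DeepSpeed-Chat reproduction above, the equivalent change can be applied directly to the ds_config dict (a sketch, assuming a zero_optimization section may or may not already exist in the config):

# Raise the ZeRO reduce bucket size so that no single parameter's gradient
# is treated as "extra large" and hits the missing ScalingTensor.view() path.
zero_cfg = ds_config.setdefault("zero_optimization", {})
zero_cfg["reduce_bucket_size"] = 5e8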

LSC527 (Author) commented Jul 26, 2024

@wkcn Thanks. With an increased reduce_bucket_size it now runs, but shows no throughput gain compared to FP16.

wkcn (Contributor) commented Jul 29, 2024

@LSC527
FP8 accelerates training significantly when the model is relatively large (> 6B parameters). MS-AMP reduces memory usage, which enables a larger batch size, and it can be combined with TransformerEngine to improve FP8 training speed.
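
For reference, combining MS-AMP with TransformerEngine in the reproduction above is a config change in the msamp block (a sketch; treating "O2" as a reasonable level to pair with TE is an assumption, and the actual speedup depends on the model's layer types):

ds_config["msamp"] = {
    "enabled": True,
    "opt_level": "O2",  # assumption: one of the levels already tried above
    "use_te": True,     # hand FP8 GEMMs to TransformerEngine
}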
