[Bug Fixed] scale=INF when casting a tensor to scaling FP32/BF16 tensors #131

wkcn · 2023-11-18T05:52:22Z

Description
The following script triggers the bug.

import torch
from msamp.common.dtype import Dtypes
torch.tensor([1.0 / 512], device='cuda').cast(Dtypes.kfloat32)

It returns a tensor whose value is NaN, with scale=inf and scale_inv=0:

ScalingTensor(tensor([nan], device='cuda:0'), meta=ScalingMeta(qtype=QType(name='kFloat32', value=2), scale=inf, scale_inv=0, amax=0.00195312, window_size=1)

The reason is that the scaling factor is stored as an FP32 scalar, and it is computed as the maximum representable value of the target type divided by the maximum absolute value (amax) of the tensor.

When the maximum absolute value is less than 1.0, the scaling factor exceeds the maximum representable value of FP32, so the scaling factor overflows to INF and its inverse scale_inv is 0.
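The overflow described above can be reproduced without a GPU. The following is a minimal pure-Python sketch, not MS-AMP code: `to_fp32` and `naive_scale` are hypothetical helpers that emulate storing the scale as an FP32 scalar, and the final multiplication is only one plausible way the NaN in the output could arise (assuming the value is scaled and then multiplied by scale_inv).

```python
import math
import struct

# Largest finite FP32 value, (2 - 2**-23) * 2**127.
FP32_MAX = 3.4028234663852886e38


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to FP32, overflowing to +/-inf."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        # struct rejects values outside the FP32 range; FP32 arithmetic gives inf.
        return math.copysign(math.inf, x)


def naive_scale(amax: float, qtype_max: float = FP32_MAX) -> float:
    """Hypothetical helper: scale = qtype_max / amax, stored as FP32."""
    return to_fp32(qtype_max / amax)


amax = 1.0 / 512
scale = naive_scale(amax)             # FP32_MAX * 512 does not fit in FP32
scale_inv = to_fp32(1.0 / scale)      # 1 / inf
recovered = (amax * scale) * scale_inv  # inf * 0 (assumed round trip)
print(scale, scale_inv, recovered)    # prints: inf 0.0 nan
```

With amax equal to 1.0 the same helper returns exactly FP32_MAX, so the overflow only appears once amax drops below 1.0.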

To address the issue, the scaling factor is set to 1 when the data type is FP32 or BF16. After the fix, the script outputs:

ScalingTensor(tensor([0.0020], device='cuda:0'), meta=ScalingMeta(qtype=QType(name='kFloat32', value=2), scale=1, scale_inv=1, amax=0.00195312, window_size=1)
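The idea behind the fix can be sketched as follows. This is an illustration under stated assumptions rather than the actual patch: `compute_scale`, `QTYPE_MAX`, `UNSCALED_QTYPES`, and the string type names are hypothetical, and the handling of amax equal to 0 is an assumption added to keep the sketch safe to run.

```python
# Largest finite values of the lower-precision formats (standard constants).
QTYPE_MAX = {
    "fp8e4m3": 448.0,
    "fp8e5m2": 57344.0,
    "fp16": 65504.0,
}

# Wide-range types whose scale is pinned to 1, as described in this PR.
UNSCALED_QTYPES = {"fp32", "bf16"}


def compute_scale(qtype: str, amax: float) -> float:
    """Hypothetical helper: pick a scaling factor for the target type."""
    if qtype in UNSCALED_QTYPES:
        # FP32 and BF16 already cover a wide dynamic range, so no scaling.
        return 1.0
    if amax == 0.0:
        # Assumption: avoid dividing by zero for an all-zero tensor.
        return 1.0
    return QTYPE_MAX[qtype] / amax


print(compute_scale("fp32", 1.0 / 512))   # prints: 1.0
print(compute_scale("fp8e4m3", 2.0))      # prints: 224.0
```

Pinning the scale to 1 keeps scale_inv at 1 as well, which matches the metadata shown in the fixed output.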

tocean · 2023-11-27T02:25:48Z

It looks like the scaling factor will always be 1 for BF16 and FP32. Do we still need to compute amax in TypeCast::cast_to_fp16 for BF16 and FP32? Computing amax may be time-consuming.

wkcn · 2023-11-27T02:36:50Z

> It looks like the scaling factor will always be 1 for BF16 and FP32. Do we still need to compute amax in TypeCast::cast_to_fp16 for BF16 and FP32? Computing amax may be time-consuming.

Good catch! I will remove the computation of amax for BF16 and FP32.

wkcn · 2023-11-27T06:57:25Z

@tocean I tried to remove the amax calculation for ScalingBF16/FP32, but it affects other calculation logic. I would prefer to optimize this in a later PR.

tocean · 2023-11-28T14:22:14Z

> @tocean I tried to remove the amax calculation for ScalingBF16/FP32, but it affects other calculation logic. I would prefer to optimize this in a later PR.

Sure.

tocean

LGTM

wkcn added 3 commits November 18, 2023 12:38

[Bug Fixed] cast_to_scaling_fp32 or scaling_bf16

1e613eb

Dtypes

c36850a

ut

c1aca79

wkcn requested review from tocean and guoshzhao November 18, 2023 07:02

guoshzhao approved these changes Nov 27, 2023

Merge branch 'main' into cast_to_scaling_fp32_or_scaling_bf16

6d57944

wkcn enabled auto-merge (squash) November 28, 2023 00:34

tocean approved these changes Nov 28, 2023

wkcn merged commit c69de04 into Azure:main Nov 28, 2023
9 checks passed
