Fix qwen2vl float16 inference bug #33312

Open · wants to merge 2 commits into main from fix-qwen2vl-fp16-inference-precision
Conversation

@GeLee-Q (Contributor) commented on Sep 5, 2024

What does this PR do?

Fixes #33294 ("Qwen2-VL FP16 inference results in errors or gibberish output").

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [x] Did you read the contributor guideline, Pull Request section?
  • [x] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Inference results

The screenshot below compares three precisions (fp32, bf16, and fp16) for text generation:

fp32: memory usage is 36629 MiB; it generates a description of a beach scene.
bf16: memory usage is 44105 MiB; it generates a similar description.
fp16 (before this PR): the output is broken; only exclamation marks are generated.
fp16 (with this PR): memory usage is 31609 MiB; it generates a more detailed description of the beach scene.

Each column shows the memory usage and the generated text (or the broken output).

(screenshot: fp32 / bf16 / fp16 comparison)

Debugging process

Using PyTorch forward hooks, I exported the input and output of each layer. When both the mean and the sum are nan, I treat the layer as abnormal.
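Roughly, the hook logic looked like the sketch below. This is a minimal sketch, not the exact script used for this PR; it assumes `model` is the already-loaded Qwen2-VL model, and the helper name `make_stats_hook` is hypothetical.

import torch

def make_stats_hook(name):
    # Print basic statistics for a module's first input tensor and its output,
    # flagging inf/nan so the first abnormal layer is easy to spot.
    def hook(module, inputs, output):
        tensors = []
        if len(inputs) > 0 and torch.is_tensor(inputs[0]) and inputs[0].is_floating_point():
            tensors.append(("input", inputs[0]))
        if torch.is_tensor(output) and output.is_floating_point():
            tensors.append(("output", output))
        for tag, t in tensors:
            print(
                f"Layer {name} {tag} - mean: {t.mean().item()}, sum: {t.sum().item()}, "
                f"shape: {tuple(t.shape)}, has inf: {torch.isinf(t).any().item()}, "
                f"has nan: {torch.isnan(t).any().item()}"
            )
    return hook

# Register the hook on every leaf module of the model.
for name, module in model.named_modules():
    if len(list(module.children())) == 0:
        module.register_forward_hook(make_stats_hook(name))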

# fp16 hook 
Layer model.layers.0.self_attn.o_proj input - mean: nan, sum: nan, shape: torch.Size([1, 3602, 1536]), has inf: False, has nan: True, first 5 values: [-0.33251953125, 0.1829833984375, 0.064453125, -0.634765625, 0.5859375], last 5 values: [0.038909912109375, -0.08502197265625, 0.4794921875, 0.0762939453125, -0.1522216796875], 
Layer model.layers.0.self_attn.o_proj output - mean: nan, sum: nan, shape: torch.Size([1, 3602, 1536]), has inf: False, has nan: True, first 5 values: [0.0076446533203125, 0.019805908203125, 0.03533935546875, -0.045318603515625, -0.03057861328125], last 5 values: [-0.296875, 0.09539794921875, -0.0924072265625, 0.0084075927734375, -0.09539794921875], 

# fp32 hook 
Layer model.layers.0.self_attn.o_proj input - mean: 0.012129077687859535, sum: 67106.2109375, shape: torch.Size([1, 3602, 1536]), first 5 values: [-0.33251953125, 0.182861328125, 0.06439208984375, -0.63427734375, 0.5859375], last 5 values: [0.03431564196944237, -0.09479156881570816, 0.47655850648880005, 0.06808013468980789, -0.1560981571674347]
Layer model.layers.0.self_attn.o_proj output - mean: 0.008136402815580368, sum: 45016.046875, shape: torch.Size([3602, 1536]), first 5 values: [0.007521241903305054, 0.019816547632217407, 0.03532881289720535, -0.04532395675778389, -0.03064415045082569], last 5 values: [-0.2958013117313385, 0.09288226813077927, -0.08813470602035522, 0.007658728398382664, -0.09701263904571533]
  • Further debugging inside the forward pass revealed that attn_weights contained inf values. After softmax, nan values appeared and then propagated to all subsequent results.
class Qwen2VLAttention(nn.Module):
    ...
    attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
    ...
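To make the failure mode concrete, here is a small standalone illustration (not code from this PR) of how a float16 overflow turns into nan after softmax:

import torch

# float16 overflows above 65504, so large attention logits become inf.
big = torch.tensor(300.0, dtype=torch.float16)
print(big * big)  # tensor(inf, dtype=torch.float16): 90000 exceeds the float16 range

# Once a row of scores contains inf, softmax produces nan (inf - inf appears
# during normalization), and the nan then propagates through o_proj and beyond.
row = torch.tensor([1.0, float("inf")])
print(torch.softmax(row, dim=-1))  # tensor([nan, nan])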

Thoughts on the bug:

  • First, the problematic code is identical to LLaMA's, so the issue is unlikely to be in how the PyTorch code is written.
  • Second, LLaMA's fp16 inference works without issues, which makes it hard to blame a precision problem in the underlying operator called by torch.matmul.
  • Lastly, I cautiously suspect that the open-source Qwen2-VL weights may simply not be suitable for float16 inference, especially since the Qwen documentation indicates that fp16 is not recommended.
(screenshot of the Qwen documentation note about fp16)

Training Comparison

This modification is also effective for training.

fp32

{"current_steps": 1, "total_steps": 100, "loss": 2.4317, "learning_rate": 1e-05, "epoch": 0.7272727272727273, "percentage": 1.0, "elapsed_time": "0:00:02", "remaining_time": "0:04:45"}
{"current_steps": 2, "total_steps": 100, "loss": 2.3595, "learning_rate": 2e-05, "epoch": 1.4545454545454546, "percentage": 2.0, "elapsed_time": "0:00:05", "remaining_time": "0:04:15"}
{"current_steps": 3, "total_steps": 100, "loss": 2.4036, "learning_rate": 3e-05, "epoch": 2.1818181818181817, "percentage": 3.0, "elapsed_time": "0:00:07", "remaining_time": "0:03:55"}
{"current_steps": 4, "total_steps": 100, "loss": 2.3881, "learning_rate": 4e-05, "epoch": 2.909090909090909, "percentage": 4.0, "elapsed_time": "0:00:09", "remaining_time": "0:03:42"}
{"current_steps": 5, "total_steps": 100, "loss": 2.3942, "learning_rate": 5e-05, "epoch": 3.6363636363636362, "percentage": 5.0, "elapsed_time": "0:00:11", "remaining_time": "0:03:34"}


fp16

{"current_steps": 1, "total_steps": 1000, "loss": 0.0, "learning_rate": 0.0, "epoch": 0.09195402298850575, "percentage": 0.1, "elapsed_time": "0:00:02", "remaining_time": "0:39:13"}
{"current_steps": 2, "total_steps": 1000, "loss": 0.0, "learning_rate": 0.0, "epoch": 0.1839080459770115, "percentage": 0.2, "elapsed_time": "0:00:04", "remaining_time": "0:35:30"}
{"current_steps": 3, "total_steps": 1000, "loss": 0.0, "learning_rate": 0.0, "epoch": 0.27586206896551724, "percentage": 0.3, "elapsed_time": "0:00:06", "remaining_time": "0:34:16"}
{"current_steps": 4, "total_steps": 1000, "loss": 0.0, "learning_rate": 0.0, "epoch": 0.367816091954023, "percentage": 0.4, "elapsed_time": "0:00:08", "remaining_time": "0:33:24"}
{"current_steps": 5, "total_steps": 1000, "loss": 0.0, "learning_rate": 0.0, "epoch": 0.45977011494252873, "percentage": 0.5, "elapsed_time": "0:00:09", "remaining_time": "0:32:48"}


fp16 with this PR

{"current_steps": 1, "total_steps": 1000, "loss": 4.5531, "learning_rate": 1.0000000000000002e-06, "epoch": 0.09195402298850575, "percentage": 0.1, "elapsed_time": "0:00:02", "remaining_time": "0:40:28"}
{"current_steps": 2, "total_steps": 1000, "loss": 4.5833, "learning_rate": 2.0000000000000003e-06, "epoch": 0.1839080459770115, "percentage": 0.2, "elapsed_time": "0:00:04", "remaining_time": "0:35:50"}
{"current_steps": 3, "total_steps": 1000, "loss": 4.1749, "learning_rate": 3e-06, "epoch": 0.27586206896551724, "percentage": 0.3, "elapsed_time": "0:00:06", "remaining_time": "0:34:13"}
{"current_steps": 4, "total_steps": 1000, "loss": 4.1741, "learning_rate": 4.000000000000001e-06, "epoch": 0.367816091954023, "percentage": 0.4, "elapsed_time": "0:00:08", "remaining_time": "0:33:19"}
{"current_steps": 5, "total_steps": 1000, "loss": 4.8369, "learning_rate": 5e-06, "epoch": 0.45977011494252873, "percentage": 0.5, "elapsed_time": "0:00:09", "remaining_time": "0:32:58"}
{"c


Personal Opinion on This Modification:

  • This is just a trick that replaces every inf value produced by torch.matmul with zero (see the short demonstration below).
  • It is not a fundamental solution to the problem, but it has a significant effect on fp16 inference and LoRA training. Feedback and suggestions are welcome.
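For intuition, a minimal standalone demonstration (again not the PR diff itself) of what masking the overflowed entries does to such a row:

import torch

scores = torch.tensor([1.0, float("inf")], dtype=torch.float16)
fixed = torch.where(torch.isinf(scores), torch.zeros_like(scores), scores)
print(torch.softmax(fixed.float(), dim=-1))  # finite probabilities again, roughly [0.73, 0.27]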

@ArthurZucker @zucchini-nlp @hiyouga @simonJJJ

@ArthurZucker (Collaborator) left a comment

Hey! This indeed looks super useful, my 2 main concerns:

  • speed: at inference in fp16 this will be slower

Otherwise this sounds very much reasonable. I think llama sometimes suffers from this as well

@GeLee-Q (Contributor, Author) commented on Sep 6, 2024

> Hey! This indeed looks super useful, my 2 main concerns:
>
>   • speed: at inference in fp16 this will be slower
>
> Otherwise this sounds very much reasonable. I think llama sometimes suffers from this as well

Are you suggesting that adding this operation will slow down FP16 inference? If the precision is wrong, though, speed seems less important. I will run an end-to-end inference speed comparison tomorrow. I will also test LLaMA with float16; this is a very interesting phenomenon.

@ArthurZucker (Collaborator) left a comment
Approving, but let's make sure you actually run the slow tests! Broad support is probably more important than speed here!

@GeLee-Q (Contributor, Author) commented on Sep 6, 2024

> Approving, but let's make sure you actually run the slow tests! Broad support is probably more important than speed here!

Alright, I'll provide some data tomorrow.

@GeLee-Q (Contributor, Author) commented on Sep 7, 2024

> Approving, but let's make sure you actually run the slow tests! Broad support is probably more important than speed here!

@ArthurZucker Hello, the slow tests are done. This operation incurs some overhead, but it is not significant: it accounts for 3.51% of the total CPU time and 0.6% of the total CUDA time across all operators. Trading this small overhead for improved accuracy seems acceptable overall.

Comparison results

(screenshot of the comparison results)

profile

if query_states.dtype == torch.float16:
    attn_weights = torch.where(torch.isinf(attn_weights), torch.zeros_like(attn_weights), attn_weights)

This modification adds three operations: aten::where, aten::isinf, and aten::zeros_like.

  • aten::where: accounts for 0.98% of the total CPU time and 0.55% of the total CUDA time.
  • aten::isinf: has the longest total CPU time (160.241 ms), 1.70%, the highest of the three.
  • aten::zeros_like: has relatively low CPU and CUDA usage.

Together, these three operations consume approximately 3.51% of the total CPU time. The nsys results show that aten::where accounts for 0.6% of the total overhead across all operators.

torch profile

Operation | Self CPU % | Self CPU | CPU total % | CPU total | CPU time avg | Self CUDA | Self CUDA % | CUDA total | CUDA time avg | CPU Mem | Self CPU Mem | CUDA Mem | Self CUDA Mem | # of Calls | Total GFLOPs
aten::where | 0.45% | 42.024ms | 0.98% | 92.229ms | 27.416us | 24.981ms | 0.55% | 26.552ms | 7.893us | 0 b | 0 b | 8.39 Gb | 0 b | 3364 | --
aten::isinf | 0.14% | 13.013ms | 1.70% | 160.241ms | 49.335us | 0.000us | 0.00% | 33.928ms | 10.446us | 0 b | 0 b | 5.71 Gb | -7.46 Gb | 3248 | --
aten::zeros_like | 0.15% | 13.792ms | 0.83% | 78.403ms | 24.139us | 0.000us | 0.00% | 12.983ms | 3.997us | 0 b | 0 b | 8.38 Gb | -258.50 Kb | 3248 | --
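For reference, a per-operator table like the one above can be produced with torch.profiler; this is a minimal sketch, assuming `model` and `inputs` are prepared as in the test code further down.

from torch.profiler import profile, ProfilerActivity

# Profile one generation call and print a per-operator summary.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    model.generate(**inputs, max_new_tokens=128, do_sample=False)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))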

nsys profile

Time (%) | Total Time (ns) | Num Calls | Avg (ns) | Med (ns) | Min (ns) | Max (ns) | StdDev (ns) | Name
0.6 | 26321648 | 3248 | 8104.0 | 2816.0 | 2175 | 624031 | 57133.6 | void at::native::vectorized_elementwise_kernel<(int)4, at::native::::where_kernel_impl(at:.

test code

import time
import logging

logger = logging.getLogger(__name__)  # logger setup assumed; the original script's setup is omitted

...

def log_time_taken(start_time, step_name):
    end_time = time.time()
    logger.info(f"Time taken for {step_name}: {end_time - start_time:.4f} seconds")

...
logger.info("Starting model inference...")
inference_start = time.time()

output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,
    top_k=None,
    top_p=None,
    temperature=None,
)

log_time_taken(inference_start, "Model Inference")

change dtype

# torch_dtype was set to torch.float32, torch.float16, or torch.bfloat16 for the three runs
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir,
    torch_dtype=torch.bfloat16,
)
model.to(device)  # device is assumed to be e.g. "cuda"

@GeLee-Q (Contributor, Author) commented on Sep 7, 2024

> Hey! This indeed looks super useful, my 2 main concerns:
>
>   • speed: at inference in fp16 this will be slower
>
> Otherwise this sounds very much reasonable. I think llama sometimes suffers from this as well

Can you provide an example of an issue with LLaMA? I've tested fp16 on my end and didn't encounter any particularly strange situations.

@GeLee-Q force-pushed the fix-qwen2vl-fp16-inference-precision branch from 91c8d3d to 07cde33 on September 13, 2024.
@GeLee-Q (Contributor, Author) commented on Sep 13, 2024

@ArthurZucker @zucchini-nlp @ydshieh
Hello, I have completed the speed and accuracy tests. Is there anything else that needs to be done? I have already added the "run-slow" tag. Could you please trigger the action and then merge it in?

@ArthurZucker (Collaborator) left a comment
Sorry for being late to re-review!
Really nice numbers, thanks for the deep dive. It is completely acceptable IMO! Even more so because it is in Qwen2VLAttention (so not Qwen2VLSdpaAttention or the Flash Attention path, which would probably not suffer from this!)

Let's merge this!
