
Fix attn mask ignore logic in training-time trace #32613

Open · wants to merge 15 commits into base: main

Conversation

@zhenglongjiepheonix (Contributor) commented Aug 11, 2024

This PR fixes a scenario where we want to use a dynamo trace in training mode. The current attention-mask ignore logic contains a data-dependent branch condition, torch.all(attn_mask == 1), which causes graph breaks and disables full-graph tracing. The solution here is to disable the mask-ignore logic whenever we are in tracing mode, regardless of whether we are in the training or inference phase.

This enables compilation for training (forward + backward) like this:

import torch
from transformers import LlamaForCausalLM

# `config` is a LlamaConfig; `inputs` is a tokenized batch that includes `labels`.
model = LlamaForCausalLM(config).cuda()
model = torch.compile(model, fullgraph=True)
loss = model(**inputs)[0]  # the loss is the first output when labels are passed
loss.backward()


@ArthurZucker (Collaborator) left a comment

Sounds good! Similarly to

def test_torch_compile(self):

can you add a training compile test? 🤗

@zhenglongjiepheonix (Contributor, Author) commented Aug 12, 2024

I added a simple gradient-match test in training mode, @ArthurZucker. In my local tests, the steady-state backward latency stays roughly the same as eager mode (whether or not CUDA graphs are enabled), but the forward pass still benefits during training, and a training-time trace does help in my case, where I need to do graph analysis in training mode.
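
For reference, a minimal sketch of what such a gradient-match test could look like (hypothetical code with an assumed tiny LlamaConfig, not the exact test added in this PR):

import copy
import torch
from transformers import LlamaConfig, LlamaForCausalLM

def test_compile_training_gradients_match():
    torch.manual_seed(0)
    config = LlamaConfig(
        vocab_size=128, hidden_size=64, intermediate_size=128,
        num_hidden_layers=2, num_attention_heads=4, num_key_value_heads=4,
    )
    eager = LlamaForCausalLM(config)
    compiled = torch.compile(copy.deepcopy(eager), fullgraph=True)

    input_ids = torch.randint(1, config.vocab_size, (2, 16))
    attention_mask = torch.ones_like(input_ids)

    # One forward + backward pass in each mode, reusing the inputs as labels.
    eager(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids).loss.backward()
    compiled(input_ids=input_ids, attention_mask=attention_mask, labels=input_ids).loss.backward()

    # Gradients from the compiled step should closely match eager mode.
    for p_eager, p_compiled in zip(eager.parameters(), compiled.parameters()):
        torch.testing.assert_close(p_eager.grad, p_compiled.grad, rtol=1e-4, atol=1e-4)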

@@ -276,7 +276,7 @@ def _ignore_causal_mask_sdpa(
     elif sliding_window is None or key_value_length < sliding_window:
         if len(attention_mask.shape) == 4:
             return False
-        elif (is_training or not is_tracing) and torch.all(attention_mask == 1):
+        elif not is_tracing and torch.all(attention_mask == 1):
@zhenglongjiepheonix (Contributor, Author) commented on the changed line:

It's actually a contradictory branch: is_tracing and torch.all(attention_mask == 1) can never hold at the same time, since the data-dependent check cannot be evaluated while tracing.
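
A minimal sketch (my own illustration, not code from this PR) of why such a data-dependent condition cannot be evaluated under full-graph tracing: converting the traced tensor produced by torch.all(mask == 1) into a Python bool forces a graph break, and with fullgraph=True dynamo raises instead of falling back to eager.

import torch

def forward(x, mask):
    # Branching on a tensor value is data-dependent control flow.
    if torch.all(mask == 1):
        return x
    return x * mask

compiled = torch.compile(forward, fullgraph=True)
x, mask = torch.randn(2, 4), torch.ones(2, 4)
try:
    compiled(x, mask)
except Exception as err:  # dynamo refuses to break the graph under fullgraph=True
    print(type(err).__name__)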

@ArthurZucker (Collaborator) left a comment

Nice, can you try to run the slow tests for this! 🤗

"input_ids": torch.randint(
low=1, high=model.config.vocab_size, size=(batch_size, seq_len), device=torch_device
),
"attention_mask": torch.ones((batch_size, seq_len), dtype=torch.int64, device=torch_device),
@ArthurZucker (Collaborator) commented on this snippet:

An attention mask that is all ones is more prone to skipping some branches; I would try with a mix of ones and zeros as well!
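
A minimal sketch (an assumed variant, not the exact test input) of a padded mask that mixes ones and zeros, so the all-ones shortcut can never be taken:

import torch

batch_size, seq_len = 2, 16
attention_mask = torch.ones((batch_size, seq_len), dtype=torch.int64)
attention_mask[0, : seq_len // 2] = 0  # left-pad the first sequence with zeros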

@ArthurZucker (Collaborator) commented

The failure is most probably not related to you, but the slow test is bad / badly designed: it should either not use accelerate, or not use parallelism, so that we are comparing apples to apples.

@ArthurZucker (Collaborator) commented

It would be nice if you could update the tests 🙏🏻

@zhenglongjiepheonix (Contributor, Author) commented Sep 7, 2024

Accelerate will kick in if there is not enough memory. I think the best solution is just to use the current torch device rather than specifying device_map='sequential', although that may cause GPU OOM. Pytest also seems to have trouble avoiding GPU memory fragmentation (see https://discuss.pytorch.org/t/torch-pytest-leads-to-memory-fragmentation-how-to-do-proper-integration-testing-of-a-lot-of-torch-models/201231); I have run into similar issues where a test passes when run alone but fails with OOM when all the tests run together.

According to the OOM failure information, the GPU memory that is allocated but unused is only about 7MB, so fragmentation is not severe enough to cause the failure; we simply need GPUs with more memory.
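
As an aside, a minimal sketch (hypothetical, not part of this PR) of a pytest fixture that frees cached CUDA memory between tests, which can help when many model tests share one GPU session:

import gc
import pytest
import torch

@pytest.fixture(autouse=True)
def release_cuda_memory():
    # Run the test, then drop Python references and release unused cached GPU memory.
    yield
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()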

@zhenglongjiepheonix (Contributor, Author) commented Sep 7, 2024

> The failure is most probably not related to you, but the slow test is bad / badly designed: it should either not use accelerate, or not use parallelism, so that we are comparing apples to apples.

Yes, we should never use accelerate, but in order to make the tests pass robustly we might need GPU runners with more memory; a T4 will definitely not be enough, because simply loading a single model like llama-7b, without doing anything else, already requires about 13GB.
