
[BC 4.37 -> 4.38] for Llama family, memory and speed #29753

Merged
merged 22 commits into from
Mar 20, 2024

Conversation

ArthurZucker
Collaborator

@ArthurZucker ArthurZucker commented Mar 20, 2024

What does this PR do?

Fixes the BC issues between the two versions in terms of memory consumption.
This fix is made a lot easier by all the tests, so thanks a lot @gante!

fixes #29412, fixes #29484 , fixes #29644, fixes #29651

@ArthurZucker ArthurZucker changed the title [BC 4.37 -> 4.38] [BC 4.37 -> 4.38] for Llama family, memory and speed Mar 20, 2024
Member

@gante gante left a comment


In general this looks good to me, although I'm not 100% sure about the causal_mask *= torch.arange(target_length, device=device) > cache_position[0] line in combination with assisted generation -- going to have a look

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker marked this pull request as ready for review March 20, 2024 12:52
@ArthurZucker
Collaborator Author

torchscript can be fixed by @fxmarty in a follow-up PR + patch IMO!

@fxmarty
Contributor

fxmarty commented Mar 20, 2024

@ArthurZucker that would be great, as long as the torchscript/fx tests pass and

optimum-cli export onnx --model fxmarty/tiny-llama-fast-tokenizer llama_onnx

does not break

Contributor

@younesbelkada younesbelkada left a comment


Thanks for the offline explanation! This shouldn't affect FA2, as we always return attention_mask without processing it for FA2 modules in _update_causal_mask. As you explained offline, causal_mask *= torch.arange(target_length, device=device) > cache_position[0] is used to mask out the cached hidden states.

@gante
Member

gante commented Mar 20, 2024

@ArthurZucker the line causal_mask *= torch.arange(target_length, device=device) > cache_position[0] should become causal_mask *= torch.arange(target_length, device=device) > cache_position.reshape(-1, 1)

In the previous version, in a situation with 17 cached tokens and 4 new assistant tokens, the causal mask would become

(Pdb) causal_mask
tensor([[[[     0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0., -65504., -65504., -65504., -65504.],
          [     0.,      0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0., -65504., -65504., -65504., -65504.],
          [     0.,      0.,      0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0., -65504., -65504., -65504., -65504.],
          [     0.,      0.,      0.,      0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0., -65504., -65504., -65504., -65504.],
          [     0.,      0.,      0.,      0.,      0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0., -65504., -65504., -65504., -65504.]]]],
       device='cuda:0', dtype=torch.float16)

i.e. not upper triangular. After the suggested change, it becomes

(Pdb) causal_mask
tensor([[[[     0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0., -65504., -65504., -65504., -65504.],
          [     0.,      0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0., -65504., -65504., -65504.],
          [     0.,      0.,      0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0.,     -0., -65504., -65504.],
          [     0.,      0.,      0.,      0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0.,     -0.,     -0., -65504.],
          [     0.,      0.,      0.,      0.,      0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.,
               -0.,     -0.,     -0.,     -0.,     -0.,     -0.,     -0.]]]],
       device='cuda:0', dtype=torch.float16)
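
To make the difference concrete, here is a minimal, self-contained sketch of the two masking rules (illustrative only, not the actual transformers code; the helper name and the 17-cached / 4-new setup simply follow the example above):

import torch

def build_mask(cache_position, target_length, per_row=True):
    # Additive causal mask for the new (query) tokens: `cache_position` holds
    # their absolute positions, `target_length` is cached + new length.
    new_len = cache_position.shape[0]
    min_value = torch.finfo(torch.float16).min
    mask = torch.full((new_len, target_length), min_value, dtype=torch.float16)
    mask = torch.triu(mask, diagonal=1)
    if per_row:
        # suggested fix: compare against each query position (broadcasts per row)
        mask = mask * (torch.arange(target_length) > cache_position.reshape(-1, 1))
    else:
        # behaviour being fixed: a single scalar applied to every row
        mask = mask * (torch.arange(target_length) > cache_position[0])
    return mask

# 17 tokens already in the cache, 4 assistant candidate tokens at positions 17..20
cache_position = torch.arange(17, 21)
buggy = build_mask(cache_position, 21, per_row=False)  # rectangular block of masked positions
fixed = build_mask(cache_position, 21, per_row=True)   # causal staircase
print((buggy != fixed).sum())  # rows 1..3 differ, matching the dumps above

With the scalar comparison every row keeps the same rectangular block of masked positions, so later candidate tokens cannot attend to the earlier ones; the per-row broadcast restores the causal staircase.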

@ArthurZucker
Collaborator Author

Any idea why our tests don't complain?

@gante
Member

gante commented Mar 20, 2024

Any idea why our tests don't complain?

I think we don't have hard correctness checks for assisted generation 🙈 only API checks

@ArthurZucker ArthurZucker merged commit ff84190 into main Mar 20, 2024
19 checks passed
@ArthurZucker ArthurZucker deleted the fix-causal-mask-dispatch branch March 20, 2024 22:47
ArthurZucker added a commit that referenced this pull request Mar 20, 2024
* attempt to fix

* the actual fix that works with compilation!

* this?

* temporary update

* nit?

* dispatcg to memory efficient?

* update both models that have static cache support

* fix copies fix compile

* make sure fix

* fix cohere and gemma

* fix beams?

* nit

* slipped through the cracks

* nit

* nits

* update

* fix-copies

* skip failing tests

* nits
@poedator
Contributor

Congratulations on this PR fixing many important things!
Unfortunately, it broke custom 4D mask support (and my heart).
Try RUN_SLOW=1 python -m pytest -v ./tests/test_modeling_utils.py::Mask4DTestHard::test_partial_stacked_causal_mask

@gante and I introduced this test recently in #29731 -- please make it part of the test suite for everything related to attention_mask and StaticCache.

Here is what happens (numbers based on test_partial_stacked_causal_mask, after line 2170):
In modeling_llama.py::_update_causal_mask(), the causal_mask has shape (1, 1, sequence_length, target_length), i.e. (1, 1, 9, 12) with 3 tokens in the cache.
If the custom 4D mask enters with the same shape, it triggers offset = 3 and then causes an error when it is copied over the causal mask.
If the custom 4D mask enters with shape (1, 1, 12, 12), offset stays at zero, but this again causes an error when it is copied over the causal mask, because the mask_slice and causal_mask shapes don't match.
There are also other changes which may affect this test. See the sketch below for the two shapes involved.
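
To make the two shapes concrete, here is a hedged sketch (plain placeholder tensors with made-up names, not the transformers masking code), using the numbers from the test:

import torch

# 3 tokens already cached, 9 new tokens, so target_length = 12
cached, new, target_length = 3, 9, 12

# Case 1: the custom 4D mask covers only the new tokens. Its shape matches the
# internally built causal_mask (1, 1, 9, 12), but an offset of 3 gets applied
# and the copy over the causal mask fails.
mask_new_only = torch.zeros(1, 1, new, target_length)          # (1, 1, 9, 12)

# Case 2: the custom 4D mask also covers the cached tokens. The offset stays at
# zero, but the resulting slice (1, 1, 12, 12) no longer matches the causal_mask
# shape (1, 1, 9, 12), so the copy fails again.
mask_full = torch.zeros(1, 1, cached + new, target_length)     # (1, 1, 12, 12)

print(mask_new_only.shape, mask_full.shape)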

I hesitate to offer a PR because it may break other things you are trying to do with this part of the code. But please make the test work. By the way, it may be OK to change the test instead, for instance passing the whole, bigger mask (including the cached items) by editing the corresponding line to mask_1b = mask_1.

Special note on StaticCache: I like this feature and want to use custom 4D masks with it. So far this combination is not tested. I'd be glad to contribute such a test once this issue is fixed; it would look like test_partial_stacked_causal_mask, only with StaticCache.

cc @ArthurZucker @gante

@ArthurZucker
Collaborator Author

I can try to fix it, and yes, I thought it would be tested automatically. It should be part of tokenization_common or at least LlamaIntegrationTests or something. cc @gante -- if you can take a look, I'll gladly review a PR.

@gante
Member

gante commented Mar 28, 2024

^ this PR should fix it 🤗

itazap pushed a commit that referenced this pull request May 14, 2024