
[RoBERTa-based] Add support for sdpa #30510

Merged: 13 commits into huggingface:main from the sdpa-roberta branch on Aug 28, 2024
Conversation

@hackyon (Contributor) commented on Apr 26, 2024

What does this PR do?

Adding support for SDPA (scaled dot product attention) for RoBERTa-based models. More context in #28005 and #28802.

Models: camembert, roberta, xlm_roberta, xlm_roberta_xl.
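
Once this lands in a release, SDPA can be opted into the same way as for other supported architectures, via the `attn_implementation` argument of `from_pretrained`. A minimal usage sketch (checkpoint, dtype, and device are illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained(
    "roberta-base",
    attn_implementation="sdpa",  # explicit opt-in; "eager" keeps the original attention path
    torch_dtype=torch.float16,   # half precision is where the memory savings show up
).to("cuda")

inputs = tokenizer("SDPA matches eager outputs up to numerical precision.", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```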

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline, Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@fxmarty @ArthurZucker @amyeroberts

@hackyon (Contributor, author) commented on Apr 26, 2024

I ran slow tests for the affected models, and verified that they all pass except XLMRobertaXLModelTest::test_eager_matches_sdpa_generate(). I suspect it's just some numerical computation error, but I'll take a quick look to see if I can find anything.

I'll also try to run some of the perf benchmarks on RoBERTa over the weekend to see how they behave.

@hackyon (Contributor, author) commented on Apr 27, 2024

Preliminary perf numbers for RoBERTa (using "roberta-base" with AutoModel/AutoTokenizer).

Training

| num_training_steps | batch_size | seq_len | is cuda | Time per batch, eager (s) | Time per batch, SDPA (s) | Speedup (%) | Eager peak mem (MB) | SDPA peak mem (MB) | Mem saving (%) |
|---|---|---|---|---|---|---|---|---|---|
| 1000 | 1 | 256 | True | 0.018 | 0.015 | 24.411 | 731.752 | 736.471 | -0.641 |
| 1000 | 1 | 512 | True | 0.019 | 0.016 | 17.819 | 823.792 | 757.096 | 8.809 |
| 1000 | 2 | 256 | True | 0.020 | 0.016 | 29.890 | 760.504 | 757.096 | 0.450 |
| 1000 | 2 | 512 | True | 0.020 | 0.016 | 25.317 | 1283.793 | 907.688 | 41.435 |
| 1000 | 4 | 256 | True | 0.020 | 0.016 | 28.907 | 1094.001 | 907.289 | 20.579 |
| 1000 | 4 | 512 | True | 0.025 | 0.021 | 19.153 | 2205.299 | 1446.666 | 52.440 |

Inference

| num_batches | batch_size | seq_len | is cuda | is half | use mask | Per token latency, eager (ms) | Per token latency, SDPA (ms) | Speedup (%) | Mem eager (MB) | Mem BT (MB) | Mem saved (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 2 | 64 | True | True | True | 5.357 | 5.067 | 5.716 | 333.956 | 333.956 | 0 |
| 50 | 2 | 128 | True | True | True | 5.534 | 5.181 | 6.812 | 360.089 | 360.089 | 0 |
| 50 | 2 | 256 | True | True | True | 5.823 | 5.516 | 5.577 | 412.355 | 412.355 | 0 |
| 50 | 4 | 64 | True | True | True | 5.632 | 5.344 | 5.381 | 385.611 | 385.611 | 0 |
| 50 | 4 | 128 | True | True | True | 6.101 | 5.849 | 4.304 | 437.895 | 437.877 | 0.004 |
| 50 | 4 | 256 | True | True | True | 6.91 | 6.529 | 5.824 | 542.598 | 542.598 | 0 |
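
These numbers were produced with the usual SDPA benchmarking approach from the linked PRs; the sketch below shows the general shape of such a latency comparison, not the exact script (model name, sizes, and timing loop are illustrative):

```python
import time
import torch
from transformers import AutoModel, AutoTokenizer

def avg_forward_time(attn_implementation, batch_size=2, seq_len=256, n_iters=50):
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    model = AutoModel.from_pretrained(
        "roberta-base", attn_implementation=attn_implementation, torch_dtype=torch.float16
    ).to("cuda").eval()
    inputs = tokenizer(
        ["hello world"] * batch_size, padding="max_length", max_length=seq_len, return_tensors="pt"
    ).to("cuda")
    with torch.no_grad():
        for _ in range(5):          # warm-up iterations
            model(**inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(**inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_iters

eager = avg_forward_time("eager")
sdpa = avg_forward_time("sdpa")
print(f"eager {eager * 1e3:.2f} ms | sdpa {sdpa * 1e3:.2f} ms | speedup {eager / sdpa - 1:.1%}")
```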

@hackyon (Contributor, author) commented on Apr 27, 2024

It turns out XLMRobertaXLModelTest::test_eager_matches_sdpa_generate() doesn't always fail; it's flaky and depends on the random number generator. I think it comes down to numerical stability in the computation, which can produce slightly different results between the two implementations.

EDIT: I added a set_seed(0) to XLMRobertaXLModelTest::test_eager_matches_sdpa_generate(), and the flake seems to have gone away.
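
For context, `transformers.set_seed` pins all the RNGs so the eager and SDPA runs see identical weights and inputs. A minimal sketch of that kind of equivalence check (checkpoint and tolerances are illustrative, and a transformers version with SDPA support for these models is assumed):

```python
import torch
from transformers import AutoModel, set_seed

set_seed(0)  # fix Python/NumPy/torch RNGs so both runs are deterministic

model_eager = AutoModel.from_pretrained("roberta-base", attn_implementation="eager").eval()
model_sdpa = AutoModel.from_pretrained("roberta-base", attn_implementation="sdpa").eval()

input_ids = torch.randint(0, model_eager.config.vocab_size, (1, 16))
with torch.no_grad():
    out_eager = model_eager(input_ids).last_hidden_state
    out_sdpa = model_sdpa(input_ids).last_hidden_state

# Small numerical differences are expected, so compare within a tolerance rather than exactly.
torch.testing.assert_close(out_eager, out_sdpa, atol=1e-4, rtol=1e-4)
```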

@hackyon force-pushed the sdpa-roberta branch 2 times, most recently from c39f457 to 41537e3 on April 29, 2024 17:42
@hackyon (Contributor, author) commented on Apr 29, 2024

@fxmarty @ArthurZucker @amyeroberts

This is ready for review! With the exception of the changes to the test and check_support_list.py, all of the changes come from "Copied from" statements. Please let me know if you have any questions!

@hackyon marked this pull request as ready for review on April 29, 2024 17:50
@hackyon mentioned this pull request on May 8, 2024
@michaelshekasta commented
@hackyon, I'm curious whether implementing flash_attn is essential when writing an SDPA integration. I came across claims that flash_attn can offer up to a 4x efficiency boost (roughly) compared to native PyTorch. However, your numbers in #30510 suggest the actual improvement is less than 50%. Could you help shed some light on this apparent difference?

@hackyon (Contributor, author) commented on May 19, 2024

@michaelshekasta I believe the 4x improvement only applies to certain models, usually larger models with more computationally expensive attention computations.
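
For anyone who wants to see how much of the gap comes from the kernel itself: PyTorch's SDPA can dispatch to a FlashAttention kernel internally, so the backends can be compared directly. A rough sketch (requires a recent PyTorch with `torch.nn.attention`; whether the flash kernel is available depends on GPU, dtype, and head size):

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

# (batch, heads, seq_len, head_dim) in fp16, which the flash kernel supports.
q = torch.randn(2, 12, 512, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Restrict SDPA to the FlashAttention backend (errors if it is unavailable on this setup).
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out_flash = torch.nn.functional.scaled_dot_product_attention(q, k, v)

# Plain math backend, roughly equivalent to the eager attention path.
with sdpa_kernel(SDPBackend.MATH):
    out_math = torch.nn.functional.scaled_dot_product_attention(q, k, v)

print((out_flash - out_math).abs().max())
```

The kernel-level gap grows with sequence length and head count, which is part of why encoder-sized models like roberta-base see more modest end-to-end gains than large decoder models.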

@ArthurZucker (Collaborator) commented

@fxmarty can you have a look and ping me for the final review? 🤗

@nbroad1881 (Contributor) commented

@fxmarty, gentle bump

@fxmarty (Contributor) left a review:

LGTM, there is just one fix I think needs to be made to the is_causal param handling for cross-attention.

The test here https://github.com/huggingface/transformers/pull/30138/files#diff-681c988a50a31869d1756f2db71904939c639617569a5168d7b3167fe8da0b48 could also be extended for extra safety, but that's up to you.
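
For readers following along, the concern is roughly the guard below: `is_causal` may only be true for decoder self-attention with no explicit mask, and never for cross-attention. This is a hedged sketch with illustrative names, not the exact diff in this PR:

```python
import torch
import torch.nn.functional as F

def sdpa_attention(query, key, value, attention_mask=None,
                   is_decoder=False, is_cross_attention=False, dropout_p=0.0):
    # is_causal only applies to decoder self-attention over more than one query
    # position with no explicit mask; cross-attention must never be causal.
    is_causal = (
        is_decoder
        and not is_cross_attention
        and attention_mask is None
        and query.shape[2] > 1
    )
    return F.scaled_dot_product_attention(
        query, key, value,
        attn_mask=attention_mask,
        dropout_p=dropout_p,
        is_causal=is_causal,
    )

# Cross-attention: query and key/value lengths differ, and is_causal stays False.
q = torch.randn(1, 8, 4, 64)
kv = torch.randn(1, 8, 16, 64)
print(sdpa_attention(q, kv, kv, is_decoder=True, is_cross_attention=True).shape)
```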


@michaelshekasta commented
@fxmarty what's left? How can I help?

@michaelshekasta commented
@fxmarty you are amazing! If I can help, please write to me

@kiszk (Contributor) commented on Jul 10, 2024

@fxmarty Thank you very much. I would appreciate it if you could re-add gpt_neox for consistency. Or can I do it?
I am not sure why it was dropped.

https://app.circleci.com/pipelines/github/huggingface/transformers/97500/workflows/4facc164-8c3b-4ad0-9387-be9de636e686/jobs/1291191?utm_campaign=vcs-integration-link&utm_medium=referral&utm_source=github-checks-link&utm_content=summary

Traceback (most recent call last):
  File "/root/transformers/utils/check_support_list.py", line 97, in <module>
    check_sdpa_support_list()
  File "/root/transformers/utils/check_support_list.py", line 90, in check_sdpa_support_list
    raise ValueError(
ValueError: gpt_neox should be in listed in the SDPA documentation but is not. Please update the documentation.

Exited with code exit status 1
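
Roughly, that check cross-references the architectures that declare SDPA support in the code against the model list in the SDPA documentation page. A simplified sketch of the idea (the parsing, paths, and names here are illustrative, not the actual utils/check_support_list.py):

```python
import re

def check_sdpa_support_list(doc_text: str, supported_in_code: set) -> None:
    # Model names appear as markdown bullets in the SDPA section of the docs.
    documented = {m.lower() for m in re.findall(r"^\*\s*\[?([A-Za-z0-9_\-]+)", doc_text, flags=re.MULTILINE)}
    missing = sorted(m for m in supported_in_code if m.lower() not in documented)
    if missing:
        raise ValueError(
            f"{missing} should be listed in the SDPA documentation but are not. "
            "Please update the documentation."
        )

doc_excerpt = """
* [Camembert](...)
* [RoBERTa](...)
* [XLM-RoBERTa](...)
"""
try:
    # gpt_neox supports SDPA in the code but is missing from the excerpt above, so this raises.
    check_sdpa_support_list(doc_excerpt, {"camembert", "roberta", "xlm-roberta", "gpt_neox"})
except ValueError as err:
    print(err)
```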

@fxmarty (Contributor) commented on Jul 11, 2024

Thanks @kiszk, missed it when reordering the lists.

@fxmarty requested a review from amyeroberts on July 11, 2024 12:03
@fxmarty (Contributor) commented on Jul 12, 2024

gentle ping @ArthurZucker @amyeroberts

@fxmarty (Contributor) commented on Jul 16, 2024

@ArthurZucker @amyeroberts

@kiszk (Contributor) commented on Jul 22, 2024

@fxmarty You may want to resolve conflicts.

@fxmarty requested review from amyeroberts and ArthurZucker and removed the review requests for amyeroberts and ArthurZucker on July 22, 2024 09:44
@ArthurZucker (Collaborator) commented

Sorry, I did not have time before; I will try to get to it today or next week. It's a big PR with lots of changes, so I need to be extra careful!

@kiszk (Contributor) commented on Aug 13, 2024

@ArthurZucker would you have time for this review?

@hotchpotch commented
I've also experienced approximately 20% faster training with XLMRoberta using this PR on an RTX4090. I've been testing it for over a week now, and it's been working without any issues. I sincerely hope this can be merged.

@kiszk (Contributor) commented on Aug 27, 2024

@ArthurZucker Can we help with anything in reviewing this PR?

@ArthurZucker (Collaborator) left a review:

I kept pushing this back, it's on me! I'll solve whatever comes up with this merge.
Thanks @hackyon for your hard work, LGTM!

@ArthurZucker merged commit f1a385b into huggingface:main on Aug 28, 2024 (22 checks passed)
@michaelshekasta commented on Aug 28, 2024

@ArthurZucker when do you think this change will appear in the transformers package? The next version?

P.S. You are so amazing guys!

@ArthurZucker (Collaborator) commented

It should be there in at most 2 weeks! 🤗

@hotchpotch commented
I would like to thank everyone involved in this Pull Request from the bottom of my heart! 🎉

zucchini-nlp pushed a commit to zucchini-nlp/transformers that referenced this pull request Aug 30, 2024
* Adding SDPA support for RoBERTa-based models

* add not is_cross_attention

* fix copies

* fix test

* add minimal test for camembert and xlm_roberta as their test class does not inherit from ModelTesterMixin

* address some review comments

* use copied from

* style

* consistency

* fix lists

---------

Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
@michaelshekasta commented

> It should be there in at most 2 weeks! 🤗

@ArthurZucker A gentle reminder ;-)
