
Add OLMoE #32406

Merged · 33 commits · merged into huggingface:main on Sep 3, 2024

Conversation

@Muennighoff Muennighoff (Contributor) commented Aug 3, 2024

What does this PR do?

Before submitting

Who can review?

The model will be released in ~1 week - can we already review this so that we can merge right upon release?

@ArthurZucker ArthurZucker (Collaborator) left a comment

Hey! Feel free to ping me for a review once ready!

@Muennighoff Muennighoff (Contributor, Author) replied:

> Hey! Feel free to ping me for a review once ready!

It's ready! :) I will update the README & double-check the slow tests once the model is released if that's fine!

@Muennighoff Muennighoff (Contributor, Author) commented:

@ArthurZucker it would be great if we could get this reviewed soon 😇

@Muennighoff Muennighoff (Contributor, Author) commented:

We'll release the model on Tuesday; it would be amazing if we could have this approved by then!

@ArthurZucker ArthurZucker (Collaborator) commented:

Oops, sorry, reviewing now!

@ArthurZucker ArthurZucker (Collaborator) left a comment

Sorry for the late review!
Mostly missing "Copied from" comments; should be good to go otherwise.
You could go the extra mile and add a compile-compatible version of the MoE block!
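For context on what a compile-compatible MoE block means here: the usual sparse MoE forward gathers a variable-sized subset of tokens for each expert, and those data-dependent shapes interact poorly with torch.compile. Below is a minimal sketch of one static-shape alternative that runs every token through every expert and mixes the outputs with a top-k-masked routing matrix. The class and parameter names are illustrative and the expert MLP is simplified (a plain SiLU MLP rather than OLMoE's gated MLP), so this is a sketch of the idea, not the code added in this PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompileFriendlyMoeBlock(nn.Module):
    """Illustrative sketch: all shapes are static, so torch.compile sees no
    data-dependent control flow; the trade-off is extra compute spent on
    experts that end up with zero routing weight."""

    def __init__(self, hidden_size: int, intermediate_size: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(hidden_size, intermediate_size, bias=False),
                nn.SiLU(),
                nn.Linear(intermediate_size, hidden_size, bias=False),
            )
            for _ in range(num_experts)
        )

    def forward(self, hidden_states: torch.Tensor):
        batch, seq_len, hidden = hidden_states.shape
        flat = hidden_states.reshape(-1, hidden)                  # (tokens, hidden)
        router_logits = self.gate(flat)                           # (tokens, num_experts)
        routing_probs = F.softmax(router_logits, dim=-1)
        topk_vals, topk_idx = routing_probs.topk(self.top_k, dim=-1)
        # Dense (tokens, num_experts) weights that are zero outside the top-k.
        weights = torch.zeros_like(routing_probs).scatter(-1, topk_idx, topk_vals)
        # Every expert processes every token; outputs are weighted and summed.
        expert_out = torch.stack([expert(flat) for expert in self.experts], dim=1)
        mixed = (weights.unsqueeze(-1) * expert_out).sum(dim=1)   # (tokens, hidden)
        return mixed.reshape(batch, seq_len, hidden), router_logits
```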

docs/source/en/model_doc/olmoe.md (review thread, resolved)
src/transformers/models/olmoe/configuration_olmoe.py (review thread, resolved)
@@ -0,0 +1,281 @@
# Licensed under the Apache License, Version 2.0 (the "License");
Collaborator:

date is missing here!

Contributor (Author):

Is that mandatory? Maybe the entire header can just be removed? It seems redundant to have this header in every file when the license is clear from the repo.

Collaborator:

😄 Yeah, it is redundant, but AFAIK the license header is still expected; it's a nit though, don't worry.

src/transformers/models/olmoe/modeling_olmoe.py (review thread, resolved)
return final_hidden_states, router_logits


class OlmoeDecoderLayer(nn.Module):
Collaborator:

Same here.

Contributor (Author):

It's different from the others because we have no shared expert.
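For context on that reply, here is a rough sketch (illustrative names, not the actual transformers code) of the shared-expert pattern that some other MoE blocks use, which OLMoE's block omits; this is why the block cannot simply carry a "Copied from" of those implementations:

```python
import torch


def moe_with_shared_expert(hidden_states, routed_moe, shared_expert, shared_gate):
    """Shared-expert pattern found in some MoE models (e.g. Qwen2-MoE-style blocks):
    besides the routed top-k experts, every token also passes through one shared
    expert whose output is sigmoid-gated and added back in. OLMoE returns only the
    routed output (the first line below)."""
    routed_out, router_logits = routed_moe(hidden_states)  # top-k routed experts only; OLMoE stops here
    shared_out = torch.sigmoid(shared_gate(hidden_states)) * shared_expert(hidden_states)
    return routed_out + shared_out, router_logits
```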

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker (Collaborator) commented:

Could you make sure you've rebased and the CIs are green? I'll review again after that!

@ArthurZucker ArthurZucker (Collaborator) left a comment

LGTM, let's rebase and fix the CIs 🤗 (there is also a TODO left in the .md).

@Muennighoff Muennighoff (Contributor, Author) commented:

Great, all passing! 🙌

@ArthurZucker ArthurZucker merged commit ecd61c6 into huggingface:main Sep 3, 2024
23 checks passed
@ArthurZucker ArthurZucker (Collaborator) commented:

Thanks for your contribution! 🔥

@kalomaze kalomaze commented Sep 5, 2024

Is it a problem on my end if a higher native batch size during training results in higher losses? (With DeepSpeed)

[Screenshot: training loss curves comparing the two runs]

The purple line is a higher native batch size (FFT); the orange line uses gradient accumulation instead of a higher native batch size.

I'm guessing expert routing parallelism is maybe not properly handled for bs>1?

@ArthurZucker ArthurZucker (Collaborator) commented:

cc @Muennighoff, who would have a lot more clues than me 😉

@Muennighoff Muennighoff (Contributor, Author) replied:

> Is it a problem on my end if a higher native batch size during training results in higher losses? (With DeepSpeed)
>
> [Screenshot: training loss curves comparing the two runs]
>
> The purple line is a higher native batch size (FFT); the orange line uses gradient accumulation instead of a higher native batch size.
>
> I'm guessing expert routing parallelism is maybe not properly handled for bs>1?

Hm, I'm not sure about this. Are other models the same for you in both scenarios? How about other MoEs?
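For context on this kind of comparison: one batch-size-dependent part of MoE training is the load-balancing auxiliary loss, which is computed from routing statistics over the tokens seen in a single forward pass, so one large native batch and the same tokens split across gradient-accumulation steps do not yield exactly the same auxiliary-loss value. Whether that explains the curves above is not established in this thread; the sketch below (Switch-Transformers-style formulation, illustrative names) only shows why the value depends on how tokens are grouped into micro-batches.

```python
import torch
import torch.nn.functional as F


def load_balancing_loss(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Switch-Transformers-style auxiliary loss over one micro-batch.

    router_logits: (num_tokens, num_experts) for the tokens in this forward pass.
    The expert-usage fractions below are means over exactly these tokens, so
    computing the loss once on a big native batch is not identical to averaging
    it over several gradient-accumulation micro-batches of the same tokens.
    """
    num_experts = router_logits.shape[-1]
    probs = router_logits.softmax(dim=-1)                               # (tokens, experts)
    _, selected = probs.topk(top_k, dim=-1)                             # (tokens, top_k)
    expert_mask = F.one_hot(selected, num_experts).amax(dim=1).float()  # 1 if expert is in a token's top-k
    tokens_per_expert = expert_mask.mean(dim=0)                         # fraction of tokens routed to each expert
    router_prob_per_expert = probs.mean(dim=0)                          # mean router probability per expert
    return num_experts * torch.sum(tokens_per_expert * router_prob_per_expert)
```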
