reset ptl trainer when loading torch models #1124
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1124 +/- ##
==========================================
+ Coverage 93.65% 93.74% +0.08%
==========================================
Files 79 80 +1
Lines 8137 8147 +10
==========================================
+ Hits 7621 7637 +16
+ Misses 516 510 -6
# overwrite the feed forward block
def _ff_block(self, x):
    x = self.activation(x)
    return self.dropout2(x)
The glu variants have dropout built-in
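For context, here is a minimal sketch of what such a GLU-variant feedforward block looks like with its internal dropout (GEGLU-style gating; the class and attribute names are illustrative, not the actual darts glu_variants API):

import torch
import torch.nn as nn
from torch import Tensor

class GEGLUFeedForward(nn.Module):
    # Illustrative GLU-variant feedforward: the gating, projection and an
    # internal dropout all live inside one module (hypothetical names,
    # not the darts glu_variants API).
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.linear_gate = nn.Linear(d_model, d_ff)   # gate branch
        self.linear_value = nn.Linear(d_model, d_ff)  # value branch
        self.dropout = nn.Dropout(dropout)            # dropout built into the block
        self.linear_out = nn.Linear(d_ff, d_model)

    def forward(self, x: Tensor) -> Tensor:
        # GEGLU: GELU(x W_gate) gates x W_value, then dropout and project back to d_model
        gated = torch.nn.functional.gelu(self.linear_gate(x)) * self.linear_value(x)
        return self.linear_out(self.dropout(gated))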
Yes, but I think we should add the second dropout following the original implementation (see here):
# feed forward block
def _ff_block(self, x: Tensor) -> Tensor:
    x = self.linear2(self.dropout(self.activation(self.linear1(x))))
    return self.dropout2(x)
then per TransformerEncoderLayer:
def forward(...):
    x = self.norm1(x + self._sa_block(x, src_mask, src_key_padding_mask))
    x = self.norm2(x + self._ff_block(x))
    return x
This is equivalent to what they do in the Annotated Transformer with SublayerConnection, EncoderLayer and Position-wise Feed-Forward Networks.
Our current FeedForward class does (changed a bit so it's easier to compare):
def forward(self, x):
    x = self.linear2(self.dropout(self.activation(self.linear1(x))))
    return x
So when we overwrite the PyTorch TransformerEncoderLayer, the last dropout would be missing.
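For reference, the stock nn.TransformerEncoderLayer constructor already creates the feedforward submodules that _ff_block uses, so an overriding _ff_block can still call self.dropout2 to restore that trailing dropout. A quick self-contained check (values illustrative):

import torch.nn as nn

# The stock layer already owns linear1 / dropout / linear2 / dropout2,
# so a subclass's _ff_block override can keep applying self.dropout2.
layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, dim_feedforward=32, dropout=0.1)
print(type(layer.linear1), type(layer.dropout), type(layer.linear2), type(layer.dropout2))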
# use glu variant feedforward layers
self.activation = getattr(glu_variants, activation)(
    d_model=d_model, d_ff=dim_feedforward, dropout=dropout
)
encoder_layer = _CustomFeedForwardEncoderLayer(
The GLU variants are full feedforward layers, so it wasn't fully correct before to set activation = glu_variants.
For this to be fully correct, I think there should be a separate arg for the ff_block and leave activation alone.
This _CustomFeedForwardEncoderLayer makes more sense.
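A minimal sketch of what that separate-argument interface could look like (the class name, the ff_block argument and its wiring are hypothetical, not code from this PR):

import torch.nn as nn
from torch import Tensor

class _FeedForwardArgEncoderLayer(nn.TransformerEncoderLayer):
    # Hypothetical variant: take the full feedforward module as its own
    # argument instead of repurposing `activation`; the parent's unused
    # linear1/linear2 submodules are simply left alone in this sketch.
    def __init__(self, *args, ff_block: nn.Module, **kwargs):
        super().__init__(*args, **kwargs)
        self.ff = ff_block  # e.g. a GLU-variant feedforward with built-in dropout

    def _ff_block(self, x: Tensor) -> Tensor:
        # keep the trailing dropout from the parent class, per the discussion above
        return self.dropout2(self.ff(x))

# usage sketch:
# layer = _FeedForwardArgEncoderLayer(d_model=16, nhead=4, ff_block=GEGLUFeedForward(16, 32))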
    dropout=dropout,
    activation=self.activation,
)
encoder_norm = nn.LayerNorm(d_model)
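For context, these pieces are typically stacked with the standard nn.TransformerEncoder API; a self-contained sketch using a stock layer and illustrative sizes:

import torch
import torch.nn as nn

# Stack the encoder layers and apply the final LayerNorm via nn.TransformerEncoder
# (stock layer and sizes used here only to keep the sketch runnable).
d_model = 16
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, dim_feedforward=32, dropout=0.1)
encoder_norm = nn.LayerNorm(d_model)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2, norm=encoder_norm)
out = encoder(torch.randn(10, 2, d_model))  # (sequence, batch, d_model); batch_first=False by default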
I can add the new norms when this gets merged
LGTM (though I didn't check the correctness in detail)
Thanks!
Fixes #1116.
Summary