This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Allow varying number of student model heads during distillation #3447

Merged
merged 2 commits into master from vary-attn-heads
Feb 10, 2021

Conversation

EricMichaelSmith
Contributor

Patch description
When doing knowledge distillation, allow the number of attention heads in the student model to differ from that of the teacher model, provided that:
(1) We are doing TinyBERT-style distillation, in which the teacher model's weights are not copied into the student
(2) The coefficients on the loss terms for the MSE between the two models' attention matrices are both exactly 0, meaning that we never have to minimize the difference between the attention matrices, which would be impossible when the two models have different numbers of attention heads (a sketch of this check appears below)
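A minimal sketch of the kind of validation these two conditions imply. This is illustrative only: the function name `validate_head_counts` and the option keys `copy_teacher_weights`, `encoder_attention_loss_coeff`, and `decoder_attention_loss_coeff` are assumptions, not the actual ParlAI identifiers.

```python
def validate_head_counts(opt: dict, teacher_n_heads: int, student_n_heads: int) -> None:
    """Allow a student/teacher head-count mismatch only when it is safe."""
    if student_n_heads == teacher_n_heads:
        return  # Nothing to check: head counts match.

    # (1) Copying teacher weights into the student requires matching head counts,
    # so a mismatch is only allowed for TinyBERT-style (non-copying) distillation.
    if opt.get('copy_teacher_weights', False):
        raise ValueError(
            'Student and teacher must have the same number of attention heads '
            'when teacher weights are copied into the student.'
        )

    # (2) Attention-matrix MSE losses compare per-head attention maps, which is
    # undefined when head counts differ, so their coefficients must both be 0.
    enc_coeff = opt.get('encoder_attention_loss_coeff', 0)
    dec_coeff = opt.get('decoder_attention_loss_coeff', 0)
    if enc_coeff != 0 or dec_coeff != 0:
        raise ValueError(
            'Attention-matrix loss coefficients must be 0 when the student and '
            'teacher have different numbers of attention heads.'
        )
```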

Testing steps
Existing distillation CI checks

@EricMichaelSmith EricMichaelSmith merged commit 3786630 into master Feb 10, 2021
@EricMichaelSmith EricMichaelSmith deleted the vary-attn-heads branch February 10, 2021 17:31
stephenroller pushed a commit that referenced this pull request Feb 11, 2021
* Allow number of attention heads to vary

* Put checks behind shared