Allow varying number of student model heads during distillation #3447
Patch description
When doing knowledge distillation, allow the number of attention heads in the student model to differ from the number in the teacher model, provided that:
(1) We are doing TinyBERT-style distillation, in which the teacher model's weights are not copied into the student.
(2) The coefficients on the loss terms for the MSE between the two models' attention matrices are both exactly 0, so we never need to minimize the difference between the attention matrices, which would be impossible when the two models have different numbers of heads. A sketch of a guard enforcing these conditions follows this list.
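As a rough illustration of the conditions above, the sketch below shows one way such a check could be written. This is not the actual code from the patch; the names `DistillationOpts` and `validate_head_counts` are hypothetical and only stand in for whatever options object and validation path the distillation agent really uses.

```python
# Hypothetical sketch of the head-count check described above; `DistillationOpts`
# and `validate_head_counts` are illustrative names, not ParlAI APIs.
from dataclasses import dataclass


@dataclass
class DistillationOpts:
    teacher_n_heads: int
    student_n_heads: int
    copy_teacher_weights: bool      # False for TinyBERT-style distillation
    encoder_attn_loss_coeff: float  # coefficient on the encoder attention MSE term
    decoder_attn_loss_coeff: float  # coefficient on the decoder attention MSE term


def validate_head_counts(opts: DistillationOpts) -> None:
    """Raise if the student's head count differs from the teacher's in an unsupported setup."""
    if opts.student_n_heads == opts.teacher_n_heads:
        return  # matching head counts are always allowed
    if opts.copy_teacher_weights:
        raise ValueError(
            'Student and teacher head counts may differ only for TinyBERT-style '
            'distillation, where teacher weights are not copied into the student.'
        )
    if opts.encoder_attn_loss_coeff != 0 or opts.decoder_attn_loss_coeff != 0:
        raise ValueError(
            'Attention-matrix MSE losses require matching head counts; set both '
            'attention loss coefficients to exactly 0 to use a different number of heads.'
        )
```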
Testing steps
Existing distillation CI checks