
[Model] Support Multi-GPU for Transformer model #356

Merged
merged 16 commits into dmlc:master from mgpu-transformer on Feb 12, 2019

Conversation

@lingfanyu (Collaborator) commented Jan 14, 2019

Description

This PR uses multi-processing to parallelize training of the Transformer model. With batch size 128, 4 GPUs give a 3.3x speedup; with batch size 4096, 4 GPUs give a 3.8x speedup.
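
For readers unfamiliar with the setup, a minimal sketch of one-process-per-GPU training with torch.multiprocessing and torch.distributed is shown below. This is not the PR's actual code: TransformerModel, make_dataloader, the address/port, and the hyperparameters are placeholders, and the PR may synchronize gradients manually rather than through DistributedDataParallel.

```python
# Minimal sketch of one-process-per-GPU training (placeholders, not this PR's code).
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    torch.cuda.set_device(rank)                      # each process owns one GPU
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:23456",
                            rank=rank, world_size=world_size)
    model = TransformerModel().cuda(rank)            # hypothetical model class
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for batch in make_dataloader(rank, world_size):  # hypothetical per-rank data shard
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()                              # gradients get averaged across ranks
        optimizer.step()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    mp.spawn(run, args=(world_size,), nprocs=world_size)
```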

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with this change

@lingfanyu (Collaborator, Author) commented:

@yzh119 Let's work together on this PR. There currently seems to be a bug: if you use the grad_accum argument to accumulate gradients over multiple batches, the loss drops very slowly.
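
For context, gradient accumulation in PyTorch typically looks like the sketch below. Common sources of slow loss decrease are scaling mistakes, e.g. not dividing the loss by the number of accumulated micro-batches, or advancing the learning-rate schedule once per micro-batch instead of once per optimizer step. This is a generic illustration, not this PR's code; model, optimizer, and data_iter are assumed to exist.

```python
# Generic gradient-accumulation loop (illustration only, not this PR's code).
grad_accum = 4  # micro-batches accumulated per optimizer step

optimizer.zero_grad()
for step, batch in enumerate(data_iter, start=1):
    loss = model(batch)
    (loss / grad_accum).backward()   # scale so summed grads match a big-batch average
    if step % grad_accum == 0:
        optimizer.step()             # the LR schedule should advance here,
        optimizer.zero_grad()        # once per optimizer step, not per micro-batch
```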

Closing for now; will reopen when we finish.

@lingfanyu lingfanyu closed this Jan 14, 2019
@yzh119 (Member) commented Jan 15, 2019

torch.distributed hangs at initialization if we do not set the first gpu_id to 0.
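
A sketch of the usual single-machine pattern: bind each process to its own device before initializing the process group, so the first process uses GPU 0. This is not the PR's code; gpu_ids is a hypothetical rank-to-device mapping.

```python
# Sketch of per-rank device binding before process-group init (not the PR's code).
import torch
import torch.distributed as dist

def init_worker(rank, world_size, gpu_ids):
    # gpu_ids is a hypothetical rank -> device mapping; with the default mapping
    # gpu_ids[rank] == rank, so the first process binds to GPU 0.
    torch.cuda.set_device(gpu_ids[rank])
    dist.init_process_group("nccl", init_method="tcp://127.0.0.1:23456",
                            rank=rank, world_size=world_size)
```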

@yzh119 (Member) commented Jan 15, 2019

I have no write access to your repo, so I put the updated version here.
The non-convergence issue may result from incorrect schedule settings; I've fixed the problem, and training now converges in the 4-GPU setting. (But for a small dataset like Multi30k, I think shuffling first and then dividing is the better choice.)
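
The "shuffle first, then divide" idea can be read as: reshuffle the whole dataset each epoch with a seed shared by all processes, then give each rank a disjoint slice, instead of fixing a static per-GPU split up front. A rough sketch under that reading (epoch_indices is a hypothetical helper, not the PR's implementation):

```python
# Rough sketch of "shuffle first, then divide" (hypothetical helper, not the PR's code).
import torch

def epoch_indices(num_examples, rank, world_size, epoch):
    g = torch.Generator().manual_seed(epoch)          # same seed on every rank
    perm = torch.randperm(num_examples, generator=g)  # identical permutation everywhere
    return perm[rank::world_size].tolist()            # disjoint slice per rank
```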

@lingfanyu (Collaborator, Author) commented:

@yzh119 I just added you as a collaborator. Can you move your changes here?

@lingfanyu lingfanyu reopened this Jan 15, 2019
@yzh119 (Member) commented Jan 19, 2019

With the sparse_softmax kernel, the non-convergence problem no longer appears.
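
For readers who have not seen it, the kind of per-destination (sparse/edge) softmax such a kernel computes can be written in plain PyTorch roughly as below; the built-in kernel fuses this and keeps it numerically stable. This is only an illustration of the operation, not the kernel's implementation, and it assumes a PyTorch version with Tensor.scatter_reduce (1.12+).

```python
# Illustration of a per-destination-node softmax over edge scores (not the kernel itself).
import torch

def edge_softmax(scores, dst, num_nodes):
    # scores: (E,) attention logits; dst: (E,) destination node id of each edge.
    node_max = torch.full((num_nodes,), float("-inf"),
                          dtype=scores.dtype, device=scores.device)
    node_max = node_max.scatter_reduce(0, dst, scores, reduce="amax")  # per-node max
    exp = (scores - node_max[dst]).exp()                               # stable exponent
    denom = torch.zeros(num_nodes, dtype=scores.dtype,
                        device=scores.device).index_add(0, dst, exp)   # per-node sum
    return exp / denom[dst]                                            # normalize per node
```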

@lingfanyu (Collaborator, Author) commented:

@yzh119 I think I have finished this PR and fixed the synchronization bug; the 4-GPU run now behaves like the single-GPU run.

Can you review again so we can merge?
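
For reference, the property that 4 GPUs "behave like a single GPU" generally comes from averaging gradients across processes before every optimizer step. A minimal manual version of that synchronization is sketched below; DistributedDataParallel performs the same reduction automatically, and this is not necessarily how the PR does it.

```python
# Minimal sketch of gradient averaging across ranks (not necessarily this PR's code).
import torch.distributed as dist

def average_gradients(model, world_size):
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # sum grads from all ranks
            p.grad /= world_size                           # turn the sum into a mean
```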

@lingfanyu lingfanyu merged commit 29dd22e into dmlc:master Feb 12, 2019
@lingfanyu lingfanyu deleted the mgpu-transformer branch February 12, 2019 01:17
@jermainewang jermainewang mentioned this pull request Feb 18, 2019