
[Performance] Enable using pinned memory for transfers in SparseAdam optimizer. #3207

Merged: 3 commits into dmlc:master on Aug 2, 2021

Conversation

nv-dlasalle (Collaborator)

Description

In PyTorch 1.8, when non_blocking=True is set on a GPU->CPU copy, the output tensor is allocated in pinned memory (the cause of #2760). The workaround for that issue, forcing non_blocking=False, now results in a performance regression between DGL 0.6 and 0.7 when using torch>=1.8.
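For context, a minimal sketch of the torch>=1.8 behavior described above (assuming a CUDA-capable build; not DGL code):

```python
import torch

src = torch.randn(1024, device='cuda')

# With torch>=1.8, a non-blocking GPU->CPU copy allocates its output in
# pinned (page-locked) host memory and may return before the copy has
# actually completed.
dst = src.to('cpu', non_blocking=True)
print(dst.is_pinned())  # True on torch>=1.8, per the description above

# Reading dst before the copy finishes is a race; synchronizing the
# stream (or an event, as this PR does) makes it safe to read.
torch.cuda.current_stream().synchronize()
```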

This PR re-enables using pinned memory for those transfers, while synchronizing afterwards to ensure correctness. This cuts time in the optimizer dramatically (nearly 2x on my system, though it will vary from system to system):
[figure: opt_time, optimizer timing comparison]

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change,
    or have been fixed to be compatible with this change
  • Related issue is referenced in this PR

Changes

Used CUDA events to allow specifying non_blocking=True when transferring from the gradient computation device (GPU) to the state storage device (CPU), as sketched below.
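A minimal sketch of that pattern (assuming a CUDA-capable torch>=1.8; async_copy_to_cpu is a hypothetical helper for illustration, not the actual DGL code):

```python
import torch

def async_copy_to_cpu(gpu_tensor):
    # Start the non-blocking device->host copy; with torch>=1.8 the
    # destination is allocated in pinned memory.
    cpu_tensor = gpu_tensor.to('cpu', non_blocking=True)
    # Record an event on the current stream to mark when the copy is done.
    event = torch.cuda.Event()
    event.record()
    return cpu_tensor, event

grad_gpu = torch.randn(1_000_000, device='cuda')
grad_cpu, done = async_copy_to_cpu(grad_gpu)
# ... other work can proceed here while the copy is in flight ...
done.synchronize()        # block only once the values are needed
total = grad_cpu.sum()    # safe: the transfer has completed
```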

This also rearranges some operations to reduce the amount of time the GPU needs to wait on the CPU to finish slicing; a sketch follows.
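Schematically, the reordering might look like this (cpu_slicing_work is a hypothetical stand-in for the optimizer's CPU-side indexing, not the actual DGL code):

```python
import torch

def cpu_slicing_work(n):
    # Hypothetical stand-in for the CPU-side slicing the optimizer does.
    return torch.randperm(n)[: n // 2]

grad_gpu = torch.randn(10_000, device='cuda')

# Launch the asynchronous GPU->CPU transfer first ...
grad_cpu = grad_gpu.to('cpu', non_blocking=True)
copied = torch.cuda.Event()
copied.record()

# ... so the CPU-side slicing overlaps with the copy instead of the
# GPU sitting idle while the CPU finishes slicing.
idx = cpu_slicing_work(grad_cpu.shape[0])

copied.synchronize()      # wait only at the point of use
sliced = grad_cpu[idx]    # safe to read now
```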

Removed variables used for setting non_blocking=False, as this is the default.

@dgl-bot (Collaborator) commented Jul 30, 2021

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@classicsong (Contributor) left a comment:

LGTM

@classicsong merged commit 6001001 into dmlc:master on Aug 2, 2021
BarclayII pushed a commit that referenced this pull request Aug 26, 2021
Co-authored-by: xiang song(charlie.song) <classicxsong@gmail.com>