[Performance] Enable using pinned memory for transfers in SparseAdam optimizer. #3207
Description
In PyTorch 1.8, when `non_blocking=True` is set in a GPU->CPU copy, the output tensor is allocated in pinned memory (the cause of #2760). However, this now results in a performance regression between 0.6 and 0.7 when using torch>=1.8. This PR re-enables using pinned memory for those transfers, while synchronizing afterwards to ensure correctness. This cuts time in the optimizer dramatically (nearly 2x on my system, though it will vary from system to system):
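The pattern described above (async copy into pinned memory, then a synchronize before reading) can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code; the function name `copy_to_cpu_pinned` is hypothetical:

```python
import torch

def copy_to_cpu_pinned(src):
    """Illustrative sketch: asynchronously copy a (possibly GPU) tensor into a
    pinned CPU buffer, then synchronize so the result is safe to read."""
    if src.is_cuda:
        # Allocate the destination in pinned (page-locked) host memory so the
        # device-to-host copy can run asynchronously.
        dst = torch.empty(src.shape, dtype=src.dtype, pin_memory=True)
        dst.copy_(src, non_blocking=True)
        # Synchronize before the CPU reads `dst`; without this the CPU may
        # observe stale data (the correctness issue behind #2760).
        torch.cuda.synchronize()
        return dst
    # CPU fallback for systems without a GPU.
    return src.clone()
```

For example, `grad_cpu = copy_to_cpu_pinned(grad)` returns a CPU tensor that is guaranteed to hold the transferred values.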
Changes
Used CUDA events to allow specifying `non_blocking=True` when transferring from the gradient computation device (GPU) to the state storage device (CPU). This also rearranges some operations to reduce the amount of time the GPU needs to wait on the CPU to finish slicing.
Removed variables used for setting `non_blocking=False`, as this is the default.
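The event-based approach in the changes above might look roughly like this. This is a hedged sketch under my own naming (`start_async_fetch` and `wait_and_read` are hypothetical, not the PR's functions); the key idea is that a CUDA event lets the CPU wait only for one specific copy rather than the whole device, so slicing and other work can overlap with the transfer:

```python
import torch

def start_async_fetch(src):
    """Begin an async GPU->CPU copy into pinned memory; return the CPU buffer
    plus a CUDA event marking when the copy completes."""
    if not src.is_cuda:
        return src.clone(), None  # nothing to wait on when src is already on CPU
    dst = torch.empty(src.shape, dtype=src.dtype, pin_memory=True)
    dst.copy_(src, non_blocking=True)
    done = torch.cuda.Event()
    done.record()  # recorded on the current stream, after the copy is enqueued
    return dst, done

def wait_and_read(dst, done):
    """Block only until this particular copy has finished (unlike
    torch.cuda.synchronize(), which waits for all outstanding GPU work)."""
    if done is not None:
        done.synchronize()
    return dst
```

In use, the optimizer could call `start_async_fetch` early, do unrelated CPU-side work (e.g. slicing), and only call `wait_and_read` at the point where the transferred values are actually needed.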