[Performance] Enable using pinned memory for transfers in SparseAdam optimizer. #3207
Description
In PyTorch 1.8, when `non_blocking=True` is set in a GPU->CPU copy, the output tensor is allocated in pinned memory (the cause of #2760). However, this now results in a performance regression between 0.6 and 0.7 when using torch>=1.8. This PR re-enables using pinned memory for those transfers, while synchronizing afterwards to ensure correctness. This cuts time in the optimizer dramatically (nearly 2x on my system, though it will vary from system to system):
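The pattern described above (async copy into pinned memory, then a synchronize before reading) can be sketched roughly as follows. This is an illustrative sketch, not the PR's actual code; the function name `copy_to_cpu_pinned` is hypothetical:

```python
import torch

def copy_to_cpu_pinned(src):
    """Illustrative sketch: asynchronously copy a (possibly GPU) tensor into a
    pinned CPU buffer, then synchronize so the result is safe to read."""
    if src.is_cuda:
        # Allocate the destination in pinned (page-locked) host memory so the
        # device-to-host copy can run asynchronously.
        dst = torch.empty(src.shape, dtype=src.dtype, pin_memory=True)
        dst.copy_(src, non_blocking=True)
        # Synchronize before the CPU reads `dst`; without this the CPU may
        # observe stale data (the correctness issue behind #2760).
        torch.cuda.synchronize()
        return dst
    # CPU fallback for systems without a GPU.
    return src.clone()
```

For example, `grad_cpu = copy_to_cpu_pinned(grad)` returns a CPU tensor that is guaranteed to hold the transferred values.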
Changes
Used CUDA events to allow specifying `non_blocking=True` when transferring from the gradient computation device (GPU) to the state storage device (CPU). This also rearranges some operations to reduce the amount of time the GPU needs to wait on the CPU to finish slicing.
Removed variables used for setting `non_blocking=False`, as this is the default.
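The event-based approach in the changes above might look roughly like this. This is a hedged sketch under my own naming (`start_async_fetch` and `wait_and_read` are hypothetical, not the PR's functions); the key idea is that a CUDA event lets the CPU wait only for one specific copy rather than the whole device, so slicing and other work can overlap with the transfer:

```python
import torch

def start_async_fetch(src):
    """Begin an async GPU->CPU copy into pinned memory; return the CPU buffer
    plus a CUDA event marking when the copy completes."""
    if not src.is_cuda:
        return src.clone(), None  # nothing to wait on when src is already on CPU
    dst = torch.empty(src.shape, dtype=src.dtype, pin_memory=True)
    dst.copy_(src, non_blocking=True)
    done = torch.cuda.Event()
    done.record()  # recorded on the current stream, after the copy is enqueued
    return dst, done

def wait_and_read(dst, done):
    """Block only until this particular copy has finished (unlike
    torch.cuda.synchronize(), which waits for all outstanding GPU work)."""
    if done is not None:
        done.synchronize()
    return dst
```

In use, the optimizer could call `start_async_fetch` early, do unrelated CPU-side work (e.g. slicing), and only call `wait_and_read` at the point where the transferred values are actually needed.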