
[GraphBolt][CUDA] _convert_to_sampled_subgraph has too many GPU synchronizations. #6887

Closed
Tracked by #6910
mfbalin opened this issue Jan 2, 2024 · 3 comments · Fixed by #7223
Labels
Work Item (Work items tracked in project tracker)

Comments

mfbalin (Collaborator) commented Jan 2, 2024

🔨Work Item

IMPORTANT:

  • This template is only for the dev team to track project progress. For feature requests or bug reports, please use the corresponding issue templates.
  • DO NOT create a new work item if the purpose is to fix an existing issue or feature request. We will directly use the issue in the project tracker.

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

As can be seen in the picture below, almost half of the GPU sampling runtime is spent in _convert_to_sampled_subgraph, which takes about 3 ms every time it is called in the hetero sampling example. With an optimized custom implementation, this should take around 0.1 ms. The main culprit is the use of torch.nonzero, which causes a CPU-GPU synchronization to read the size of the nonzero ids each time it is called.
[profiler screenshot]
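
A minimal, hypothetical repro of the synchronization described above (not GraphBolt code): torch.nonzero on a CUDA tensor blocks the host until the device reports how many elements are nonzero. PyTorch's sync debug mode can surface this implicit synchronization:

```python
import torch

# Hypothetical illustration, not GraphBolt code: torch.nonzero has to copy
# the count of nonzero elements back to the CPU to allocate its output, so
# the host blocks until all queued GPU work up to that point has finished.
if torch.cuda.is_available():
    torch.cuda.set_sync_debug_mode("warn")   # warn on implicit CPU-GPU syncs
    mask = torch.rand(1_000_000, device="cuda") > 0.5
    ids = torch.nonzero(mask)                # a sync warning is emitted here
    torch.cuda.set_sync_debug_mode("default")
```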

mfbalin added the Work Item label on Jan 2, 2024
mfbalin self-assigned this on Jan 2, 2024
mfbalin (Collaborator, Author) commented Jan 2, 2024

@Rhett-Ying @yxy235

Rhett-Ying (Collaborator) commented:

> The main culprit is the use of torch.nonzero, which causes a CPU-GPU synchronization to read the size of the nonzero ids each time it is called.

@mfbalin Is the direct cause of the sync that the tensor torch.nonzero operates on is not on the GPU, while the sampled_csc is targeted for the GPU?

mfbalin (Collaborator, Author) commented Feb 26, 2024

> The main culprit is the use of torch.nonzero, which causes a CPU-GPU synchronization to read the size of the nonzero ids each time it is called.

> @mfbalin Is the direct cause of the sync that the tensor torch.nonzero operates on is not on the GPU, while the sampled_csc is targeted for the GPU?

No, the operation happens on the GPU. The reason is that nonzero checks which tensor elements are nonzero, but it is unknown in advance how many will be nonzero. This count is needed on the CPU side to allocate the output tensor, so there is a GPU-CPU synchronization.

Calling nonzero once for each etype in the graph makes this issue much worse, since every call adds a fixed but significant overhead.
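
A rough sketch of one way to cut this down (an assumption on my part, not necessarily what #7223 does): sort the sampled edges by edge type once, compute all per-type boundaries with a single searchsorted call, and bring them to the CPU in one transfer, replacing one nonzero (and one sync) per etype with a single synchronization:

```python
import torch

# Hypothetical sketch, not the actual fix from #7223: replace the per-etype
# torch.nonzero calls with one sort plus one searchsorted, so only a single
# device-to-host copy is needed for all edge types combined.
def split_edges_by_etype(etype_ids: torch.Tensor, num_etypes: int):
    # Sort edge indices by their type id; everything stays on the GPU.
    order = torch.argsort(etype_ids)
    sorted_types = etype_ids[order]
    # Start/end offsets of every contiguous per-type range in one kernel.
    boundaries = torch.searchsorted(
        sorted_types, torch.arange(num_etypes + 1, device=etype_ids.device)
    )
    offsets = boundaries.tolist()  # the single CPU-GPU synchronization
    return [order[offsets[t] : offsets[t + 1]] for t in range(num_etypes)]
```

Once the offsets are on the host, each per-type slice is a plain view into the sorted order, so no further data-dependent allocations are needed.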
