[GraphBolt][CUDA] Async sample neighbors and compaction. #7682

mfbalin · 2024-08-11T18:50:06Z

Description

We want to hide the latency of GPU-CPU synchronization by pipelining, which should eliminate the white (idle) gaps in the profile below:

Before: (profiler screenshot)

After: (profiler screenshot)

We achieve this by pipelining the stages so that a stage does not need the output of the stage launched just before it. The kernels for all stages can then be launched asynchronously at the same time, leaving no white gaps.
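The pipelining idea can be illustrated with a CPU-only sketch. This is an analogy under stated assumptions, not the GraphBolt implementation: threads stand in for CUDA kernels, `fake_stage`, `run_stage`, `sequential`, and `pipelined` are hypothetical names, and the handle returned for `async_op=True` is a plain `concurrent.futures.Future` rather than whatever GraphBolt returns. The `async_op` keyword only mirrors the flag name mentioned in this PR's commits.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for asynchronous GPU execution (assumption: one worker per stage).
_pool = ThreadPoolExecutor(max_workers=4)


def fake_stage(x, delay=0.01):
    """Hypothetical stage: pretends to run a kernel whose result needs a sync."""
    time.sleep(delay)
    return x + 1


def run_stage(x, async_op=False):
    """Launch a stage; return a handle if async_op is True, else the result."""
    handle = _pool.submit(fake_stage, x)
    return handle if async_op else handle.result()


def sequential(batches, num_stages=3):
    """Every stage blocks before the next one starts (the 'white gaps' case)."""
    results = []
    for b in batches:
        for _ in range(num_stages):
            b = run_stage(b)  # waits for the result each time
        results.append(b)
    return results


def pipelined(batches, num_stages=3):
    """Stage s works on the batch that stage s-1 finished in the previous step."""
    slots = [None] * num_stages  # slots[s] is the pending input of stage s
    stream = iter(batches)
    results = []
    while len(results) < len(batches):
        slots[0] = next(stream, None)
        # Launch all stages first; none of them needs another's fresh output.
        handles = [
            run_stage(x, async_op=True) if x is not None else None
            for x in slots
        ]
        # Wait once per step instead of once per stage.
        outputs = [h.result() if h is not None else None for h in handles]
        if outputs[-1] is not None:
            results.append(outputs[-1])
        # Shift outputs so each becomes the input of the following stage.
        slots = [None] + outputs[:-1]
    return results
```

Both functions return the same values for the same input; the pipelined version simply needs fewer blocking waits (five steps instead of nine stage calls for three batches and three stages in this toy setup).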

Preliminary results:

Without torch.compile:

Without asynchronous: 1.51s
With asynchronous: 1.37s

With torch.compile:

Without asynchronous: 1.11s
With asynchronous: 0.98s
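As a quick check of what these timings imply, the relative improvement can be computed directly from the four numbers above. The helper name `improvement` is made up for this sketch, and the PR does not state what workload the timings refer to, so the figures below are only ratios of the reported values.

```python
def improvement(before_s, after_s):
    """Return (speedup factor, percent time saved) for two timings in seconds."""
    speedup = before_s / after_s
    saved_pct = (before_s - after_s) / before_s * 100.0
    return speedup, saved_pct


# Timings reported in the PR description, in seconds.
eager = improvement(1.51, 1.37)      # without torch.compile
compiled = improvement(1.11, 0.98)   # with torch.compile

print(f"eager:    {eager[0]:.2f}x, {eager[1]:.1f}% less time")
print(f"compiled: {compiled[0]:.2f}x, {compiled[1]:.1f}% less time")
```

By this arithmetic the asynchronous path saves roughly 9% of the time without torch.compile and roughly 12% with it.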

Checklist

Please feel free to remove inapplicable items for your PR.

The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
I've leveraged the tools to beautify the Python and C++ code.
The PR is complete and small; read the Google eng practice (a CL is equivalent to a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code changes to be small (examples, tests, and documentation can be exempted).
All changes have test coverage
Code is well-documented
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
Related issue is referenced in this PR
If the PR is for a new model/paper, I've updated the example index here.

Changes

dgl-bot · 2024-08-11T18:50:33Z

To trigger regression tests:

@dgl-bot run [instance-type] [which tests] [compare-with-branch];
For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot · 2024-08-11T19:21:50Z

Commit ID: 9043fc16b7b2444ef65c62d197348452f230d45e

Build ID: 1

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2024-08-12T02:53:27Z

Commit ID: 2eda203

Build ID: 2

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2024-08-12T14:49:42Z

Commit ID: b3b17bb

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2024-08-12T21:04:16Z

Commit ID: 9453834

Build ID: 4

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot · 2024-08-12T21:35:15Z

Commit ID: 7f541e8

Build ID: 5

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2024-08-13T02:01:58Z

Commit ID: c9d789d902df587dc3937f3454652bf181abc896

Build ID: 6

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

dgl-bot · 2024-08-13T02:55:21Z

Commit ID: d9376023f47221257fc1a97371d7d1d730e96f2c

Build ID: 7

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2024-08-13T02:57:43Z

Commit ID: 2fbbdedb4c4c19e54fa217f5c444ba9fb113e4d4

Build ID: 8

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot · 2024-08-13T03:28:57Z

Commit ID: 1c429e55f75f7f35ef5d0958a72c65b0363b93c9

Build ID: 9

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2024-08-13T14:27:15Z

Commit ID: 5e802549614b9d0c09f7b24945f05f3658306c2e

Build ID: 10

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2024-08-13T16:14:37Z

Commit ID: c6a8414

Build ID: 11

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2024-08-14T20:26:32Z

Commit ID: 61c8fa9939e3456a64abcdf5d7fa9fe504bc04c9

Build ID: 12

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot · 2024-08-14T20:57:49Z

Commit ID: 2375b47d860b59439158c013592fd347b635d2d5

Build ID: 13

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot · 2024-08-14T21:47:24Z

Commit ID: a57474c4b2e4d1a76cadf083e92805913a462290

Build ID: 14

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

frozenbugs · 2024-08-20T09:53:06Z

LGTM

mfbalin added 2 commits August 11, 2024 14:15

[GraphBolt][CUDA] Async sampling and compacting.

7b85082

add async unique as well.

3f332b3

mfbalin requested a review from frozenbugs August 11, 2024 18:50

mfbalin mentioned this pull request Aug 11, 2024

[GraphBolt][CUDA] Refactor overlap_graph_fetch, simplify gb.DataLoader. #7681

Merged

8 tasks

Merge branch 'master' into gb_cuda_async_sample_neighbors

2eda203

backup

b3b17bb

mfbalin marked this pull request as draft August 12, 2024 14:18

mfbalin added 2 commits August 12, 2024 16:58

more progress

9453834

enable in the example for testing.

7f541e8

make compact async as well.

6f4deac

fix the impl.

ddfd026

mfbalin marked this pull request as ready for review August 13, 2024 02:56

remove example change.

3c76e58

mfbalin force-pushed the gb_cuda_async_sample_neighbors branch from 76b766d to 3c76e58 Compare August 13, 2024 02:56

rename is_asynchronous to async_op, following torch.

7d0e634

mfbalin mentioned this pull request Aug 13, 2024

[GraphBolt][CUDA] Use same CUDA stream in async op. #7693

Merged

8 tasks

Merge branch 'master' into gb_cuda_async_sample_neighbors

c6a8414

add missing docstring

0666c0e

add more missing docstring.

c0d2107

fix the test.

bdbfa4a

mfbalin added the "expedited" label (if it doesn't affect the main path, approve first to unblock related projects, and review later) Aug 14, 2024

mfbalin merged commit 60d0b66 into dmlc:master Aug 14, 2024
2 checks passed

mfbalin deleted the gb_cuda_async_sample_neighbors branch August 14, 2024 21:51

mfbalin mentioned this pull request Aug 15, 2024

[GraphBolt][CUDA] Eliminate GPU synchronizations as much as we can. #6910

Closed

6 tasks

This pull request was closed.
