[GraphBolt][CUDA] Incremental GPU graph cache into gb.Dataloader. #7475

Merged: mfbalin merged 12 commits into dmlc:master from gb_cuda_gpu_graph_cache_py on Jun 26, 2024

Conversation

mfbalin
Collaborator

@mfbalin mfbalin commented Jun 25, 2024

Description

Depends on #7470. The C++ changes were moved to #7483.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (CL equals PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referred in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@mfbalin mfbalin requested a review from frozenbugs June 25, 2024 04:34
@dgl-bot
Collaborator

dgl-bot commented Jun 25, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Collaborator

dgl-bot commented Jun 25, 2024

Commit ID: 40d6e841c8a3cf9db266bff53c2382eb840c4be2

Build ID: 1

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 25, 2024

Commit ID: 61294c44603fa891c6600b8199f45b6b295b10e1

Build ID: 2

Status: ❌ CI test failed in Stage [Torch CPU (Win64) Unit test].

Report path: link

Full logs path: link

python/dgl/graphbolt/dataloader.py (review thread: outdated, resolved)
python/dgl/graphbolt/impl/neighbor_sampler.py (review thread: outdated, resolved)
@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: c21735d

Build ID: 3

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: ee84274

Build ID: 4

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

@mfbalin mfbalin added the "Release Candidate" label (Candidate PRs for the upcoming release) Jun 26, 2024
@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: 54a594a

Build ID: 5

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@mfbalin mfbalin changed the title from "[GraphBolt][CUDA] Incremental GPU graph cache into gb.Dataloader." to "[GraphBolt][CUDA][DO NOT MERGE] Incremental GPU graph cache into gb.Dataloader." Jun 26, 2024
@mfbalin mfbalin changed the title from "[GraphBolt][CUDA][DO NOT MERGE] Incremental GPU graph cache into gb.Dataloader." to "[GraphBolt][CUDA] Incremental GPU graph cache into gb.Dataloader." Jun 26, 2024
@mfbalin mfbalin changed the title from "[GraphBolt][CUDA] Incremental GPU graph cache into gb.Dataloader." to "[GraphBolt][CUDA][DO NOT MERGE] Incremental GPU graph cache into gb.Dataloader." Jun 26, 2024
@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: 71f4a6c

Build ID: 6

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: 48a2bf2

Build ID: 7

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: 7bdd685

Build ID: 8

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin changed the title from "[GraphBolt][CUDA][DO NOT MERGE] Incremental GPU graph cache into gb.Dataloader." to "[GraphBolt][CUDA] Incremental GPU graph cache into gb.Dataloader." Jun 26, 2024
@mfbalin mfbalin requested a review from frozenbugs June 26, 2024 09:41
@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: 44c0aa9

Build ID: 9

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jun 26, 2024

@dgl-bot
@Rhett-Ying CI still failing

@mfbalin
Collaborator Author

mfbalin commented Jun 26, 2024

@dgl-bot

@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: 44c0aa9

Build ID: 10

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: ef03812

Build ID: 11

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: d543580

Build ID: 12

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: b89e81a8b672a6696861f700e2c22f9390973f74

Build ID: 13

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jun 26, 2024

@dgl-bot

@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: ddb369b96b5e32a28a160f6745f3c89d80d02e57

Build ID: 14

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jun 26, 2024

@dgl-bot

@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: c93ca18dc54a32aa4f0079707337a624235556ca

Build ID: 15

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jun 26, 2024

@dgl-bot

@mfbalin
Collaborator Author

mfbalin commented Jun 26, 2024

tests/distributed/test_partition.py::test_partition[None-True-1-4-metis] Converting to homogeneous graph takes 0.001s, peak mem: 2.438 GB
Convert a graph into a bidirected graph: 0.000 seconds, peak memory: 2.438 GB
Construct multi-constraint weights: 0.000 seconds, peak memory: 2.438 GB
Fatal Python error: Segmentation fault

Thread 0x00007f30deedf700 (most recent call first):
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 324 in wait
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 607 in wait
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f310bbe5800 (most recent call first):
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7475@2/python/dgl/partition.py", line 385 in metis_partition_assignment
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7475@2/python/dgl/distributed/partition.py", line 1001 in partition_graph
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7475@2/tests/distributed/test_partition.py", line 354 in check_partition
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7475@2/tests/distributed/test_partition.py", line 532 in test_partition
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/python.py", line 162 in pytest_pyfunc_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/python.py", line 1627 in runtest
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 173 in pytest_runtest_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 241 in <lambda>
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 341 in from_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 240 in call_and_report
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 135 in runtestprotocol
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 116 in pytest_runtest_protocol
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 364 in pytest_runtestloop
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 339 in _main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 285 in wrap_session
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 332 in pytest_cmdline_main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/config/__init__.py", line 178 in main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/config/__init__.py", line 206 in console_main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pytest/__main__.py", line 7 in <module>
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/runpy.py", line 86 in _run_code
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/runpy.py", line 196 in _run_module_as_main

@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: 0d63781ba5ff33b3af3eb15d5a06a8bc004a4629

Build ID: 16

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin merged commit 95dc96a into dmlc:master Jun 26, 2024
2 checks passed
@mfbalin mfbalin deleted the gb_cuda_gpu_graph_cache_py branch June 26, 2024 19:38
This pull request was closed.
Labels
Release Candidate (Candidate PRs for the upcoming release)
4 participants