[GraphBolt][CUDA] Incremental GPU graph caching. #7470

Merged (25 commits) on Jun 26, 2024


mfbalin
Collaborator

@mfbalin mfbalin commented Jun 20, 2024

Description

A follow-up PR will add the dataloader logic so that the cache can be used for faster GPU sampling.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (CL equals PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referenced in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@mfbalin mfbalin requested a review from frozenbugs June 20, 2024 20:38
@mfbalin mfbalin marked this pull request as draft June 20, 2024 20:38
@dgl-bot
Collaborator

dgl-bot commented Jun 20, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Collaborator

dgl-bot commented Jun 20, 2024

Commit ID: 0c0f115

Build ID: 1

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 20, 2024

Commit ID: 2a74efc

Build ID: 2

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 20, 2024

Commit ID: 8f170b9

Build ID: 3

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 20, 2024

Commit ID: 66c9336

Build ID: 4

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@mfbalin mfbalin marked this pull request as ready for review June 20, 2024 23:57
@mfbalin
Collaborator Author

mfbalin commented Jun 20, 2024

@frozenbugs let's land this PR and the follow-up PR before we release DGL 2.3. That way, we can say the new release has this major feature.

@dgl-bot
Collaborator

dgl-bot commented Jun 20, 2024

Commit ID: bba8975

Build ID: 5

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 21, 2024

Commit ID: 58302d4

Build ID: 6

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 21, 2024

Commit ID: daa9d36

Build ID: 7

Status: ❌ CI test failed in Stage [Lint Check].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 21, 2024

Commit ID: 3366909

Build ID: 8

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 21, 2024

Commit ID: 58bfe3c

Build ID: 9

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 21, 2024

Commit ID: bade092

Build ID: 10

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jun 21, 2024

@dgl-bot

@dgl-bot
Collaborator

dgl-bot commented Jun 21, 2024

Commit ID: bade092

Build ID: 11

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 21, 2024

Commit ID: d8fa4f0

Build ID: 12

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 25, 2024

Commit ID: 8b58ed5ba6bb7727f3da01414b87512a2ea784ad

Build ID: 25

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 25, 2024

Commit ID: fba294d64051e40584ba8440572e95f26e100b39

Build ID: 26

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 25, 2024

Commit ID: 78eea0c056cee45c674e1e95a0a404f85236b890

Build ID: 27

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

graphbolt/src/cuda/extension/gpu_graph_cache.h (outdated, resolved)
graphbolt/src/cuda/extension/gpu_graph_cache.h (outdated, resolved)
python/dgl/graphbolt/impl/gpu_graph_cache.py (outdated, resolved)
graphbolt/src/cuda/extension/gpu_graph_cache.cu (outdated, resolved)
@mfbalin mfbalin requested a review from frozenbugs June 26, 2024 03:05
@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: 7ff8213343bb3c1f5633bd3cfaee0b0b43e1380e

Build ID: 28

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: b2e10fc

Build ID: 29

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: bb18427

Build ID: 30

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@mfbalin mfbalin added the Release Candidate label (Candidate PRs for the upcoming release) Jun 26, 2024
@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: bb18427

Build ID: 31

Status: ❌ CI test failed in Stage [Distributed Torch CPU Unit test].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jun 26, 2024

@Rhett-Ying CI failure output:

tests/distributed/test_mp_dataloader.py::test_edge_dataloader_homograph[False-self-False-0] Fatal Python error: Segmentation fault

Thread 0x00007fb4b8fdf700 (most recent call first):
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 324 in wait
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 607 in wait
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007fb4e5ce6800 (most recent call first):
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/codecs.py", line 322 in decode
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7470@2/python/dgl/partition.py", line 271 in get_peak_mem
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7470@2/python/dgl/distributed/partition.py", line 921 in partition_graph
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7470@2/tests/distributed/test_mp_dataloader.py", line 770 in check_dataloader
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7470@2/tests/distributed/test_mp_dataloader.py", line 914 in test_edge_dataloader_homograph
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/python.py", line 162 in pytest_pyfunc_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/python.py", line 1627 in runtest
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 173 in pytest_runtest_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 241 in <lambda>
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 341 in from_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 240 in call_and_report
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 135 in runtestprotocol
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 116 in pytest_runtest_protocol
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 364 in pytest_runtestloop
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 339 in _main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 285 in wrap_session
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 332 in pytest_cmdline_main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 103 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 120 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 513 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/config/__init__.py", line 178 in main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/config/__init__.py", line 206 in console_main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pytest/__main__.py", line 7 in <module>
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/runpy.py", line 86 in _run_code
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/runpy.py", line 196 in _run_module_as_main

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, dgl._ffi._cy3.core, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.linalg._propack._spropack, scipy.sparse.linalg._propack._dpropack, scipy.sparse.linalg._propack._cpropack, scipy.sparse.linalg._propack._zpropack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, psutil._psutil_linux, psutil._psutil_posix, pyarrow.lib, pyarrow._parquet, pyarrow._fs, pyarrow._azurefs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, 
pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._cdflib, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, yaml._yaml (total: 115)
tests/scripts/task_distributed_test.sh: line 37:   638 Segmentation fault      (core dumped) python3 -m pytest -v --capture=tee-sys --junitxml=pytest_distributed.xml --durations=100 tests/distributed/*.py
FAIL: distributed

@Rhett-Ying
Collaborator

@mfbalin please skip it. It's a known issue.

@mfbalin
Collaborator Author

mfbalin commented Jun 26, 2024

@mfbalin please skip it. It's a known issue.

I cannot skip it. I don't have permission to merge into master until the CI clears.

@dgl-bot
Collaborator

dgl-bot commented Jun 26, 2024

Commit ID: 5ed6d0c

Build ID: 32

Status: ❌ CI test failed in Stage [Torch CPU Example test].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jun 26, 2024

Another test failure:

+ bash tests/scripts/task_example_test.sh cpu
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.2.0, pluggy-1.5.0 -- /opt/conda/envs/pytorch-ci/bin/python3
cachedir: .pytest_cache
rootdir: /home/ubuntu/jenkins/workspace/dgl_PR-7470
configfile: pyproject.toml
collecting ... collected 12 items

tests/examples/test_sampling_examples.py::test_node_classification PASSED [  8%]
tests/examples/test_sampling_examples.py::test_link_prediction PASSED    [ 16%]
tests/examples/test_sparse_examples.py::test_gcn PASSED                  [ 25%]
tests/examples/test_sparse_examples.py::test_gcnii PASSED                [ 33%]
tests/examples/test_sparse_examples.py::test_appnp PASSED                [ 41%]
tests/examples/test_sparse_examples.py::test_c_and_s PASSED              [ 50%]
tests/examples/test_sparse_examples.py::test_gat PASSED                  [ 58%]
tests/examples/test_sparse_examples.py::test_hgnn PASSED                 [ 66%]
tests/examples/test_sparse_examples.py::test_hypergraphatt PASSED        [ 75%]
tests/examples/test_sparse_examples.py::test_sgc PASSED                  [ 83%]
tests/examples/test_sparse_examples.py::test_sign FAILED                 [ 91%]
tests/examples/test_sparse_examples.py::test_twirls PASSED               [100%]

=================================== FAILURES ===================================
__________________________________ test_sign ___________________________________

    def test_sign():
        script = os.path.join(EXAMPLE_ROOT, "sign.py")
        out = subprocess.run(["python", str(script)], capture_output=True)
        assert (
            out.returncode == 0
        ), f"stdout: {out.stdout.decode('utf-8')}\nstderr: {out.stderr.decode('utf-8')}"
        stdout = out.stdout.decode("utf-8")
>       assert float(stdout[-5:]) > 0.7
E       AssertionError: assert 0.697 > 0.7
E        +  where 0.697 = float('.697\n')

tests/examples/test_sparse_examples.py:101: AssertionError
- generated xml file: /home/ubuntu/jenkins/workspace/dgl_PR-7470/pytest_backend.xml -
============================ slowest 100 durations =============================
34.74s call     tests/examples/test_sparse_examples.py::test_twirls
21.53s call     tests/examples/test_sparse_examples.py::test_gcnii
9.20s call     tests/examples/test_sparse_examples.py::test_hypergraphatt
8.06s call     tests/examples/test_sampling_examples.py::test_node_classification
7.80s call     tests/examples/test_sampling_examples.py::test_link_prediction
6.16s call     tests/examples/test_sparse_examples.py::test_hgnn
3.27s call     tests/examples/test_sparse_examples.py::test_appnp
3.22s call     tests/examples/test_sparse_examples.py::test_gat
3.18s call     tests/examples/test_sparse_examples.py::test_gcn
2.82s call     tests/examples/test_sparse_examples.py::test_sign
2.61s call     tests/examples/test_sparse_examples.py::test_sgc
2.55s call     tests/examples/test_sparse_examples.py::test_c_and_s

(24 durations < 0.005s hidden.  Use -vv to show these durations.)
=========================== short test summary info ============================
FAILED tests/examples/test_sparse_examples.py::test_sign - AssertionError: assert 0.697 > 0.7
 +  where 0.697 = float('.697\n')
=================== 1 failed, 11 passed in 105.21s (0:01:45) ===================
FAIL: sparse examples on cpu
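
The failing assertion parses the last five characters of the script's stdout as the accuracy, so a run that lands at 0.697 misses the 0.7 threshold by a small margin. A minimal sketch of that check, assuming a hypothetical final output line (`sample_stdout`) and a hypothetical helper name (`parse_final_accuracy`); only the `float(stdout[-5:]) > 0.7` logic comes from the log above:

```python
def parse_final_accuracy(stdout: str, width: int = 5) -> float:
    """Parse the trailing `width` characters of stdout as a float.

    Mirrors the test's `float(stdout[-5:])`; float() ignores the
    trailing newline.
    """
    return float(stdout[-width:])


# Assumed output format; the real sign.py output is not shown in the log.
sample_stdout = "Test acc: 0.697\n"

acc = parse_final_accuracy(sample_stdout)
passed = acc > 0.7  # same threshold as the assertion in test_sign

print(f"tail={sample_stdout[-5:]!r} acc={acc} passed={passed}")
```

Because the check compares a single training run against a fixed threshold, a small run-to-run variation in accuracy may be enough to flip the result, independent of the changes in this PR.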

@Rhett-Ying
Collaborator

I've just made a change to the CI and rebased this PR. Let's see whether it works now. If not, I will merge it as long as the other CI tests pass.

@mfbalin
Collaborator Author

mfbalin commented Jun 26, 2024

I've just made a change to the CI and rebased this PR. Let's see whether it works now. If not, I will merge it as long as the other CI tests pass.

The CI already passed: #7470 (comment)
The only change I made is to improve a comment in the code. I would appreciate it if you merged it now.

@Rhett-Ying Rhett-Ying merged commit c822bc1 into master Jun 26, 2024
1 of 2 checks passed
@Rhett-Ying Rhett-Ying deleted the gb_cuda_gpu_graph_cache branch June 26, 2024 05:26
This pull request was closed.
Labels
Release Candidate (Candidate PRs for the upcoming release)

4 participants