
[GraphBolt] Add optimized unique_and_compact_batched. #7239

Merged: 42 commits, Apr 7, 2024

Conversation

@mfbalin mfbalin commented Mar 24, 2024

Description

Unique and compact requires GPU synchronizations. When we call it separately for each etype, it slows down a lot, so I made it batched so that the synchronizations are shared across etypes. I also added a truly batched algorithm based on hash tables (it should produce exactly the same output as the CPU version). However, the map-based code can only run on CUDA compute capability >= 70. That is why we keep the batched algorithm based on sorting and enable the map-based code only for newer GPUs. To compile the newly added map code only for new GPU architectures, we create a CUDA extension library that we link into GraphBolt.
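As a rough illustration of the intended semantics, here is a minimal Python sketch of a hash-table-based unique-and-compact and its batched variant. This is not the actual GraphBolt implementation; all names and the exact output contract (first-seen ordering) are assumptions for illustration only.

```python
# Hypothetical CPU-reference semantics: map global IDs to a compact
# local ID space using a hash table (first-seen order assumed).
def unique_and_compact(ids_list):
    """Return (unique_ids, compacted): unique_ids holds each distinct
    ID in first-seen order; each input list is rewritten with IDs
    replaced by their index into unique_ids."""
    id_to_local = {}  # hash-table-based compaction
    unique_ids = []
    compacted = []
    for ids in ids_list:
        out = []
        for i in ids:
            if i not in id_to_local:
                id_to_local[i] = len(unique_ids)
                unique_ids.append(i)
            out.append(id_to_local[i])
        compacted.append(out)
    return unique_ids, compacted

def unique_and_compact_batched(per_etype_ids):
    # Batched variant: process every etype in a single call, so a GPU
    # implementation can share its synchronization points across etypes
    # instead of synchronizing once per etype.
    return [unique_and_compact(ids_list) for ids_list in per_etype_ids]
```

The point of the batched entry point is not a different result but a different cost model: one call amortizes the fixed synchronization overhead over all etypes.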

With #7264 and this PR, we should officially be faster than DGL for every use case, whether it is puregpu, UVA, etc.


Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (a CL is equivalent to a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation may be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • The related issue is referenced in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

dgl-bot commented Mar 24, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot commented Mar 24, 2024

Commit ID: 8a7610b

Build ID: 1

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented Mar 24, 2024

Commit ID: 1e9c1cb

Build ID: 2

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@mfbalin mfbalin linked an issue Mar 24, 2024 that may be closed by this pull request
dgl-bot commented Mar 24, 2024

Commit ID: 84f65fe

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented Mar 26, 2024

Commit ID: eee35d6

Build ID: 4

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin requested a review from TristonC March 28, 2024 19:03
dgl-bot commented Mar 28, 2024

Commit ID: ddbcd91968682eb8f06a75729503ff57e4c9c886

Build ID: 5

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

mfbalin commented Mar 28, 2024

@TristonC is my dispatch mechanism correct, so that I can run one code path for compute capability >= 70 and another otherwise?
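For illustration, the dispatch idea can be sketched as follows: choose the hash-map-based path when the device's compute capability is at least 7.0 and fall back to the sort-based path otherwise. This is a minimal Python sketch with hypothetical names, not the actual GraphBolt dispatch code; in the real PR the selection happens in C++/CUDA at the extension-library boundary.

```python
# Hypothetical dispatch on compute capability (major, minor).
MAP_BASED_MIN_CC = (7, 0)  # Volta and newer support the map kernel

def _compact_with_map(ids):
    # Stand-in for the hash-table kernel: compact in first-seen order.
    seen, unique = {}, []
    for i in ids:
        if i not in seen:
            seen[i] = len(unique)
            unique.append(i)
    return unique, [seen[i] for i in ids]

def _compact_with_sort(ids):
    # Stand-in for the sort-based fallback kernel, arranged here to
    # match the map-based output (the PR requires identical results).
    return _compact_with_map(ids)

def unique_and_compact(ids, compute_capability):
    # Tuple comparison gives the right ordering, e.g. (6, 1) < (7, 0).
    kernel = (_compact_with_map
              if compute_capability >= MAP_BASED_MIN_CC
              else _compact_with_sort)
    return kernel(ids)
```

Because both paths must produce identical output, tests can run the same inputs through both branches and compare.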

dgl-bot commented Mar 28, 2024

Commit ID: 82d52eb

Build ID: 6

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

dgl-bot commented Mar 28, 2024

Commit ID: cbea9e6

Build ID: 7

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

dgl-bot commented Mar 28, 2024

Commit ID: 1ae18e1

Build ID: 8

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

@mfbalin mfbalin changed the title [GraphBolt] Add unique_and_compact_batched. [GraphBolt] Add optimized unique_and_compact_batched. Mar 29, 2024
@mfbalin mfbalin force-pushed the gb_batched_unique_and_compact branch from 6fea3d2 to 3fd7efd Compare March 31, 2024 04:10
mfbalin commented Apr 2, 2024

@Rhett-Ying CI failure:

tests/distributed/test_partition.py::test_partition_graph_graphbolt_homo[True-False-True-False-4-random] Converting to homogeneous graph takes 0.001s, peak mem: 3.929 GB
Reshuffle nodes and edges: 0.001 seconds
Split the graph: 0.003 seconds
Fatal Python error: Segmentation fault

Thread 0x00007f2c0475e700 (most recent call first):
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 324 in wait
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 607 in wait
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f2c33492800 (most recent call first):
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7239@2/python/dgl/backend/pytorch/tensor.py", line 126 in astype
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7239@2/python/dgl/partition.py", line 223 in create_subgraph
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7239@2/python/dgl/partition.py", line 239 in partition_graph_with_halo
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7239@2/python/dgl/distributed/partition.py", line 1016 in partition_graph
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7239@2/tests/distributed/test_partition.py", line 1051 in test_partition_graph_graphbolt_homo
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/python.py", line 1831 in runtest
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 170 in pytest_runtest_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 263 in <lambda>
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 342 in from_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 262 in call_runtest_hook
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 223 in call_and_report
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 134 in runtestprotocol
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 115 in pytest_runtest_protocol
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 352 in pytest_runtestloop
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 327 in _main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 273 in wrap_session
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 320 in pytest_cmdline_main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/config/__init__.py", line 175 in main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/config/__init__.py", line 198 in console_main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pytest/__main__.py", line 7 in <module>
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/runpy.py", line 86 in _run_code
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/runpy.py", line 196 in _run_module_as_main

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, dgl._ffi._cy3.core, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, psutil._psutil_linux, psutil._psutil_posix, pyarrow.lib, pyarrow._hdfsio, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, 
pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, yaml._yaml (total: 111)
tests/scripts/task_distributed_test.sh: line 37:   649 Segmentation fault      (core dumped) python3 -m pytest -v --capture=tee-sys --junitxml=pytest_distributed.xml --durations=100 tests/distributed/*.py

mfbalin commented Apr 2, 2024

@dgl-bot

dgl-bot commented Apr 2, 2024

Commit ID: da15251

Build ID: 36

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented Apr 2, 2024

Commit ID: da15251

Build ID: 37

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented Apr 3, 2024

Commit ID: b930b0f

Build ID: 38

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin requested a review from frozenbugs April 3, 2024 06:56
dgl-bot commented Apr 3, 2024

Commit ID: 978811e796848cd0a9a4e85660bd8fd5cda6dd81

Build ID: 39

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented Apr 3, 2024

Commit ID: 899d30e

Build ID: 40

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented Apr 3, 2024

Commit ID: 2d574b932a4586be94326bfde3587e935d1a81e6

Build ID: 41

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented Apr 7, 2024

Commit ID: 85a1cf2

Build ID: 42

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented Apr 7, 2024

Commit ID: 2b22578

Build ID: 43

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

Resolved review threads on: graphbolt/src/cuda/unique_and_compact_impl.cu, graphbolt/src/cuda/extension/unique_and_compact_map.cu
mfbalin commented Apr 7, 2024

If we need to dispatch another kernel based on the GPU's compute capability in the future, similar to what we did here, we can refactor the logic in the code. Let's keep it in mind.

dgl-bot commented Apr 7, 2024

Commit ID: 8a547ca

Build ID: 44

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

mfbalin commented Apr 7, 2024

Merging this PR. I will monitor the regression tests to see how much it helped. Feel free to comment and suggest improvements here; I will make another PR to address them.
