
[GraphBolt] Add optimized unique_and_compact_batched. #7239

Merged: 42 commits, Apr 7, 2024

Conversation

@mfbalin mfbalin commented Mar 24, 2024

Description

Unique and compact requires GPU synchronizations. When we call it separately for each etype, it slows down a lot, so I made it batched so that the synchronizations are shared across etypes. I also added a truly batched algorithm based on hash tables (it should produce exactly the same output as the CPU version). However, the map-based code can only run on CUDA compute capability >= 70. That is why we keep the batched algorithm based on sorting and enable the map-based code only for newer GPUs. To compile the newly added map code only for new GPU architectures, we create a CUDA extension library that we link into GraphBolt.
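As a rough illustration of the intended semantics, here is a minimal Python sketch of a hash-table-based unique-and-compact and its batched variant. This is not the actual GraphBolt implementation; all names and the exact output contract (first-seen ordering) are assumptions for illustration only.

```python
# Hypothetical CPU-reference semantics: map global IDs to a compact
# local ID space using a hash table (first-seen order assumed).
def unique_and_compact(ids_list):
    """Return (unique_ids, compacted): unique_ids holds each distinct
    ID in first-seen order; each input list is rewritten with IDs
    replaced by their index into unique_ids."""
    id_to_local = {}  # hash-table-based compaction
    unique_ids = []
    compacted = []
    for ids in ids_list:
        out = []
        for i in ids:
            if i not in id_to_local:
                id_to_local[i] = len(unique_ids)
                unique_ids.append(i)
            out.append(id_to_local[i])
        compacted.append(out)
    return unique_ids, compacted

def unique_and_compact_batched(per_etype_ids):
    # Batched variant: process every etype in a single call, so a GPU
    # implementation can share its synchronization points across etypes
    # instead of synchronizing once per etype.
    return [unique_and_compact(ids_list) for ids_list in per_etype_ids]
```

The point of the batched entry point is not a different result but a different cost model: one call amortizes the fixed synchronization overhead over all etypes.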

With #7264 and this PR, we should officially be faster than DGL for every use case, whether it is puregpu, UVA, etc.


Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (a CL is equivalent to a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation may be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • The related issue is referenced in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

dgl-bot commented Mar 24, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot commented Mar 24, 2024

Commit ID: 8a7610b

Build ID: 1

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented Mar 24, 2024

Commit ID: 1e9c1cb

Build ID: 2

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@mfbalin mfbalin linked an issue Mar 24, 2024 that may be closed by this pull request
dgl-bot commented Mar 24, 2024

Commit ID: 84f65fe

Build ID: 3

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented Mar 26, 2024

Commit ID: eee35d6

Build ID: 4

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin requested a review from TristonC March 28, 2024 19:03
dgl-bot commented Mar 28, 2024

Commit ID: ddbcd91968682eb8f06a75729503ff57e4c9c886

Build ID: 5

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

mfbalin commented Mar 28, 2024

@TristonC is my dispatch mechanism correct, so that I can run one code path for compute capability >= 70 and another otherwise?
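For illustration, the dispatch idea can be sketched as follows: choose the hash-map-based path when the device's compute capability is at least 7.0 and fall back to the sort-based path otherwise. This is a minimal Python sketch with hypothetical names, not the actual GraphBolt dispatch code; in the real PR the selection happens in C++/CUDA at the extension-library boundary.

```python
# Hypothetical dispatch on compute capability (major, minor).
MAP_BASED_MIN_CC = (7, 0)  # Volta and newer support the map kernel

def _compact_with_map(ids):
    # Stand-in for the hash-table kernel: compact in first-seen order.
    seen, unique = {}, []
    for i in ids:
        if i not in seen:
            seen[i] = len(unique)
            unique.append(i)
    return unique, [seen[i] for i in ids]

def _compact_with_sort(ids):
    # Stand-in for the sort-based fallback kernel, arranged here to
    # match the map-based output (the PR requires identical results).
    return _compact_with_map(ids)

def unique_and_compact(ids, compute_capability):
    # Tuple comparison gives the right ordering, e.g. (6, 1) < (7, 0).
    kernel = (_compact_with_map
              if compute_capability >= MAP_BASED_MIN_CC
              else _compact_with_sort)
    return kernel(ids)
```

Because both paths must produce identical output, tests can run the same inputs through both branches and compare.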

dgl-bot commented Mar 28, 2024

Commit ID: 82d52eb

Build ID: 6

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

dgl-bot commented Mar 28, 2024

Commit ID: cbea9e6

Build ID: 7

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

dgl-bot commented Mar 28, 2024

Commit ID: 1ae18e1

Build ID: 8

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

@mfbalin mfbalin changed the title [GraphBolt] Add unique_and_compact_batched. [GraphBolt] Add optimized unique_and_compact_batched. Mar 29, 2024
@mfbalin mfbalin force-pushed the gb_batched_unique_and_compact branch from 6fea3d2 to 3fd7efd Compare March 31, 2024 04:10
mfbalin commented Apr 2, 2024

@Rhett-Ying CI failure:

tests/distributed/test_partition.py::test_partition_graph_graphbolt_homo[True-False-True-False-4-random] Converting to homogeneous graph takes 0.001s, peak mem: 3.929 GB
Reshuffle nodes and edges: 0.001 seconds
Split the graph: 0.003 seconds
Fatal Python error: Segmentation fault

Thread 0x00007f2c0475e700 (most recent call first):
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 324 in wait
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 607 in wait
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f2c33492800 (most recent call first):
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7239@2/python/dgl/backend/pytorch/tensor.py", line 126 in astype
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7239@2/python/dgl/partition.py", line 223 in create_subgraph
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7239@2/python/dgl/partition.py", line 239 in partition_graph_with_halo
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7239@2/python/dgl/distributed/partition.py", line 1016 in partition_graph
  File "/home/ubuntu/jenkins/workspace/dgl_PR-7239@2/tests/distributed/test_partition.py", line 1051 in test_partition_graph_graphbolt_homo
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/python.py", line 194 in pytest_pyfunc_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/python.py", line 1831 in runtest
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 170 in pytest_runtest_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 263 in <lambda>
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 342 in from_call
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 262 in call_runtest_hook
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 223 in call_and_report
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 134 in runtestprotocol
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/runner.py", line 115 in pytest_runtest_protocol
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 352 in pytest_runtestloop
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 327 in _main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 273 in wrap_session
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/main.py", line 320 in pytest_cmdline_main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_callers.py", line 102 in _multicall
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_manager.py", line 119 in _hookexec
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pluggy/_hooks.py", line 501 in __call__
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/config/__init__.py", line 175 in main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/_pytest/config/__init__.py", line 198 in console_main
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/pytest/__main__.py", line 7 in <module>
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/runpy.py", line 86 in _run_code
  File "/opt/conda/envs/pytorch-ci/lib/python3.10/runpy.py", line 196 in _run_module_as_main

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, dgl._ffi._cy3.core, scipy._lib._ccallback_c, scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._flinalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_sqrtm_triu, scipy.linalg.cython_blas, scipy.linalg._matfuncs_expm, scipy.linalg._decomp_update, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpack, scipy.sparse.csgraph._tools, scipy.sparse.csgraph._shortest_path, scipy.sparse.csgraph._traversal, scipy.sparse.csgraph._min_spanning_tree, scipy.sparse.csgraph._flow, scipy.sparse.csgraph._matching, scipy.sparse.csgraph._reordering, psutil._psutil_linux, psutil._psutil_posix, pyarrow.lib, pyarrow._hdfsio, pyarrow._parquet, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, 
pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, scipy.io.matlab._mio_utils, scipy.io.matlab._streams, scipy.io.matlab._mio5_utils, scipy.spatial._ckdtree, scipy._lib.messagestream, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._distance_wrap, scipy.spatial._hausdorff, scipy.special._ufuncs_cxx, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.special._ellip_harm_2, scipy.spatial.transform._rotation, yaml._yaml (total: 111)
tests/scripts/task_distributed_test.sh: line 37:   649 Segmentation fault      (core dumped) python3 -m pytest -v --capture=tee-sys --junitxml=pytest_distributed.xml --durations=100 tests/distributed/*.py

mfbalin commented Apr 2, 2024

@dgl-bot

dgl-bot commented Apr 2, 2024

Commit ID: da15251

Build ID: 36

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented Apr 2, 2024

Commit ID: da15251

Build ID: 37

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented Apr 3, 2024

Commit ID: b930b0f

Build ID: 38

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin requested a review from frozenbugs April 3, 2024 06:56
dgl-bot commented Apr 3, 2024

Commit ID: 978811e796848cd0a9a4e85660bd8fd5cda6dd81

Build ID: 39

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented Apr 3, 2024

Commit ID: 899d30e

Build ID: 40

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented Apr 3, 2024

Commit ID: 2d574b932a4586be94326bfde3587e935d1a81e6

Build ID: 41

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot commented Apr 7, 2024

Commit ID: 85a1cf2

Build ID: 42

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot commented Apr 7, 2024

Commit ID: 2b22578

Build ID: 43

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

Resolved review threads on: graphbolt/src/cuda/unique_and_compact_impl.cu, graphbolt/src/cuda/extension/unique_and_compact_map.cu
mfbalin commented Apr 7, 2024

If we need to dispatch another kernel based on the GPU's compute capability in the future, similar to what we did here, we can refactor the logic in the code. Let's keep it in mind.

dgl-bot commented Apr 7, 2024

Commit ID: 8a547ca

Build ID: 44

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

mfbalin commented Apr 7, 2024

Merging this PR. I will monitor the regression tests to see how much it helped. Feel free to comment and suggest improvements here; I will make another PR to address them.
