[Performance][CUDA] Faster CSRToCOO #5648

mfbalin · 2023-05-02T20:58:11Z

Description

CUB recently added a DeviceCopy::Batched algorithm, which can be used to implement DeviceRunLength::Decode, which in turn gives a more efficient GPU implementation of CSRToCOO for the 64-bit code path. The current implementation has O(E log N) complexity; this PR reduces it to O(E).
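To make the complexity difference concrete, here is a minimal CPU-side Python sketch; it is not the CUDA code in this PR. It assumes E is the number of edges and N the number of rows described by the CSR `indptr` array, and both function names are hypothetical. The first function finds each edge's row by binary search over `indptr`, which is one way to arrive at O(E log N); whether the existing kernel does exactly this is an assumption. The second expands row i into `indptr[i + 1] - indptr[i]` copies of i, which is a run-length decode and does linear work.

```python
from bisect import bisect_right


def csr_to_coo_rows_search(indptr):
    """Row index of every edge via one binary search per edge: O(E log N)."""
    num_edges = indptr[-1]
    # The row of edge e is the last i with indptr[i] <= e.
    return [bisect_right(indptr, e) - 1 for e in range(num_edges)]


def csr_to_coo_rows_decode(indptr):
    """Row index of every edge via run-length decode: linear in N + E."""
    rows = []
    for i in range(len(indptr) - 1):
        run_length = indptr[i + 1] - indptr[i]  # number of edges in row i
        rows.extend([i] * run_length)           # emit row i that many times
    return rows


indptr = [0, 2, 2, 5, 6]  # 4 rows, 6 edges; row 1 is empty
print(csr_to_coo_rows_search(indptr))  # [0, 0, 2, 2, 2, 3]
print(csr_to_coo_rows_decode(indptr))  # [0, 0, 2, 2, 2, 3]
```

Both functions return the COO row array; the COO column array is the CSR `indices` array unchanged, so only the row expansion needs to be computed.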

My motivation is that CSRToCOO is called inside the Labor Sampler, so speeding it up also speeds up the sampler and makes its complexity truly linear in the number of edges.

My experimental evaluations show that the Labor Sampler runtime consistently improves with this change.

Checklist

Please feel free to remove inapplicable items for your PR.

The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
I've leveraged the tools to beautify the Python and C++ code.
The PR is complete and small; read the Google eng practice (CL equals PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests and documentation could be exempted).
All changes have test coverage
Code is well-documented
To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

dgl-bot · 2023-05-02T20:58:41Z

Not authorized to trigger CI. Please ask a core developer to trigger it by issuing the comment:

@dgl-bot

mfbalin · 2023-05-02T21:02:52Z

@BarclayII @yaox12, could you take a look? This PR requires using the latest thrust and cub versions.

yaox12 · 2023-05-04T01:59:08Z

@mfbalin

Has this feature been included in any CUB release version?
CUB 2.x introduced the libcu++ dependency. Do you know if it's compatible with CUDA 10.2, since DGL still supports it?

yaox12 · 2023-05-04T02:00:27Z

@dgl-bot

mfbalin · 2023-05-04T02:15:51Z

@yaox12

Not yet. I only recently contributed DeviceCopy to CUB in this PR, so I expect it to be included in the next release, though I don't know the timeline for that release.
It looks like thrust includes cub and libcu++ as submodule dependencies. However, I don't know whether the libcu++ version that comes with thrust is compatible with CUDA 10.2. What I was able to find is that libcu++ was first released for CUDA 10.2; I don't know if CUDA 10.2 support was dropped in later releases.

dgl-bot · 2023-05-04T02:18:55Z

Commit ID: 92b0b17

Build ID: 3

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

yaox12 · 2023-05-04T03:34:31Z

@dgl-bot

dgl-bot · 2023-05-04T03:52:59Z

Commit ID: 855f011

Build ID: 5

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

mfbalin · 2023-05-04T04:07:14Z

It looks like the libcudacxx headers cannot be found even though I added them to the CUDA include directories. What else needs to be done so that those headers can be found?

Error msg: /root/jenkins/workspace/dgl_PR-5648@2/third_party/thrust/thrust/detail/type_traits.h:27:10: fatal error: cuda/std/type_traits: No such file or directory

Edit: added the libcudacxx include directory to the include path; I hope it will work now.

yaox12 · 2023-05-04T05:25:13Z

@dgl-bot

dgl-bot · 2023-07-12T07:35:02Z

Commit ID: 078be43

Build ID: 8

Status: ❌ CI test failed in Stage [Authentication].

Report path: link

Full logs path: link

yaox12 · 2023-07-12T08:43:41Z

@dgl-bot

dgl-bot · 2023-07-12T10:41:22Z

Commit ID: 078be43

Build ID: 9

Status: ❌ CI test failed in Stage [C++ GPU].

Report path: link

Full logs path: link

src/array/cuda/csr2coo.cu

CMakeLists.txt

mfbalin · 2023-07-13T03:19:47Z

I am not sure whether this new code path or the existing int32_t specialization is faster; one would need to benchmark the cuSPARSE csr2coo against CUB. For the int64_t specialization, however, CUB should outperform the existing implementation in all cases.
cuSPARSE code path:

dgl/src/array/cuda/csr2coo.cu

Line 44 in 078be43

CUSPARSE_CALL(cusparseXcsr2coo(

dgl-bot · 2023-07-13T15:29:48Z

Commit ID: 02bdcb5

Build ID: 10

Status: ❌ CI test failed in Stage [C++ GPU].

Report path: link

Full logs path: link

frozenbugs · 2023-07-14T01:16:10Z

@mfbalin can you help us take a look at the failing test?

dgl-bot · 2023-07-14T02:15:27Z

Commit ID: befcf8a

Build ID: 11

Status: ❌ CI test failed in Stage [C++ GPU].

Report path: link

Full logs path: link

mfbalin · 2023-07-14T03:39:17Z

@frozenbugs Partition had a minor bug caused by an off-by-one error in the parameters of the HistogramEven call. Tests should all pass now.

dgl-bot · 2023-07-14T03:40:33Z

Commit ID: 8915f8c868838cf41b01cae6097ba88c51302698

Build ID: 12

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

mfbalin · 2023-07-14T03:41:14Z

Actually, we might want to fix this bug in a separate PR, so that if there is any problem with the new thrust version (on the off chance), we can revert this PR while the bug fix stays. See #6001.

frozenbugs · 2023-07-14T03:45:15Z

> Actually, we might want to fix this bug in a separate PR, so that if there is any problem with the new thrust version (on the off chance), we can revert this PR while the bug fix stays.

Solid judgement :)

frozenbugs · 2023-07-14T03:49:49Z

> Partition had a minor bug caused by an off-by-one error in the parameters of the HistogramEven call. Tests should all pass now.

Thanks for fixing. Any reason this was not exposed without your optimization?

mfbalin · 2023-07-14T03:55:50Z

Say we are computing a histogram over the values 0 to 6 (7 bins in total). Without the fix, the bin range was [0, 8) (8 exclusive) instead of [0, 7). The documentation says that CUB distributes the values in the range evenly across the bins, but 8 is not divisible by 7. So without the fix and with the new CUB version, 0 and 1 both landed in the first bin. With the old CUB version, 6 and 7 both landed in the last bin, which did not expose the bug. I am speculating that the uneven range caused a rounding error and that a change in CUB's implementation exposed this bug.
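The collision can be reproduced with a small Python model. It assumes that a sample x in [lower, upper) is assigned to bin floor((x - lower) * num_bins / (upper - lower)), which is one natural reading of "evenly distributed". CUB's actual arithmetic may differ between versions, and `even_bin` is a hypothetical helper, not a CUB function.

```python
def even_bin(x, lower, upper, num_bins):
    """Bin index of x when [lower, upper) is split into num_bins equal-width bins."""
    return (x - lower) * num_bins // (upper - lower)


values = range(7)  # samples 0..6, one intended bin per value

# Off-by-one upper bound: 0 and 1 collide in bin 0 and bin 6 stays empty.
print([even_bin(x, 0, 8, 7) for x in values])  # [0, 0, 1, 2, 3, 4, 5]

# Corrected upper bound: every value gets its own bin.
print([even_bin(x, 0, 7, 7) for x in values])  # [0, 1, 2, 3, 4, 5, 6]
```

Under this model the bin width for [0, 8) is 8/7, so the first bin covers [0, 1.14...) and swallows both 0 and 1, which matches the behaviour described for the new CUB version.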

dgl-bot · 2023-07-14T04:37:37Z

Commit ID: 406ec17

Build ID: 13

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

mfbalin added 7 commits April 7, 2023 21:32

implement csr2coo with devicememcpy::batched

3ff3e54

switch to using DeviceCopy

093d24d

apply clang-format

5c395c7

make it work for 64 bit

431cd5e

Merge branch 'dmlc:master' into faster_csr_to_coo

0ae9884

update thrust version

5b3b8d7

Merge branch 'faster_csr_to_coo' of github.com:mfbalin/dgl-1 into faster_csr_to_coo

b999204

mfbalin changed the title ~~[CUDA] Faster csr to coo for~~ [CUDA] Faster csr to coo May 2, 2023

mfbalin changed the title ~~[CUDA] Faster csr to coo~~ [CUDA] Faster CSRToCOO May 2, 2023

mfbalin changed the title ~~[CUDA] Faster CSRToCOO~~ [Performance][CUDA] Faster CSRToCOO May 2, 2023

cleanup existing implementation

92b0b17

add libcudacxx to the cuda include directory

a43cca0

mfbalin force-pushed the faster_csr_to_coo branch from 855f011 to a43cca0 Compare May 4, 2023 05:03

czkkkkkk reviewed Jul 13, 2023

src/array/cuda/csr2coo.cu

czkkkkkk reviewed Jul 13, 2023

CMakeLists.txt

czkkkkkk approved these changes Jul 13, 2023

Merge branch 'master' into faster_csr_to_coo

02bdcb5

Merge branch 'master' into faster_csr_to_coo

befcf8a

fix partition bug

995b80f

Merge branch 'master' into faster_csr_to_coo

406ec17

Merge branch 'master' into faster_csr_to_coo

b1b1404

frozenbugs merged commit 8311579 into dmlc:master Jul 14, 2023
1 check passed

Rhett-Ying pushed a commit that referenced this pull request Aug 10, 2023

[Performance][CUDA] Faster CSRToCOO (#5648)

a72284d

Co-authored-by: Hongzhi (Steve), Chen <chenhongzhi.nkcs@gmail.com>

Rhett-Ying mentioned this pull request Aug 11, 2023

[Build] Failed to build labor sampling related code on windows_cu118 #6135

Closed

This was referenced Aug 11, 2023

[dev] fix build error on windows+cu118 #6139

Closed

[Dev] Change CXX standard to 17 #6138

Merged

Rhett-Ying added a commit that referenced this pull request Aug 14, 2023

Revert "[Performance][CUDA] Faster CSRToCOO (#5648)"

7ad7e29

This reverts commit a72284d.

Rhett-Ying added a commit that referenced this pull request Aug 14, 2023

[release] 1.1.x revert #5648 (#6150)

b64940b

DominikaJedynak pushed a commit to DominikaJedynak/dgl that referenced this pull request Mar 12, 2024

[Performance][CUDA] Faster CSRToCOO (dmlc#5648)

b32a0ee

Co-authored-by: Hongzhi (Steve), Chen <chenhongzhi.nkcs@gmail.com>
