[Performance][CUDA] Faster CSRToCOO #5648

Merged
merged 15 commits into from
Jul 14, 2023

Conversation

mfbalin
Collaborator

@mfbalin mfbalin commented May 2, 2023

Description

CUB recently added a DeviceCopy::Batched algorithm. It can be used to implement DeviceRunLength::Decode, which in turn gives a more efficient GPU implementation of CSRToCOO for the 64-bit code path. The current implementation has O(E log N) complexity; this PR reduces it to O(E).
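
To illustrate the complexity difference, here is a minimal host-side sketch, not the CUDA code in this PR. It assumes the existing path finds each edge's row with a binary search over `indptr` (one way to arrive at O(E log N)) and models the new path as a run-length decode in which row `i` is repeated `indptr[i + 1] - indptr[i]` times. The function names `CsrRowsBinarySearch` and `CsrRowsRunLengthDecode` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed baseline: every edge binary-searches indptr for the row that
// contains it, which costs O(E log N) in total.
std::vector<int64_t> CsrRowsBinarySearch(const std::vector<int64_t>& indptr) {
  assert(!indptr.empty());
  const int64_t num_edges = indptr.back();
  std::vector<int64_t> row(num_edges);
  for (int64_t e = 0; e < num_edges; ++e) {
    // First offset strictly greater than e; the row is the one before it.
    auto it = std::upper_bound(indptr.begin(), indptr.end(), e);
    row[e] = (it - indptr.begin()) - 1;
  }
  return row;
}

// Run-length decode: row i is written (indptr[i + 1] - indptr[i]) times.
// Each output element is written exactly once, so the cost is O(N + E).
std::vector<int64_t> CsrRowsRunLengthDecode(const std::vector<int64_t>& indptr) {
  assert(!indptr.empty());
  std::vector<int64_t> row(indptr.back());
  for (size_t i = 0; i + 1 < indptr.size(); ++i) {
    std::fill(row.begin() + indptr[i], row.begin() + indptr[i + 1],
              static_cast<int64_t>(i));
  }
  return row;
}
```

Both functions return the COO row array; the COO column array is the CSR `indices` array unchanged, so it is not shown.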

My motivation is that CSRToCOO is called inside the Labor Sampler algorithm, so speeding it up also speeds up the Labor Sampler and makes its complexity truly linear in the number of edges.

My experimental evaluations show that the Labor Sampler runtime consistently improves with this change.

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; see the Google engineering practices (a CL is equivalent to a PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@dgl-bot
Collaborator

dgl-bot commented May 2, 2023

Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:

  • @dgl-bot

@mfbalin mfbalin changed the title from "[CUDA] Faster csr to coo for" to "[CUDA] Faster csr to coo" May 2, 2023
@mfbalin
Collaborator Author

mfbalin commented May 2, 2023

@BarclayII @yaox12, could you take a look? This PR requires using the latest thrust and cub versions.

@mfbalin mfbalin changed the title from "[CUDA] Faster csr to coo" to "[CUDA] Faster CSRToCOO" May 2, 2023
@mfbalin mfbalin changed the title from "[CUDA] Faster CSRToCOO" to "[Performance][CUDA] Faster CSRToCOO" May 2, 2023
@yaox12
Collaborator

yaox12 commented May 4, 2023

@mfbalin

  1. Has this feature been included in any CUB release version?
  2. CUB 2.x introduced the libcu++ dependency. Do you know if it's compatible with CUDA 10.2, since DGL still supports it?

@yaox12
Collaborator

yaox12 commented May 4, 2023

@dgl-bot

@mfbalin
Collaborator Author

mfbalin commented May 4, 2023

@yaox12

  1. Not yet. I recently contributed DeviceCopy to CUB in this PR, and I expect it to be included in the next release, though I don't know the timeline for that release.
  2. It looks like thrust includes cub and libcu++ as submodule dependencies. However, I don't know whether the libcu++ version that comes with thrust is compatible with CUDA 10.2. What I was able to find is that libcu++ was first released for CUDA 10.2; I don't know whether later releases dropped CUDA 10.2 support.

@dgl-bot
Collaborator

dgl-bot commented May 4, 2023

Commit ID: 92b0b17

Build ID: 3

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

@yaox12
Collaborator

yaox12 commented May 4, 2023

@dgl-bot

@dgl-bot
Collaborator

dgl-bot commented May 4, 2023

Commit ID: 855f011

Build ID: 5

Status: ❌ CI test failed in Stage [GPU Build].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented May 4, 2023

It looks like the libcudacxx headers cannot be found even though I added them to the CUDA include directories. What else needs to be done so that those headers can be found?

Error msg: /root/jenkins/workspace/dgl_PR-5648@2/third_party/thrust/thrust/detail/type_traits.h:27:10: fatal error: cuda/std/type_traits: No such file or directory

Edit: I added include to the include path for libcudacxx. I hope it will work now.

@yaox12
Collaborator

yaox12 commented May 4, 2023

@dgl-bot

@dgl-bot
Collaborator

dgl-bot commented Jul 12, 2023

Commit ID: 078be43

Build ID: 8

Status: ❌ CI test failed in Stage [Authentication].

Report path: link

Full logs path: link

@yaox12
Collaborator

yaox12 commented Jul 12, 2023

@dgl-bot

@dgl-bot
Collaborator

dgl-bot commented Jul 12, 2023

Commit ID: 078be43

Build ID: 9

Status: ❌ CI test failed in Stage [C++ GPU].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jul 13, 2023

I am not sure whether this new code path or the existing int32_t specialization is faster. One would need to benchmark the cuSPARSE csr2coo against CUB to see which is better. However, for the int64_t specialization, CUB should outperform the existing implementation in all cases.
cuSPARSE code path:

CUSPARSE_CALL(cusparseXcsr2coo(

@dgl-bot
Collaborator

dgl-bot commented Jul 13, 2023

Commit ID: 02bdcb5

Build ID: 10

Status: ❌ CI test failed in Stage [C++ GPU].

Report path: link

Full logs path: link

@frozenbugs
Collaborator

@mfbalin can you help us take a look at the failing test?

@dgl-bot
Collaborator

dgl-bot commented Jul 14, 2023

Commit ID: befcf8a

Build ID: 11

Status: ❌ CI test failed in Stage [C++ GPU].

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jul 14, 2023

@frozenbugs Partition had a minor bug caused by an off-by-one error in the parameters of the histogramEven call. Tests should all pass now.

@dgl-bot
Collaborator

dgl-bot commented Jul 14, 2023

Commit ID: 8915f8c868838cf41b01cae6097ba88c51302698

Build ID: 12

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@mfbalin
Collaborator Author

mfbalin commented Jul 14, 2023

Actually, we might want to fix this bug in a separate PR, so that if there is any problem with the new thrust version (on the off chance), we can revert this PR while the bug fix stays. See: #6001

@frozenbugs
Collaborator

Actually, we might want to fix this bug in a separate PR, so that if there is any problem with the new thrust version (on the off chance), we can revert this PR while the bug fix stays.

Solid judgement :)

@frozenbugs
Collaborator

Partition had a minor bug caused by an off-by-one error in the parameters of the histogramEven call. Tests should all pass now.

Thanks for fixing, any reason this was not exposed without your optimization?

@mfbalin
Collaborator Author

mfbalin commented Jul 14, 2023

A histogram has bins. Say we are computing a histogram over the values 0 to 6 (7 bins in total). Without the fix, the range passed to CUB was [0, 8) (8 exclusive) instead of [0, 7). The documentation says that CUB distributes the values in the range evenly across the bins, but a range of 8 does not divide evenly into 7 bins. So without the fix, with the new update, 0 and 1 were in the first bin. Without the update, 6 and 7 were in the last bin (which didn't expose the bug). I am speculating that this caused a rounding error, and that a change in CUB's implementation might have exposed the bug.
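
The arithmetic above can be checked with a small model. This is only a sketch: it assumes bins are assigned by integer scaling of a sample's offset within the range, and `EvenBinIndex` is a hypothetical helper, not CUB's actual code. CUB's real rounding may differ between versions, which is the speculation in the comment.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of even binning: the half-open range [lower, upper) is
// split into num_bins equal-width bins, and the bin index of x is found by
// integer scaling. This is an assumption for illustration, not CUB's code.
int64_t EvenBinIndex(int64_t x, int64_t lower, int64_t upper, int64_t num_bins) {
  assert(num_bins > 0 && lower <= x && x < upper);
  return (x - lower) * num_bins / (upper - lower);
}
```

Under this model, 7 bins over [0, 8) put both 0 and 1 in bin 0, because 1 * 7 / 8 truncates to 0, while 7 bins over [0, 7) give every value from 0 to 6 its own bin.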

@dgl-bot
Collaborator

dgl-bot commented Jul 14, 2023

Commit ID: 406ec17

Build ID: 13

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@frozenbugs frozenbugs merged commit 8311579 into dmlc:master Jul 14, 2023
1 check passed
Rhett-Ying pushed a commit that referenced this pull request Aug 10, 2023
Co-authored-by: Hongzhi (Steve), Chen <chenhongzhi.nkcs@gmail.com>
Rhett-Ying added a commit that referenced this pull request Aug 14, 2023
Rhett-Ying added a commit that referenced this pull request Aug 14, 2023
DominikaJedynak pushed a commit to DominikaJedynak/dgl that referenced this pull request Mar 12, 2024
Co-authored-by: Hongzhi (Steve), Chen <chenhongzhi.nkcs@gmail.com>
Labels
pr: Suspended PR status
6 participants