-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Performance][CUDA] Faster CSRToCOO #5648
Conversation
Not authorized to trigger CI. Please ask core developer to help trigger via issuing comment:
|
This comment was marked as outdated.
This comment was marked as outdated.
@BarclayII @yaox12, could you take a look? This PR requires using the latest thrust and cub versions. |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
|
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
It looks like the libcudacxx headers can not be found even though I added it to the cuda include directories, what else needs to be done so that those headers can be found? Error msg: Edit: added |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
I am not sure if this new code path or the existing int32_t specialization is faster. One would need to do a benchmark to see if the cusparse csr2coo or cub is better. However, for the int64_t specialization, CUB should outperform the existing implementation in all cases. Line 44 in 078be43
|
@mfbalin can you help us take a look at the failing test? |
@frozenbugs Partition had a minor bug due to the 1 off specification of histogramEven call parameters. Tests should all pass now. |
Actually, we might want to fix this bug in a separate PR so that if there is any problem with the new thrust version (on the off change), we can take this PR back but the bug fix is still stays. see: #6001 |
Solid judgement :) |
Thanks for fixing, any reason this was not exposed without your optimization? |
The histogram has bins, let's say we are computing a histogram for bins with values 0 to 6 (7 bins in total). Without the fix, the bin boundaries were [0, 8) (8 exclusive) vs [0, 7). Documentation says that CUB evenly distributed the values in the range to bins. But 8 doesn't divide 7. So without the fix, with the new update, 0 and 1 was in the first bin. Without the update 6 and 7 were in the last bin (Which didn't expose the bug). I am speculating that this caused a rounding error and a change in the implementation in CUB might have exposed this bug. |
Co-authored-by: Hongzhi (Steve), Chen <chenhongzhi.nkcs@gmail.com>
This reverts commit a72284d.
Co-authored-by: Hongzhi (Steve), Chen <chenhongzhi.nkcs@gmail.com>
Description
CUB recently implemented a DeviceCopy::Batched algorithm, which can be used to implement the DeviceRunLength::Decode algorithm, which in turn can be used to implement the CSRToCOO for the GPU in a more efficient way for the 64-bit code path. The current implementation suffers from O(E log N) complexity, which this PR reduces to O(E).
My motivation was that CSRToCOO is called inside the labor sampler algorithm, so improving the performance will also improve Labor Sampler and make its complexity truly linear in the number of edges.
My experimental evaluations show that the Labor Sampler runtime consistently improves with this change.
Checklist
Please feel free to remove inapplicable items for your PR.