
DeviceHistogram: add support for high cardinality bins without running OOM #912

Open
elstehle opened this issue Jan 9, 2023 · 2 comments
Labels: cub (For all items related to CUB)

elstehle (Collaborator) commented Jan 9, 2023

CUB's histogram allocates memory for per-thread-block privatized histograms in global memory. If the histogram comprises many bins, this approach requires a large amount of memory, ultimately exceeding available device memory. For high-cardinality histograms, we probably want to pursue a different strategy.

For instance, with 28854312 bins and a sample size of 28854312, this requires about 55 GB of memory: 240 * 28854312 * 8 = 55,400,279,040 bytes (thread blocks: 240, bins: 28854312, bytes per bin: 8). The number of thread blocks (240 here) may vary depending on your GPU.

Here's a reproducer that @leofang has kindly provided (🙏):

// Provide a signed 64-bit atomicAdd overload before including CUB,
// since the histogram uses atomicAdd on the long long counter type.
__device__ long long atomicAdd(long long *address, long long val) {
    return atomicAdd(reinterpret_cast<unsigned long long*>(address),
                     static_cast<unsigned long long>(val));
}
#include <cub/device/device_histogram.cuh>

#include <iostream>

int main() {
    using namespace cub;

    void* workspace = nullptr;
    size_t workspace_size = 0;
    typedef int h_sampleT;
    typedef double h_binT;
    void* input = nullptr;
    void* output = nullptr;
    int n_bins = 28854313;  // number of level boundaries, i.e. n_bins - 1 bins
    void* bins = nullptr;
    int n_samples = 28854312;
    // A call with a null workspace pointer only queries the required size.
    DeviceHistogram::HistogramRange(workspace, workspace_size, static_cast<h_sampleT*>(input),
                                    static_cast<long long*>(output), n_bins, static_cast<h_binT*>(bins), n_samples, nullptr);
    std::cout << "workspace_size: " << workspace_size << std::endl;
    return 0;
}

An alternative approach for high-cardinality histograms is to use a combination of DeviceRadixSort and DeviceRunLengthEncode. Here's an example outlining the algorithm:
https://godbolt.org/z/4sn8859fM

jrhemstad (Collaborator) commented:
As an alternative, I think we could fix this issue (and likely improve performance) by avoiding allocating per-CTA privatized histograms in global memory when each CTA's histogram doesn't fit in shared memory.

In that case, I believe it would be better to just allocate a single histogram in global memory and update it with atomics.

gevtushenko (Collaborator) commented:
There seems to be a perf issue as well. For 2k bins I see about 10% of peak bandwidth. For 200k bins it's already down to 1-4% (I32 samples). For I64 samples these numbers are 2x higher, but still pretty low.

@gevtushenko gevtushenko mentioned this issue Jul 25, 2023
2 tasks
@jarmak-nv jarmak-nv transferred this issue from NVIDIA/cub Nov 8, 2023
Projects
Status: Todo

4 participants