CUB's histogram allocates per-thread-block privatized histograms in global memory. If the histogram comprises many bins, this approach requires extensive memory, ultimately exceeding available device memory. For high-cardinality histograms, we probably want to pursue a different strategy.
For instance, with 28854312 bins and a sample size of 28854312, this may require 55 GB of memory. That is, 240 * 28854312 * 8 = 55 400 279 040 bytes (thread blocks: 240, bins: 28854312, bytes per bin: 8). The 240 thread blocks may vary depending on your GPU.
Here's a reproducer that @leofang has kindly provided (🙏):
// Overload atomicAdd for long long, which CUB's histogram needs for a
// long long counter type; declare it before including the CUB header.
__device__ long long atomicAdd(long long *address, long long val) {
  return atomicAdd(reinterpret_cast<unsigned long long *>(address),
                   static_cast<unsigned long long>(val));
}

#include <cub/device/device_histogram.cuh>

#include <iostream>

int main() {
  using namespace cub;

  void *workspace = nullptr;
  size_t workspace_size = 0;

  typedef int h_sampleT;
  typedef double h_binT;

  void *input = nullptr;
  void *output = nullptr;
  int n_bins = 28854313;
  void *bins = nullptr;
  int n_samples = 28854312;

  // Calling with a null workspace only queries the required temporary
  // storage size; no histogram is actually computed here.
  DeviceHistogram::HistogramRange(
      workspace, workspace_size, static_cast<h_sampleT *>(input),
      static_cast<long long *>(output), n_bins, static_cast<h_binT *>(bins),
      n_samples, nullptr);

  std::cout << "workspace_size: " << workspace_size << std::endl;
  return 0;
}
An alternative approach for high cardinality histograms is to use a combination of DeviceRadixSort and DeviceRunLengthEncode. Here's an example outlining the algorithm: https://godbolt.org/z/4sn8859fM
As an alternative, I think we could fix this issue (and likely improve performance) by avoiding allocating per-CTA privatized histograms in global memory when each CTA's histogram doesn't fit in shared memory.
In that case, I believe it would be better to just allocate a single histogram in global memory and update it with atomics.
There seems to be a perf issue as well. For 2k bins I see about 10% of peak bandwidth; for 200k bins it's already 1-4% (I32 samples). For I64 samples these numbers are 2x higher, but it's still pretty low.