CUB's histogram allocates per-thread-block privatized histograms in global memory. If the histogram comprises many bins, this approach requires extensive memory, ultimately exceeding available device memory. For high-cardinality histograms, we probably want to pursue a different strategy.
For instance, with 28854312 bins and a sample size of 28854312, this may require 55 GB of memory. That is, 240 * 28854312 * 8 = 55 400 279 040 bytes (thread blocks: 240, bins: 28854312, bytes per bin: 8). The 240 thread blocks may vary depending on your GPU.
Here's a reproducer that @leofang has kindly provided (🙏):
// Overload atomicAdd for long long, which CUB's histogram needs for a
// long long counter type; declare it before including the CUB header.
__device__ long long atomicAdd(long long *address, long long val) {
  return atomicAdd(reinterpret_cast<unsigned long long *>(address),
                   static_cast<unsigned long long>(val));
}

#include <cub/device/device_histogram.cuh>

#include <iostream>

int main() {
  using namespace cub;

  void *workspace = nullptr;
  size_t workspace_size = 0;

  typedef int h_sampleT;
  typedef double h_binT;

  void *input = nullptr;
  void *output = nullptr;
  int n_bins = 28854313;
  void *bins = nullptr;
  int n_samples = 28854312;

  // Calling with a null workspace only queries the required temporary
  // storage size; no histogram is actually computed here.
  DeviceHistogram::HistogramRange(
      workspace, workspace_size, static_cast<h_sampleT *>(input),
      static_cast<long long *>(output), n_bins, static_cast<h_binT *>(bins),
      n_samples, nullptr);

  std::cout << "workspace_size: " << workspace_size << std::endl;
  return 0;
}
An alternative approach for high cardinality histograms is to use a combination of DeviceRadixSort and DeviceRunLengthEncode. Here's an example outlining the algorithm: https://godbolt.org/z/4sn8859fM
As an alternative, I think we could fix this issue (and likely improve performance) by avoiding allocating per-CTA privatized histograms in global memory when each CTA's histogram doesn't fit in shared memory.
In that case, I believe it would be better to just allocate a single histogram in global memory and update it with atomics.
There seems to be a perf issue as well. For 2k bins I see about 10% of peak bandwidth; for 200k bins it's already 1-4% (I32 samples). For I64 samples these numbers are 2x higher, but it's still pretty low.