cub::BlockHistogram
Defined in cub/block/block_histogram.cuh
-
template<typename T, int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int LEGACY_PTX_ARCH = 0>
class BlockHistogram The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.
Overview
A histogram counts the number of observations that fall into each of the disjoint categories (known as bins).
The
T
type must be implicitly castable to an integer type.BlockHistogram expects each integral
input[i]
value to satisfy0 <= input[i] < BINS
. Values outside of this range result in undefined behavior.BlockHistogram can be optionally specialized to use different algorithms:
cub::BLOCK_HISTO_SORT
: Sorting followed by differentiation.cub::BLOCK_HISTO_ATOMIC
: Use atomic addition to update byte counts directly.
A Simple Example
Every thread in the block uses the BlockHistogram class by first specializing the BlockHistogram type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
The code snippet below illustrates a 256-bin histogram of 512 integer samples that are partitioned across 128 threads where each thread owns 4 samples.
#include <cub/cub.cuh> // or equivalently <cub/block/block_histogram.cuh> __global__ void ExampleKernel(...) { // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each using BlockHistogram = cub::BlockHistogram<unsigned char, 128, 4, 256>; // Allocate shared memory for BlockHistogram __shared__ typename BlockHistogram::TempStorage temp_storage; // Allocate shared memory for block-wide histogram bin counts __shared__ unsigned int smem_histogram[256]; // Obtain input samples per thread unsigned char data[4]; ... // Compute the block-wide histogram BlockHistogram(temp_storage).Histogram(data, smem_histogram);
Performance and Usage Considerations
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.All input values must fall between
[0, BINS)
, or behavior is undefined.The histogram output can be constructed in shared or device-accessible memory
See
cub::BlockHistogramAlgorithm
for performance details regarding algorithmic alternatives
- Template Parameters
T – The sample type being histogrammed (must be castable to an integer bin identifier)
BLOCK_DIM_X – The thread block length in threads along the X dimension
ITEMS_PER_THREAD – The number of items per thread
BINS – The number bins within the histogram
ALGORITHM – [optional] cub::BlockHistogramAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_HISTO_SORT)
BLOCK_DIM_Y – [optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z – [optional] The thread block length in threads along the Z dimension (default: 1)
LEGACY_PTX_ARCH – [optional] Unused.
Collective constructors
-
inline BlockHistogram()
Collective constructor using a private static allocation of shared memory as temporary storage.
-
inline BlockHistogram(TempStorage &temp_storage)
Collective constructor using the specified memory allocation as temporary storage.
- Parameters
temp_storage – [in] Reference to memory allocation having layout type TempStorage
Histogram operations
-
template<typename CounterT>
inline void InitHistogram(CounterT histogram[BINS]) Initialize the shared histogram counters to zero.
Snippet
The code snippet below illustrates a the initialization and update of a histogram of 512 integer samples that are partitioned across 128 threads where each thread owns 4 samples.
#include <cub/cub.cuh> // or equivalently <cub/block/block_histogram.cuh> __global__ void ExampleKernel(...) { // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each using BlockHistogram = cub::BlockHistogram<unsigned char, 128, 4, 256>; // Allocate shared memory for BlockHistogram __shared__ typename BlockHistogram::TempStorage temp_storage; // Allocate shared memory for block-wide histogram bin counts __shared__ unsigned int smem_histogram[256]; // Obtain input samples per thread unsigned char thread_samples[4]; ... // Initialize the block-wide histogram BlockHistogram(temp_storage).InitHistogram(smem_histogram); // Update the block-wide histogram BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram);
- Template Parameters
CounterT – [inferred] Histogram counter type
-
template<typename CounterT>
inline void Histogram(T (&items)[ITEMS_PER_THREAD], CounterT histogram[BINS]) Constructs a block-wide histogram in shared/device-accessible memory. Each thread contributes an array of input elements.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates a 256-bin histogram of 512 integer samples that are partitioned across 128 threads where each thread owns 4 samples.
#include <cub/cub.cuh> // or equivalently <cub/block/block_histogram.cuh> __global__ void ExampleKernel(...) { // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each using BlockHistogram = cub::BlockHistogram<unsigned char, 128, 4, 256>; // Allocate shared memory for BlockHistogram __shared__ typename BlockHistogram::TempStorage temp_storage; // Allocate shared memory for block-wide histogram bin counts __shared__ unsigned int smem_histogram[256]; // Obtain input samples per thread unsigned char thread_samples[4]; ... // Compute the block-wide histogram BlockHistogram(temp_storage).Histogram(thread_samples, smem_histogram);
- Template Parameters
CounterT – [inferred] Histogram counter type
- Parameters
items – [in] Calling thread’s input values to histogram
histogram – [out] Reference to shared/device-accessible memory histogram
-
template<typename CounterT>
inline void Composite(T (&items)[ITEMS_PER_THREAD], CounterT histogram[BINS]) Updates an existing block-wide histogram in shared/device-accessible memory. Each thread composites an array of input elements.
Efficiency is increased with increased granularity
ITEMS_PER_THREAD
. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants ofcub::BlockLoad
for efficiently gathering a blocked arrangement of elements across threads.A subsequent
__syncthreads()
threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g.,temp_storage
) is to be reused or repurposed.
Snippet
The code snippet below illustrates a the initialization and update of a histogram of 512 integer samples that are partitioned across 128 threads where each thread owns 4 samples.
#include <cub/cub.cuh> // or equivalently <cub/block/block_histogram.cuh> __global__ void ExampleKernel(...) { // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each using BlockHistogram = cub::BlockHistogram<unsigned char, 128, 4, 256>; // Allocate shared memory for BlockHistogram __shared__ typename BlockHistogram::TempStorage temp_storage; // Allocate shared memory for block-wide histogram bin counts __shared__ unsigned int smem_histogram[256]; // Obtain input samples per thread unsigned char thread_samples[4]; ... // Initialize the block-wide histogram BlockHistogram(temp_storage).InitHistogram(smem_histogram); // Update the block-wide histogram BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram);
- Template Parameters
CounterT – [inferred] Histogram counter type
- Parameters
items – [in] Calling thread’s input values to histogram
histogram – [out] Reference to shared/device-accessible memory histogram
-
struct TempStorage : public Uninitialized<_TempStorage>
The operations exposed by BlockHistogram require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the
__shared__
keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) orunion
’d with other storage allocation types to facilitate memory reuse.