cub::BlockHistogram

Defined in cub/block/block_histogram.cuh

template<typename T, int BLOCK_DIM_X, int ITEMS_PER_THREAD, int BINS, BlockHistogramAlgorithm ALGORITHM = BLOCK_HISTO_SORT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int LEGACY_PTX_ARCH = 0>
class BlockHistogram

The BlockHistogram class provides collective methods for constructing block-wide histograms from data samples partitioned across a CUDA thread block.

Overview

  • A histogram counts the number of observations that fall into each of the disjoint categories (known as bins).

  • The T type must be implicitly castable to an integer type.

  • BlockHistogram expects each integral input[i] value to satisfy 0 <= input[i] < BINS. Values outside of this range result in undefined behavior.

  • BlockHistogram can be optionally specialized to use different algorithms:

    1. cub::BLOCK_HISTO_SORT: Sorting followed by differentiation.

    2. cub::BLOCK_HISTO_ATOMIC: Use atomic addition to update byte counts directly.
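When atomic updates suit the workload (for example, lightly contended bins), the non-default algorithm can be requested through the ALGORITHM template parameter. The sketch below is illustrative only; it mirrors the 128-thread, 4-sample, 256-bin configuration used throughout this page, with cub::BLOCK_HISTO_ATOMIC substituted for the default.

```cuda
#include <cub/cub.cuh>

__global__ void AtomicHistogramKernel(...)
{
    // Same 256-bin specialization, but with atomic updates instead of the
    // default sort-based algorithm
    using BlockHistogram =
        cub::BlockHistogram<unsigned char, 128, 4, 256, cub::BLOCK_HISTO_ATOMIC>;

    // Allocate shared memory for BlockHistogram
    __shared__ typename BlockHistogram::TempStorage temp_storage;

    // Allocate shared memory for block-wide histogram bin counts
    __shared__ unsigned int smem_histogram[256];

    // Obtain input samples per thread
    unsigned char thread_samples[4];
    ...

    // Compute the block-wide histogram
    BlockHistogram(temp_storage).Histogram(thread_samples, smem_histogram);
}
```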

A Simple Example

Every thread in the block uses the BlockHistogram class by first specializing the BlockHistogram type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.

The code snippet below illustrates a 256-bin histogram of 512 character samples partitioned across 128 threads, where each thread owns 4 samples.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_histogram.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each
    using BlockHistogram = cub::BlockHistogram<unsigned char, 128, 4, 256>;

    // Allocate shared memory for BlockHistogram
    __shared__ typename BlockHistogram::TempStorage temp_storage;

    // Allocate shared memory for block-wide histogram bin counts
    __shared__ unsigned int smem_histogram[256];

    // Obtain input samples per thread
    unsigned char data[4];
    ...

    // Compute the block-wide histogram
    BlockHistogram(temp_storage).Histogram(data, smem_histogram);
}

Performance and Usage Considerations

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • All input values must fall within the range [0, BINS); values outside this range result in undefined behavior.

  • The histogram output can be constructed in shared or device-accessible memory.

  • See cub::BlockHistogramAlgorithm for performance details regarding algorithmic alternatives.
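As a sketch of the cub::BlockLoad suggestion above, the kernel below gathers a blocked arrangement of samples before histogramming. The d_samples pointer and the BLOCK_LOAD_VECTORIZE variant are illustrative assumptions; any BlockLoad specialization with matching dimensions would fit.

```cuda
#include <cub/cub.cuh>

__global__ void LoadAndHistogramKernel(unsigned char *d_samples)
{
    using BlockLoad      = cub::BlockLoad<unsigned char, 128, 4, cub::BLOCK_LOAD_VECTORIZE>;
    using BlockHistogram = cub::BlockHistogram<unsigned char, 128, 4, 256>;

    // Union the two collectives' storage to conserve shared memory
    __shared__ union {
        typename BlockLoad::TempStorage      load;
        typename BlockHistogram::TempStorage histogram;
    } temp_storage;

    // Allocate shared memory for block-wide histogram bin counts
    __shared__ unsigned int smem_histogram[256];

    // Gather this block's tile of 512 samples (4 per thread, blocked arrangement)
    unsigned char thread_samples[4];
    BlockLoad(temp_storage.load).Load(d_samples + blockIdx.x * 512, thread_samples);

    // Barrier before repurposing the union'd temporary storage
    __syncthreads();

    // Compute the block-wide histogram
    BlockHistogram(temp_storage.histogram).Histogram(thread_samples, smem_histogram);
}
```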

Re-using dynamically allocated shared memory

The block/example_block_reduce_dyn_smem.cu example illustrates usage of dynamic shared memory with BlockReduce and how to re-purpose the same memory region. This example can easily be adapted to the storage required by BlockHistogram.
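A minimal sketch of that adaptation, assuming the kernel is launched with at least sizeof(typename BlockHistogram::TempStorage) bytes of dynamic shared memory:

```cuda
#include <cub/cub.cuh>

__global__ void DynSmemHistogramKernel(...)
{
    using BlockHistogram = cub::BlockHistogram<unsigned char, 128, 4, 256>;

    // Alias the collective's temporary storage onto the dynamic shared memory
    // region supplied at launch
    extern __shared__ char smem[];
    auto &temp_storage =
        *reinterpret_cast<typename BlockHistogram::TempStorage *>(smem);

    // Allocate shared memory for block-wide histogram bin counts
    __shared__ unsigned int smem_histogram[256];

    // Obtain input samples per thread
    unsigned char thread_samples[4];
    ...

    // Compute the block-wide histogram
    BlockHistogram(temp_storage).Histogram(thread_samples, smem_histogram);
}
```

After a __syncthreads() barrier, the same region can be re-purposed for other collectives, as the linked example does with BlockReduce.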

Template Parameters
  • T – The sample type being histogrammed (must be castable to an integer bin identifier)

  • BLOCK_DIM_X – The thread block length in threads along the X dimension

  • ITEMS_PER_THREAD – The number of items per thread

  • BINS – The number of bins within the histogram

  • ALGORITHM – [optional] cub::BlockHistogramAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_HISTO_SORT)

  • BLOCK_DIM_Y – [optional] The thread block length in threads along the Y dimension (default: 1)

  • BLOCK_DIM_Z – [optional] The thread block length in threads along the Z dimension (default: 1)

  • LEGACY_PTX_ARCH – [optional] Unused.

Collective constructors

inline BlockHistogram()

Collective constructor using a private static allocation of shared memory as temporary storage.

inline BlockHistogram(TempStorage &temp_storage)

Collective constructor using the specified memory allocation as temporary storage.

Parameters

temp_storage[in] Reference to memory allocation having layout type TempStorage

Histogram operations

template<typename CounterT>
inline void InitHistogram(CounterT histogram[BINS])

Initialize the shared histogram counters to zero.

Snippet

The code snippet below illustrates the initialization and update of a histogram of 512 character samples partitioned across 128 threads, where each thread owns 4 samples.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_histogram.cuh>

__global__ void ExampleKernel(...)
{
  // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each
  using BlockHistogram = cub::BlockHistogram<unsigned char, 128, 4, 256>;

  // Allocate shared memory for BlockHistogram
  __shared__ typename BlockHistogram::TempStorage temp_storage;

  // Allocate shared memory for block-wide histogram bin counts
  __shared__ unsigned int smem_histogram[256];

  // Obtain input samples per thread
  unsigned char thread_samples[4];
  ...

  // Initialize the block-wide histogram
  BlockHistogram(temp_storage).InitHistogram(smem_histogram);

  // Update the block-wide histogram
  BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram);
}

Template Parameters

CounterT[inferred] Histogram counter type

template<typename CounterT>
inline void Histogram(T (&items)[ITEMS_PER_THREAD], CounterT histogram[BINS])

Constructs a block-wide histogram in shared/device-accessible memory. Each thread contributes an array of input elements.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a 256-bin histogram of 512 character samples partitioned across 128 threads, where each thread owns 4 samples.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_histogram.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each
    using BlockHistogram = cub::BlockHistogram<unsigned char, 128, 4, 256>;

    // Allocate shared memory for BlockHistogram
    __shared__ typename BlockHistogram::TempStorage temp_storage;

    // Allocate shared memory for block-wide histogram bin counts
    __shared__ unsigned int smem_histogram[256];

    // Obtain input samples per thread
    unsigned char thread_samples[4];
    ...

    // Compute the block-wide histogram
    BlockHistogram(temp_storage).Histogram(thread_samples, smem_histogram);
}

Template Parameters

CounterT[inferred] Histogram counter type

Parameters
  • items[in] Calling thread’s input values to histogram

  • histogram[out] Reference to shared/device-accessible memory histogram

template<typename CounterT>
inline void Composite(T (&items)[ITEMS_PER_THREAD], CounterT histogram[BINS])

Updates an existing block-wide histogram in shared/device-accessible memory. Each thread composites an array of input elements.

  • Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates the initialization and update of a histogram of 512 character samples partitioned across 128 threads, where each thread owns 4 samples.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_histogram.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize a 256-bin BlockHistogram type for a 1D block of 128 threads having 4 character samples each
    using BlockHistogram = cub::BlockHistogram<unsigned char, 128, 4, 256>;

    // Allocate shared memory for BlockHistogram
    __shared__ typename BlockHistogram::TempStorage temp_storage;

    // Allocate shared memory for block-wide histogram bin counts
    __shared__ unsigned int smem_histogram[256];

    // Obtain input samples per thread
    unsigned char thread_samples[4];
    ...

    // Initialize the block-wide histogram
    BlockHistogram(temp_storage).InitHistogram(smem_histogram);

    // Update the block-wide histogram
    BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram);
}

Template Parameters

CounterT[inferred] Histogram counter type

Parameters
  • items[in] Calling thread’s input values to histogram

  • histogram[out] Reference to shared/device-accessible memory histogram
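InitHistogram followed by repeated Composite calls lets a block accumulate more data than a single tile holds. The sketch below is illustrative; num_tiles and the per-tile sample loads (elided as in the snippets above) are assumptions.

```cuda
#include <cub/cub.cuh>

__global__ void MultiTileHistogramKernel(int num_tiles)
{
    using BlockHistogram = cub::BlockHistogram<unsigned char, 128, 4, 256>;

    // Allocate shared memory for BlockHistogram and the bin counts
    __shared__ typename BlockHistogram::TempStorage temp_storage;
    __shared__ unsigned int smem_histogram[256];

    // Zero the bin counts once
    BlockHistogram(temp_storage).InitHistogram(smem_histogram);

    for (int tile = 0; tile < num_tiles; ++tile)
    {
        // Obtain this tile's samples per thread
        unsigned char thread_samples[4];
        ...

        // Barrier before reusing the collective's temporary storage
        __syncthreads();

        // Fold this tile into the existing bin counts
        BlockHistogram(temp_storage).Composite(thread_samples, smem_histogram);
    }
}
```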

struct TempStorage : public Uninitialized<_TempStorage>

The operations exposed by BlockHistogram require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.