cub::BlockReduce

Defined in cub/block/block_reduce.cuh

template<typename T, int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int LEGACY_PTX_ARCH = 0>
class BlockReduce

The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block.

Overview

  • A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • BlockReduce can be optionally specialized by algorithm to accommodate different latency/throughput workload profiles (a specialization sketch follows this list):

    1. cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY: An efficient “raking” reduction algorithm that only supports commutative reduction operators.

    2. cub::BLOCK_REDUCE_RAKING: An efficient “raking” reduction algorithm that supports commutative and non-commutative reduction operators.

    3. cub::BLOCK_REDUCE_WARP_REDUCTIONS: A quick “tiled warp-reductions” reduction algorithm that supports commutative and non-commutative reduction operators.
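For instance, an integer sum uses a commutative operator and can therefore opt into cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY via the third template parameter. A minimal sketch (the kernel and pointer names are illustrative):

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void CommutativeSumKernel(const int* d_in, int* d_out)
{
    // Explicitly select the commutative-only raking algorithm
    using BlockReduce = cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Integer addition is commutative, so this specialization is valid
    int aggregate = BlockReduce(temp_storage).Sum(d_in[threadIdx.x]);

    if (threadIdx.x == 0)
        *d_out = aggregate;
}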

Performance Considerations

  • Efficiency increases with the granularity ITEMS_PER_THREAD. Performance also typically increases until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • Very efficient (only one synchronization barrier).

  • Incurs zero bank conflicts for most types.

  • Computation is slightly more efficient (i.e., having lower instruction overhead) for:

    - Summation (vs. generic reduction)

    - BLOCK_THREADS is a multiple of the architecture’s warp size

    - Every thread has a valid input (i.e., full vs. partial-tiles)

  • See cub::BlockReduceAlgorithm for performance details regarding algorithmic alternatives

A Simple Example

Every thread in the block uses the BlockReduce class by first specializing the BlockReduce type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.

The code snippet below illustrates a sum reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}
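Because the aggregate is only defined in thread0, a kernel typically publishes it from that thread. A self-contained variant of the snippet above, assuming hypothetical device pointers d_in and d_out (one output per block):

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void BlockSumKernel(const int* d_in, int* d_out)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Blocked arrangement: each thread loads 4 consecutive items
    int thread_data[4];
    int base = (blockIdx.x * 128 + threadIdx.x) * 4;
    for (int i = 0; i < 4; ++i)
        thread_data[i] = d_in[base + i];

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);

    // Only thread0 holds the valid block-wide aggregate
    if (threadIdx.x == 0)
        d_out[blockIdx.x] = aggregate;
}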

Re-using dynamically allocated shared memory

The block/example_block_reduce_dyn_smem.cu example illustrates usage of dynamically allocated shared memory with BlockReduce and how to re-purpose the same memory region.
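A minimal sketch of the aliasing idea, assuming the caller passes the TempStorage size as the dynamic shared memory launch parameter (the kernel and pointer names are illustrative, not taken from that example):

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void DynSmemSumKernel(const int* d_in, int* d_out)
{
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Alias the dynamically allocated shared memory region to TempStorage
    extern __shared__ char smem[];
    auto& temp_storage = *reinterpret_cast<typename BlockReduce::TempStorage*>(smem);

    int aggregate = BlockReduce(temp_storage).Sum(d_in[threadIdx.x]);

    if (threadIdx.x == 0)
        *d_out = aggregate;
}

// Host side: pass the storage size as the third launch parameter, e.g.
// DynSmemSumKernel<<<1, 128, sizeof(cub::BlockReduce<int, 128>::TempStorage)>>>(d_in, d_out);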

Template Parameters
  • T – Data type being reduced

  • BLOCK_DIM_X – The thread block length in threads along the X dimension

  • ALGORITHM – [optional] cub::BlockReduceAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_REDUCE_WARP_REDUCTIONS)

  • BLOCK_DIM_Y – [optional] The thread block length in threads along the Y dimension (default: 1)

  • BLOCK_DIM_Z – [optional] The thread block length in threads along the Z dimension (default: 1; a 2D specialization sketch follows this list)

  • LEGACY_PTX_ARCH – [optional] Unused.
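A minimal sketch of a multi-dimensional specialization, here a 16×8 (128-thread) 2D block; the kernel and pointer names are illustrative:

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void Reduce2DKernel(const float* d_in, float* d_out)
{
    // 16x8 thread block; threads are linearly ranked in row-major order
    using BlockReduce = cub::BlockReduce<float, 16, cub::BLOCK_REDUCE_WARP_REDUCTIONS, 8>;

    __shared__ typename BlockReduce::TempStorage temp_storage;

    int linear_tid = threadIdx.y * 16 + threadIdx.x;
    float aggregate = BlockReduce(temp_storage).Sum(d_in[linear_tid]);

    if (linear_tid == 0)
        *d_out = aggregate;
}

// Launch with a matching block shape, e.g. Reduce2DKernel<<<1, dim3(16, 8)>>>(d_in, d_out);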

Collective constructors

inline BlockReduce()

Collective constructor using a private static allocation of shared memory as temporary storage.

inline BlockReduce(TempStorage &temp_storage)

Collective constructor using the specified memory allocation as temporary storage.

Parameters

temp_storage – [in] Reference to memory allocation having layout type TempStorage
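A minimal sketch contrasting the two constructors (the kernel is illustrative); the default constructor allocates its own static shared memory, so no TempStorage declaration is needed for it:

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void CtorExampleKernel(int* d_out)
{
    using BlockReduce = cub::BlockReduce<int, 128>;

    // (1) Private static shared memory allocation
    int sum_a = BlockReduce().Sum(1);

    // (2) Caller-provided temporary storage
    __shared__ typename BlockReduce::TempStorage temp_storage;
    int sum_b = BlockReduce(temp_storage).Sum(1);

    // Both aggregates (128 each here) are valid only in thread0
    if (threadIdx.x == 0)
        d_out[0] = sum_a + sum_b;
}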

Generic reductions

template<typename ReductionOp>
inline T Reduce(T input, ReductionOp reduction_op)

Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.

  • The return value is undefined in threads other than thread0.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a max reduction of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item
    int thread_data;
    ...

    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max());
}

Template Parameters

ReductionOp – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b) (a sketch of such a functor follows the parameter list below)

Parameters
  • input – [in] Calling thread’s input

  • reduction_op – [in] Binary reduction functor
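Any type with that call operator works, including user-defined functors. A minimal sketch using a hypothetical CustomMin functor:

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

// User-defined reduction functor with the required
// T operator()(const T &a, const T &b) member
struct CustomMin
{
    __device__ int operator()(const int& a, const int& b) const
    {
        return (b < a) ? b : a;
    }
};

__global__ void MinExampleKernel(const int* d_in, int* d_out)
{
    using BlockReduce = cub::BlockReduce<int, 128>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Compute the block-wide minimum for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(d_in[threadIdx.x], CustomMin{});

    if (threadIdx.x == 0)
        *d_out = aggregate;
}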

template<int ITEMS_PER_THREAD, typename ReductionOp>
inline T Reduce(T (&inputs)[ITEMS_PER_THREAD], ReductionOp reduction_op)

Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.

  • The return value is undefined in threads other than thread0.

  • Efficiency increases with the granularity ITEMS_PER_THREAD. Performance also typically increases until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a max reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max());
}

Template Parameters
  • ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.

  • ReductionOp – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)

Parameters
  • inputs – [in] Calling thread’s input segment

  • reduction_op – [in] Binary reduction functor

template<typename ReductionOp>
inline T Reduce(T input, ReductionOp reduction_op, int num_valid)

Computes a block-wide reduction for thread0 using the specified binary reduction functor. The first num_valid threads each contribute one input element.

  • The return value is undefined in threads other than thread0.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a max reduction of a partially-full tile of integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(int num_valid, ...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item
    int thread_data;
    if (threadIdx.x < num_valid) thread_data = ...

    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cub::Max(), num_valid);
}

Template Parameters

ReductionOp – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)

Parameters
  • input – [in] Calling thread’s input

  • reduction_op – [in] Binary reduction functor

  • num_valid – [in] Number of threads containing valid elements (may be less than BLOCK_THREADS)

Summation reductions

inline T Sum(T input)

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.

  • The return value is undefined in threads other than thread0.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a sum reduction of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item
    int thread_data;
    ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}

Parameters

input – [in] Calling thread’s input

template<int ITEMS_PER_THREAD>
inline T Sum(T (&inputs)[ITEMS_PER_THREAD])

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.

  • The return value is undefined in threads other than thread0.

  • Efficiency increases with the granularity ITEMS_PER_THREAD. Performance also typically increases until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a sum reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}

Template Parameters

ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.

Parameters

inputs – [in] Calling thread’s input segment

inline T Sum(T input, int num_valid)

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. The first num_valid threads each contribute one input element.

  • The return value is undefined in threads other than thread0.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a sum reduction of a partially-full tile of integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(int num_valid, ...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item (up to num_valid)
    int thread_data;
    if (threadIdx.x < num_valid)
        thread_data = ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data, num_valid);
}

Parameters
  • input – [in] Calling thread’s input

  • num_valid – [in] Number of threads containing valid elements (may be less than BLOCK_THREADS)

struct TempStorage : public Uninitialized<_TempStorage>

The operations exposed by BlockReduce require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.
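A minimal sketch of union-based reuse, combined with the __syncthreads() barrier that must precede any repurposing of the storage (the union layout and kernel are illustrative):

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ReuseExampleKernel(const int* d_in, int* d_out)
{
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Union the collective's storage with an unrelated scratch array
    __shared__ union
    {
        typename BlockReduce::TempStorage reduce;
        int scratch[128];
    } shared;

    int aggregate = BlockReduce(shared.reduce).Sum(d_in[threadIdx.x]);

    // Barrier before repurposing the storage as scratch space
    __syncthreads();
    shared.scratch[threadIdx.x] = d_in[threadIdx.x] * 2;
    __syncthreads();

    if (threadIdx.x == 0)
        *d_out = aggregate + shared.scratch[0];
}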