cub::BlockReduce

Defined in cub/block/block_reduce.cuh

template<typename T, int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int LEGACY_PTX_ARCH = 0>
class BlockReduce

The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block.

Overview

  • A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • BlockReduce can be optionally specialized by algorithm to accommodate different latency/throughput workload profiles (a sketch of selecting an algorithm follows this list):

    1. cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY: An efficient “raking” reduction algorithm that only supports commutative reduction operators.

    2. cub::BLOCK_REDUCE_RAKING: An efficient “raking” reduction algorithm that supports commutative and non-commutative reduction operators.

    3. cub::BLOCK_REDUCE_WARP_REDUCTIONS: A quick “tiled warp-reductions” reduction algorithm that supports commutative and non-commutative reduction operators.
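
The ALGORITHM template parameter selects among these. Below is a minimal sketch (hypothetical kernel and buffer names) that explicitly chooses the raking algorithm instead of the default cub::BLOCK_REDUCE_WARP_REDUCTIONS:

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void RakingReduceKernel(const int* d_in, int* d_block_sums)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int,
    // explicitly selecting the raking algorithm
    using BlockReduce = cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_RAKING>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains one input item
    int thread_data = d_in[blockIdx.x * blockDim.x + threadIdx.x];

    // Compute the block-wide sum (valid only in thread0)
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
    if (threadIdx.x == 0)
    {
        d_block_sums[blockIdx.x] = aggregate;
    }
}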

Performance Considerations

  • Efficiency increases with greater granularity (ITEMS_PER_THREAD); performance typically continues to improve until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • Very efficient (only one synchronization barrier).

  • Incurs zero bank conflicts for most types.

  • Computation is slightly more efficient (i.e., has lower instruction overhead) for:
    - Summation (vs. generic reduction)
    - BLOCK_THREADS is a multiple of the architecture’s warp size
    - Every thread has a valid input (i.e., full vs. partial tiles)

  • See cub::BlockReduceAlgorithm for performance details regarding algorithmic alternatives

A Simple Example

Every thread in the block uses the BlockReduce class by first specializing the BlockReduce type, then instantiating an instance with the necessary temporary storage for communication, and finally invoking one or more collective member functions.

The code snippet below illustrates a sum reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}

Re-using dynamically allocated shared memory

The block/example_block_reduce_dyn_smem.cu example illustrates usage of dynamically allocated shared memory with BlockReduce and how to re-purpose the same memory region.
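
A minimal sketch of the underlying pattern (hypothetical kernel and buffer names; not the full repository example): the extern __shared__ buffer is aliased to BlockReduce::TempStorage, and the launch must supply at least sizeof(cub::BlockReduce<int, 128>::TempStorage) bytes of dynamic shared memory.

#include <cub/cub.cuh>

__global__ void DynSmemReduceKernel(const int* d_in, int* d_block_sums)
{
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Alias the dynamically allocated shared memory region to the opaque
    // TempStorage layout required by BlockReduce
    extern __shared__ char smem[];
    auto& temp_storage = *reinterpret_cast<typename BlockReduce::TempStorage*>(smem);

    // Each thread obtains one input item
    int thread_data = d_in[blockIdx.x * blockDim.x + threadIdx.x];

    // Compute the block-wide sum (valid only in thread0)
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
    if (threadIdx.x == 0)
    {
        d_block_sums[blockIdx.x] = aggregate;
    }
}

// Host-side launch: pass the TempStorage size as the dynamic shared memory argument
// DynSmemReduceKernel<<<num_blocks, 128,
//     sizeof(cub::BlockReduce<int, 128>::TempStorage)>>>(d_in, d_block_sums);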

Template Parameters
  • T – Data type being reduced

  • BLOCK_DIM_X – The thread block length in threads along the X dimension

  • ALGORITHM – [optional] cub::BlockReduceAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_REDUCE_WARP_REDUCTIONS)

  • BLOCK_DIM_Y – [optional] The thread block length in threads along the Y dimension (default: 1)

  • BLOCK_DIM_Z – [optional] The thread block length in threads along the Z dimension (default: 1)

  • LEGACY_PTX_ARCH – [optional] Unused.
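
For example, a two-dimensional thread block is specialized by supplying BLOCK_DIM_Y in addition to BLOCK_DIM_X; the sketch below (hypothetical kernel and buffer names) reduces across a 32x4 block, whose threads are linearly ranked in row-major order.

#include <cub/cub.cuh>

__global__ void Example2DKernel(const int* d_in, int* d_block_sums)
{
    // Specialize BlockReduce for a 2D block of 32x4 threads (128 threads total)
    using BlockReduce =
        cub::BlockReduce<int, 32, cub::BLOCK_REDUCE_WARP_REDUCTIONS, 4>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Row-major linear rank of this thread within the block
    int linear_tid  = threadIdx.y * blockDim.x + threadIdx.x;
    int thread_data = d_in[blockIdx.x * (blockDim.x * blockDim.y) + linear_tid];

    // Compute the block-wide sum (valid only in the thread with linear rank 0)
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
    if (linear_tid == 0)
    {
        d_block_sums[blockIdx.x] = aggregate;
    }
}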

Collective constructors

inline BlockReduce()

Collective constructor using a private static allocation of shared memory as temporary storage.

inline BlockReduce(TempStorage &temp_storage)

Collective constructor using the specified memory allocation as temporary storage.

Parameters

temp_storage – [in] Reference to memory allocation having layout type TempStorage
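
A minimal sketch contrasting the two constructors (hypothetical kernel and buffer names; the output buffer is assumed to hold two values per block):

#include <cub/cub.cuh>

__global__ void ConstructorExampleKernel(const int* d_in, int* d_out)
{
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Each thread obtains one input item
    int thread_data = d_in[blockIdx.x * blockDim.x + threadIdx.x];

    // Collective constructor with a private static shared memory allocation
    int sum_a = BlockReduce().Sum(thread_data);

    // Collective constructor with caller-provided temporary storage
    __shared__ typename BlockReduce::TempStorage temp_storage;
    int sum_b = BlockReduce(temp_storage).Sum(thread_data);

    // Both calls produce the same block-wide sum (valid only in thread0)
    if (threadIdx.x == 0)
    {
        d_out[2 * blockIdx.x]     = sum_a;
        d_out[2 * blockIdx.x + 1] = sum_b;
    }
}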

Generic reductions

template<typename ReductionOp>
inline T Reduce(T input, ReductionOp reduction_op)

Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.

  • The return value is undefined in threads other than thread0.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a max reduction of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item
    int thread_data;
    ...

    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cuda::maximum<>{});
}

Template Parameters

ReductionOp – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)

Parameters
  • input – [in] Calling thread’s input

  • reduction_op – [in] Binary reduction functor
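
Any type providing such an operator() may be supplied as reduction_op. A minimal sketch with a hypothetical user-defined functor (absolute maximum) and hypothetical kernel and buffer names:

#include <cub/cub.cuh>

// Hypothetical user-defined reduction functor: absolute maximum
struct AbsMax
{
    __host__ __device__ int operator()(const int& a, const int& b) const
    {
        int abs_a = a < 0 ? -a : a;
        int abs_b = b < 0 ? -b : b;
        return abs_a > abs_b ? abs_a : abs_b;
    }
};

__global__ void AbsMaxKernel(const int* d_in, int* d_block_maxima)
{
    using BlockReduce = cub::BlockReduce<int, 128>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains one input item
    int thread_data = d_in[blockIdx.x * blockDim.x + threadIdx.x];

    // Compute the block-wide absolute maximum (valid only in thread0)
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, AbsMax{});
    if (threadIdx.x == 0)
    {
        d_block_maxima[blockIdx.x] = aggregate;
    }
}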

template<int ITEMS_PER_THREAD, typename ReductionOp>
inline T Reduce(T (&inputs)[ITEMS_PER_THREAD], ReductionOp reduction_op)

Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.

  • The return value is undefined in threads other than thread0.

  • Efficiency increases with greater granularity (ITEMS_PER_THREAD); performance typically continues to improve until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a max reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cuda::maximum<>{});
}

Template Parameters
  • ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.

  • ReductionOp – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)

Parameters
  • inputs – [in] Calling thread’s input segment

  • reduction_op – [in] Binary reduction functor

template<typename ReductionOp>
inline T Reduce(T input, ReductionOp reduction_op, int num_valid)

Computes a block-wide reduction for thread0 using the specified binary reduction functor. The first num_valid threads each contribute one input element.

  • The return value is undefined in threads other than thread0.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a max reduction of a partially-full tile of integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(int num_valid, ...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item
    int thread_data;
    if (threadIdx.x < num_valid) thread_data = ...

    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cuda::maximum<>{}, num_valid);
}

Template Parameters

ReductionOp – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b)

Parameters
  • input – [in] Calling thread’s input

  • reduction_op – [in] Binary reduction functor

  • num_valid – [in] Number of threads containing valid elements (may be less than BLOCK_THREADS)

Summation reductions

inline T Sum(T input)

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.

  • The return value is undefined in threads other than thread0.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a sum reduction of 128 integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item
    int thread_data;
    ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}

Parameters

input – [in] Calling thread’s input

template<int ITEMS_PER_THREAD>
inline T Sum(T (&inputs)[ITEMS_PER_THREAD])

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.

  • The return value is undefined in threads other than thread0.

  • Efficiency increases with greater granularity (ITEMS_PER_THREAD); performance typically continues to improve until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a sum reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}

Template Parameters

ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.

Parameters

inputs – [in] Calling thread’s input segment

inline T Sum(T input, int num_valid)

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. The first num_valid threads each contribute one input element.

  • The return value is undefined in threads other than thread0.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates a sum reduction of a partially-full tile of integer items that are partitioned across 128 threads.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(int num_valid, ...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item (up to num_valid)
    int thread_data;
    if (threadIdx.x < num_valid)
        thread_data = ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data, num_valid);
}

Parameters
  • input – [in] Calling thread’s input

  • num_valid – [in] Number of threads containing valid elements (may be less than BLOCK_THREADS)

struct TempStorage : public Uninitialized<_TempStorage>

The operations exposed by BlockReduce require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.
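
For instance, a kernel that uses both BlockReduce and BlockScan can union their storage types so the two collectives share a single shared memory region, provided a __syncthreads() barrier separates the two uses. A minimal sketch of the pattern (hypothetical kernel and buffer names):

#include <cub/cub.cuh>

__global__ void SharedStorageKernel(const int* d_in, int* d_prefixes, int* d_block_sums)
{
    using BlockReduce = cub::BlockReduce<int, 128>;
    using BlockScan   = cub::BlockScan<int, 128>;

    // Union the opaque storage layouts so both collectives share one allocation
    __shared__ union
    {
        typename BlockReduce::TempStorage reduce;
        typename BlockScan::TempStorage   scan;
    } temp_storage;

    // Each thread obtains one input item
    int thread_data = d_in[blockIdx.x * blockDim.x + threadIdx.x];

    // Block-wide sum (valid only in thread0)
    int block_sum = BlockReduce(temp_storage.reduce).Sum(thread_data);

    // Barrier before repurposing the shared storage for the scan
    __syncthreads();

    // Block-wide exclusive prefix sum, reusing the same shared memory
    int prefix;
    BlockScan(temp_storage.scan).ExclusiveSum(thread_data, prefix);

    d_prefixes[blockIdx.x * blockDim.x + threadIdx.x] = prefix;
    if (threadIdx.x == 0)
    {
        d_block_sums[blockIdx.x] = block_sum;
    }
}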