cub::BlockReduce
Defined in cub/block/block_reduce.cuh
-
template<typename T, int BLOCK_DIM_X, BlockReduceAlgorithm ALGORITHM = BLOCK_REDUCE_WARP_REDUCTIONS, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int LEGACY_PTX_ARCH = 0>
class BlockReduce

The BlockReduce class provides collective methods for computing a parallel reduction of items partitioned across a CUDA thread block.
Overview
A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
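For example, a 16x16 (2D) block can be specialized by also supplying BLOCK_DIM_Y; the following is a minimal sketch (the block shape and kernel name are assumptions chosen for illustration):

#include <cub/cub.cuh>

__global__ void Example2DKernel(const int *d_in, int *d_out)
{
    // Specialize BlockReduce for a 16x16 block of type int; BLOCK_DIM_Y is
    // the fourth template parameter, after the (default) algorithm
    using BlockReduce = cub::BlockReduce<int, 16, cub::BLOCK_REDUCE_WARP_REDUCTIONS, 16>;

    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Threads are linearly ranked in row-major order across the 2D block
    int linear_tid  = threadIdx.y * 16 + threadIdx.x;
    int thread_data = d_in[blockIdx.x * 256 + linear_tid];

    // Compute the block-wide sum for thread0 (linear rank 0)
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);

    if (linear_tid == 0)
        d_out[blockIdx.x] = aggregate;
}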
BlockReduce can be optionally specialized by algorithm to accommodate different latency/throughput workload profiles (a selection sketch follows this list):

cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY
: An efficient “raking” reduction algorithm that only supports commutative reduction operators.

cub::BLOCK_REDUCE_RAKING
: An efficient “raking” reduction algorithm that supports commutative and non-commutative reduction operators.

cub::BLOCK_REDUCE_WARP_REDUCTIONS
: A quick “tiled warp-reductions” reduction algorithm that supports commutative and non-commutative reduction operators.
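Selecting an algorithm is just a matter of supplying the ALGORITHM template parameter; for instance, the commutative-only raking variant can be chosen for integer summation:

// Raking reduction restricted to commutative operators (e.g., integer +)
using BlockReduce = cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY>;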
Performance Considerations
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads (a combined sketch follows this list).

- Very efficient (only one synchronization barrier).
- Incurs zero bank conflicts for most types.
- Computation is slightly more efficient (i.e., having lower instruction overhead) for:
  - Summation (vs. generic reduction)
  - BLOCK_THREADS is a multiple of the architecture’s warp size
  - Every thread has a valid input (i.e., full vs. partial-tiles)
- See cub::BlockReduceAlgorithm for performance details regarding algorithmic alternatives.
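As a sketch of that gathering pattern (the kernel name, tile shape, and load algorithm are illustrative assumptions), cub::BlockLoad can stage a blocked arrangement before the reduction, with the two collectives union'd over the same shared memory:

#include <cub/cub.cuh>

__global__ void LoadAndReduceKernel(const int *d_in, int *d_out)
{
    // Specialize BlockLoad and BlockReduce for 128 threads owning 4 ints each
    using BlockLoad   = cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_TRANSPOSE>;
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Reuse one shared-memory allocation for both collectives
    __shared__ union {
        typename BlockLoad::TempStorage   load;
        typename BlockReduce::TempStorage reduce;
    } temp_storage;

    // Gather this block's 512-item tile in a blocked arrangement
    int thread_data[4];
    BlockLoad(temp_storage.load).Load(d_in + blockIdx.x * 512, thread_data);

    // Barrier before repurposing the union'd storage for the reduction
    __syncthreads();

    // Compute the block-wide sum for thread0 and write it out
    int aggregate = BlockReduce(temp_storage.reduce).Sum(thread_data);
    if (threadIdx.x == 0)
        d_out[blockIdx.x] = aggregate;
}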
A Simple Example
Every thread in the block uses the BlockReduce class by first specializing the BlockReduce type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
The code snippet below illustrates a sum reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}
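Because the return value is only defined in thread0, a follow-up broadcast through shared memory is needed whenever every thread must observe the aggregate; a minimal sketch continuing the example above:

// Publish thread0's result to the whole block
__shared__ int block_aggregate;
if (threadIdx.x == 0)
    block_aggregate = aggregate;

// The barrier makes the value visible to all threads
__syncthreads();

int total = block_aggregate; // now valid in every thread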
- Template Parameters
T – Data type being reduced
BLOCK_DIM_X – The thread block length in threads along the X dimension
ALGORITHM – [optional] cub::BlockReduceAlgorithm enumerator specifying the underlying algorithm to use (default: cub::BLOCK_REDUCE_WARP_REDUCTIONS)
BLOCK_DIM_Y – [optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z – [optional] The thread block length in threads along the Z dimension (default: 1)
LEGACY_PTX_ARCH – [optional] Unused.
Collective constructors
-
inline BlockReduce()
Collective constructor using a private static allocation of shared memory as temporary storage.
-
inline BlockReduce(TempStorage &temp_storage)
Collective constructor using the specified memory allocation as temporary storage.
- Parameters
temp_storage – [in] Reference to memory allocation having layout type TempStorage
Generic reductions
-
template<typename ReductionOp>
inline T Reduce(T input, ReductionOp reduction_op)

Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes one input element.
The return value is undefined in threads other than thread0.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a max reduction of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item
    int thread_data;
    ...

    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cuda::maximum<>{});
}
- Template Parameters
ReductionOp – [inferred] Binary reduction functor type having member T operator()(const T &a, const T &b) (see the functor sketch after this parameter list)
- Parameters
input – [in] Calling thread’s input
reduction_op – [in] Binary reduction functor
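Any functor (or lambda) with a compatible call operator can be supplied. As an illustration, a hypothetical user-defined operator computing the greatest common divisor, which is commutative and associative and therefore valid for any BlockReduce algorithm:

// Hypothetical custom reduction operator: greatest common divisor
struct GcdOp
{
    __device__ int operator()(const int &lhs, const int &rhs) const
    {
        // Euclid's algorithm
        int a = lhs, b = rhs;
        while (b != 0) { int t = a % b; a = b; b = t; }
        return a;
    }
};

...

int block_gcd = BlockReduce(temp_storage).Reduce(thread_data, GcdOp{});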
-
template<int ITEMS_PER_THREAD, typename ReductionOp>
inline T Reduce(T (&inputs)[ITEMS_PER_THREAD], ReductionOp reduction_op)

Computes a block-wide reduction for thread0 using the specified binary reduction functor. Each thread contributes an array of consecutive input elements.
The return value is undefined in threads other than thread0.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a max reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cuda::maximum<>{});
}
- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
ReductionOp – [inferred] Binary reduction functor type having member
T operator()(const T &a, const T &b)
- Parameters
inputs – [in] Calling thread’s input segment
reduction_op – [in] Binary reduction functor
-
template<typename ReductionOp>
inline T Reduce(T input, ReductionOp reduction_op, int num_valid)

Computes a block-wide reduction for thread0 using the specified binary reduction functor. The first num_valid threads each contribute one input element.

The return value is undefined in threads other than thread0.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a max reduction of a partially-full tile of integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(int num_valid, ...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item
    int thread_data;
    if (threadIdx.x < num_valid)
        thread_data = ...

    // Compute the block-wide max for thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cuda::maximum<>{}, num_valid);
}
- Template Parameters
ReductionOp – [inferred] Binary reduction functor type having member
T operator()(const T &a, const T &b)
- Parameters
input – [in] Calling thread’s input
reduction_op – [in] Binary reduction functor
num_valid – [in] Number of threads containing valid elements (may be less than BLOCK_THREADS)
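A typical use of num_valid is guarding the final, partially-full tile of an input whose length is not a multiple of the block size; a minimal sketch under that assumption (all names are illustrative):

#include <cub/cub.cuh>

__global__ void TiledMaxKernel(const int *d_in, int num_items, int *d_out)
{
    using BlockReduce = cub::BlockReduce<int, 128>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Number of valid items in this block's tile (128 for all full tiles)
    int tile_base = blockIdx.x * 128;
    int num_valid = min(128, num_items - tile_base);

    int thread_data;
    if (threadIdx.x < num_valid)
        thread_data = d_in[tile_base + threadIdx.x];

    // Only the first num_valid threads contribute; result valid in thread0
    int aggregate = BlockReduce(temp_storage).Reduce(thread_data, cuda::maximum<>{}, num_valid);

    if (threadIdx.x == 0)
        d_out[blockIdx.x] = aggregate;
}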
Summation reductions
-
inline T Sum(T input)
Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes one input element.
The return value is undefined in threads other than thread0.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a sum reduction of 128 integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item
    int thread_data;
    ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}
- Parameters
input – [in] Calling thread’s input
-
template<int ITEMS_PER_THREAD>
inline T Sum(T (&inputs)[ITEMS_PER_THREAD])

Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. Each thread contributes an array of consecutive input elements.
The return value is undefined in threads other than thread0.
Efficiency is increased with increased granularity ITEMS_PER_THREAD. Performance is also typically increased until the additional register pressure or shared memory allocation size causes SM occupancy to fall too low. Consider variants of cub::BlockLoad for efficiently gathering a blocked arrangement of elements across threads.

A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a sum reduction of 512 integer items that are partitioned in a blocked arrangement across 128 threads where each thread owns 4 consecutive items.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);
}
- Template Parameters
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.
- Parameters
inputs – [in] Calling thread’s input segment
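One common use of this overload is a single-pass grid reduction in which thread0 of each block atomically folds its block aggregate into a global total; a minimal sketch under that assumption (the caller must zero-initialize d_sum, and all names are illustrative):

#include <cub/cub.cuh>

__global__ void GridSumKernel(const int *d_in, int *d_sum)
{
    using BlockReduce = cub::BlockReduce<int, 128>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread loads 4 consecutive items of this block's 512-item tile
    int thread_data[4];
    int thread_base = blockIdx.x * 512 + threadIdx.x * 4;
    for (int i = 0; i < 4; ++i)
        thread_data[i] = d_in[thread_base + i];

    // Block-wide sum (valid in thread0 only)
    int aggregate = BlockReduce(temp_storage).Sum(thread_data);

    // thread0 folds the block aggregate into the global total
    if (threadIdx.x == 0)
        atomicAdd(d_sum, aggregate);
}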
-
inline T Sum(T input, int num_valid)
Computes a block-wide reduction for thread0 using addition (+) as the reduction operator. The first num_valid threads each contribute one input element.

The return value is undefined in threads other than thread0.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates a sum reduction of a partially-full tile of integer items that are partitioned across 128 threads.
#include <cub/cub.cuh> // or equivalently <cub/block/block_reduce.cuh>

__global__ void ExampleKernel(int num_valid, ...)
{
    // Specialize BlockReduce for a 1D block of 128 threads of type int
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Allocate shared memory for BlockReduce
    __shared__ typename BlockReduce::TempStorage temp_storage;

    // Each thread obtains an input item (up to num_valid)
    int thread_data;
    if (threadIdx.x < num_valid)
        thread_data = ...

    // Compute the block-wide sum for thread0
    int aggregate = BlockReduce(temp_storage).Sum(thread_data, num_valid);
}
- Parameters
input – [in] Calling thread’s input
num_valid – [in] Number of threads containing valid elements (may be less than BLOCK_THREADS)
-
struct TempStorage : public Uninitialized<_TempStorage>
The operations exposed by BlockReduce require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.
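For instance, a kernel can alias TempStorage onto dynamically allocated shared memory rather than declaring it statically; a minimal sketch (the extern-shared layout and names are assumptions of this example):

__global__ void ExternStorageKernel(const int *d_in, int *d_out)
{
    using BlockReduce = cub::BlockReduce<int, 128>;

    // Alias the opaque TempStorage onto dynamic shared memory, sized at
    // launch as sizeof(typename BlockReduce::TempStorage)
    extern __shared__ char smem[];
    auto &temp_storage = *reinterpret_cast<typename BlockReduce::TempStorage *>(smem);

    int thread_data = d_in[blockIdx.x * 128 + threadIdx.x];
    int aggregate   = BlockReduce(temp_storage).Sum(thread_data);

    if (threadIdx.x == 0)
        d_out[blockIdx.x] = aggregate;
}

// Launch with the storage size as the dynamic shared-memory argument:
// ExternStorageKernel<<<grid_size, 128, sizeof(typename cub::BlockReduce<int, 128>::TempStorage)>>>(d_in, d_out);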