cub::ThreadReduce#
template <typename Input,
          typename ReductionOp,
          typename ValueT = ::cuda::std::iter_value_t<Input>,
          typename AccumT = ::cuda::std::__accumulator_t<ReductionOp, ValueT>>
AccumT cub::ThreadReduce(const Input& input, ReductionOp reduction_op)
Reduction over statically-sized array-like types.
The ThreadReduce function computes a reduction of items assigned to a single CUDA thread.

Overview#
A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.
Supports array-like types that are statically-sized and can be indexed with the [] operator: raw arrays, std::array, std::span, and std::mdspan (C++23).
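A minimal sketch of reducing different array-like inputs. It uses cuda::std::array and cuda::std::span as device-friendly stand-ins for their std counterparts; the kernel name and output parameter are hypothetical:

#include <cub/cub.cuh>
#include <cuda/std/array>
#include <cuda/std/span>

__global__ void ArrayLikeKernel(int* out) // hypothetical kernel
{
    // Raw array: the size is part of the type
    int raw[4] = {1, 2, 3, 4};

    // Statically-sized container with operator[] and a constexpr size()
    ::cuda::std::array<int, 4> arr = {5, 6, 7, 8};

    // Fixed-extent span viewing the raw array
    ::cuda::std::span<int, 4> spn{raw};

    out[0] = cub::ThreadReduce(raw, ::cuda::std::plus<>{}); // 10
    out[1] = cub::ThreadReduce(arr, ::cuda::std::plus<>{}); // 26
    out[2] = cub::ThreadReduce(spn, ::cuda::std::plus<>{}); // 10
}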
Main Function and Overloading#
Two overloads are provided: a plain reduction over statically-sized array-like types, and one seeded with a specified prefix (a usage sketch follows the declarations):
template <typename Input,
          typename ReductionOp,
          typename ValueT = ..., // type of a single input element
          typename AccumT = ...> // accumulator type
[[nodiscard]] __device__ __forceinline__
AccumT ThreadReduce(const Input& input, ReductionOp reduction_op)
template <typename Input,
          typename ReductionOp,
          typename PrefixT,
          typename ValueT = ..., // type of a single input element
          typename AccumT = ...> // accumulator type
[[nodiscard]] __device__ __forceinline__
AccumT ThreadReduce(const Input& input, ReductionOp reduction_op, PrefixT prefix)
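A minimal sketch of the prefix-seeded overload (the kernel name and output parameter are hypothetical); the prefix participates in the accumulation as an initial value:

#include <cub/cub.cuh>

__global__ void PrefixKernel(int* out) // hypothetical kernel
{
    int array[4] = {1, 2, 3, 4};
    // The prefix 100 seeds the accumulation: 100 + 1 + 2 + 3 + 4
    out[0] = cub::ThreadReduce(array, ::cuda::std::plus<>{}, 100); // 110
}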
Performance Considerations#
The function provides the following optimizations (a usage sketch follows the list):
Vectorization/SIMD for:

- Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM90+ for int16_t/uint16_t data types (Hopper DPX instructions)
- Sum (cuda::std::plus<>) and Multiplication (cuda::std::multiplies<>) on SM80+ for the __nv_bfloat16 data type
- Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM80+ for __half/__nv_bfloat16 data types
- Sum (cuda::std::plus<>) and Multiplication (cuda::std::multiplies<>) on SM70+ for the __half data type
Instruction-Level Parallelism (ILP) by exploiting a ternary tree reduction for:
- Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM90+ for int32_t/uint32_t data types (Hopper DPX instructions)
- Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM90+ for integer data types (Hopper DPX instructions), as well as __half2, __nv_bfloat162, __half (after vectorization), and __nv_bfloat16 (after vectorization) data types
- Sum (cuda::std::plus<>), Bitwise AND (cuda::std::bit_and<>), OR (cuda::std::bit_or<>), and XOR (cuda::std::bit_xor<>) on SM50+ for integer data types
Instruction-Level Parallelism (ILP) by exploiting a binary tree reduction for:

- All other cases that map to predefined operators
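These optimizations are selected automatically from the operator, the element type, and the target architecture; the call site is identical everywhere. A minimal sketch, with a hypothetical kernel, of a minimum reduction that can map to Hopper DPX instructions on SM90+:

#include <cub/cub.cuh>
#include <cuda/functional> // cuda::minimum

__global__ void MinKernel(int* out) // hypothetical kernel
{
    // On SM90+ this minimum reduction over int32_t can use DPX
    // instructions; on older architectures it falls back to a tree
    // reduction of ordinary min operations.
    int array[4] = {7, 3, 9, 1};
    *out = cub::ThreadReduce(array, ::cuda::minimum<>{}); // *out = 1
}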
Simple Example#
The code snippet below illustrates a simple sum reduction over 4 integer values.
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    int array[4] = {1, 2, 3, 4};
    int sum = cub::ThreadReduce(array, ::cuda::std::plus<>{}); // sum = 10
}
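User-defined operators are also supported; they take the binary tree reduction path described above. A minimal sketch with a hypothetical AbsMax functor:

#include <cub/cub.cuh>

// Hypothetical operator: keep the value with the larger magnitude
struct AbsMax
{
    __device__ int operator()(int a, int b) const
    {
        return (a * a >= b * b) ? a : b;
    }
};

__global__ void CustomOpKernel(int* out) // hypothetical kernel
{
    int array[4] = {-5, 2, 3, -1};
    *out = cub::ThreadReduce(array, AbsMax{}); // *out = -5
}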
- Template Parameters:

  Input – [inferred] The data type to be reduced, having member operator[](int i); must be statically-sized (a size() method or a static array)

  ReductionOp – [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b)
- Parameters:
input – [in] Array-like input
reduction_op – [in] Binary reduction operator
- Returns:
Accumulation of type AccumT, which is (simplified) decltype(reduction_op(a, b)); see P2322
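A minimal sketch of the accumulator-type rule under the simplified decltype formulation above (the kernel name is hypothetical): adding two int16_t values applies integral promotion, so the deduced result type is int rather than int16_t:

#include <cub/cub.cuh>
#include <cuda/std/cstdint>
#include <cuda/std/type_traits>

__global__ void AccumTypeKernel() // hypothetical kernel
{
    ::cuda::std::int16_t array[4] = {1, 2, 3, 4};
    // operator+ promotes int16_t operands to int, so the accumulator
    // (and therefore the return type) is int
    auto sum = cub::ThreadReduce(array, ::cuda::std::plus<>{});
    static_assert(::cuda::std::is_same_v<decltype(sum), int>, "");
    (void)sum;
}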