cub::ThreadReduce#
template <typename Input,
          typename ReductionOp,
          typename ValueT = ::cuda::std::iter_value_t<Input>,
          typename AccumT = ::cuda::std::__accumulator_t<ReductionOp, ValueT>>
AccumT cub::ThreadReduce(const Input& input, ReductionOp reduction_op)
Reduction over statically-sized array-like types.
The ThreadReduce function computes a reduction of items assigned to a single CUDA thread.

Overview#
A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.
Supports array-like types that are statically-sized and can be indexed with the [] operator: raw arrays, std::array, std::span, and std::mdspan (C++23).
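A minimal sketch of reducing different array-like inputs. It uses cuda::std::array and cuda::std::span as device-friendly stand-ins for their std counterparts; the kernel name and output parameter are hypothetical:

#include <cub/cub.cuh>
#include <cuda/std/array>
#include <cuda/std/span>

__global__ void ArrayLikeKernel(int* out) // hypothetical kernel
{
    // Raw array: the size is part of the type
    int raw[4] = {1, 2, 3, 4};

    // Statically-sized container with operator[] and a constexpr size()
    ::cuda::std::array<int, 4> arr = {5, 6, 7, 8};

    // Fixed-extent span viewing the raw array
    ::cuda::std::span<int, 4> spn{raw};

    out[0] = cub::ThreadReduce(raw, ::cuda::std::plus<>{}); // 10
    out[1] = cub::ThreadReduce(arr, ::cuda::std::plus<>{}); // 26
    out[2] = cub::ThreadReduce(spn, ::cuda::std::plus<>{}); // 10
}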
Main Function and Overloading#
Two overloads are provided: a plain reduction over statically-sized array-like types, and one seeded with a specified prefix (a usage sketch follows the declarations):
template <typename Input,
          typename ReductionOp,
          typename ValueT = ..., // type of a single input element
          typename AccumT = ...> // accumulator type
[[nodiscard]] __device__ __forceinline__
AccumT ThreadReduce(const Input& input, ReductionOp reduction_op)
template <typename Input,
          typename ReductionOp,
          typename PrefixT,
          typename ValueT = ..., // type of a single input element
          typename AccumT = ...> // accumulator type
[[nodiscard]] __device__ __forceinline__
AccumT ThreadReduce(const Input& input, ReductionOp reduction_op, PrefixT prefix)
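A minimal sketch of the prefix-seeded overload (the kernel name and output parameter are hypothetical); the prefix participates in the accumulation as an initial value:

#include <cub/cub.cuh>

__global__ void PrefixKernel(int* out) // hypothetical kernel
{
    int array[4] = {1, 2, 3, 4};
    // The prefix 100 seeds the accumulation: 100 + 1 + 2 + 3 + 4
    out[0] = cub::ThreadReduce(array, ::cuda::std::plus<>{}, 100); // 110
}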
Performance Considerations#
The function provides the following optimizations (a usage sketch follows the list):
Vectorization/SIMD for:

- Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM90+ for int16_t/uint16_t data types (Hopper DPX instructions)
- Sum (cuda::std::plus<>) and Multiplication (cuda::std::multiplies<>) on SM80+ for the __nv_bfloat16 data type
- Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM80+ for __half/__nv_bfloat16 data types
- Sum (cuda::std::plus<>) and Multiplication (cuda::std::multiplies<>) on SM70+ for the __half data type
Instruction-Level Parallelism (ILP) by exploiting a ternary tree reduction for:
- Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM90+ for int32_t/uint32_t data types (Hopper DPX instructions)
- Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM90+ for integer data types (Hopper DPX instructions), as well as __half2, __nv_bfloat162, __half (after vectorization), and __nv_bfloat16 (after vectorization) data types
- Sum (cuda::std::plus<>), Bitwise AND (cuda::std::bit_and<>), OR (cuda::std::bit_or<>), and XOR (cuda::std::bit_xor<>) on SM50+ for integer data types
Instruction-Level Parallelism (ILP) by exploiting a binary tree reduction for:

- All other cases that map to predefined operators
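These optimizations are selected automatically from the operator, the element type, and the target architecture; the call site is identical everywhere. A minimal sketch, with a hypothetical kernel, of a minimum reduction that can map to Hopper DPX instructions on SM90+:

#include <cub/cub.cuh>
#include <cuda/functional> // cuda::minimum

__global__ void MinKernel(int* out) // hypothetical kernel
{
    // On SM90+ this minimum reduction over int32_t can use DPX
    // instructions; on older architectures it falls back to a tree
    // reduction of ordinary min operations.
    int array[4] = {7, 3, 9, 1};
    *out = cub::ThreadReduce(array, ::cuda::minimum<>{}); // *out = 1
}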
Simple Example#
The code snippet below illustrates a simple sum reduction over 4 integer values.
#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    int array[4] = {1, 2, 3, 4};
    int sum = cub::ThreadReduce(array, ::cuda::std::plus<>{}); // sum = 10
}
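User-defined operators are also supported; they take the binary tree reduction path described above. A minimal sketch with a hypothetical AbsMax functor:

#include <cub/cub.cuh>

// Hypothetical operator: keep the value with the larger magnitude
struct AbsMax
{
    __device__ int operator()(int a, int b) const
    {
        return (a * a >= b * b) ? a : b;
    }
};

__global__ void CustomOpKernel(int* out) // hypothetical kernel
{
    int array[4] = {-5, 2, 3, -1};
    *out = cub::ThreadReduce(array, AbsMax{}); // *out = -5
}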
- Template Parameters:

  Input – [inferred] The data type to be reduced, having member operator[](int i); must be statically-sized (a size() method or a static array)

  ReductionOp – [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b)
- Parameters:
input – [in] Array-like input
reduction_op – [in] Binary reduction operator
- Returns:
Accumulation of type AccumT, which is (simplified) decltype(reduction_op(a, b)); see P2322
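A minimal sketch of the accumulator-type rule under the simplified decltype formulation above (the kernel name is hypothetical): adding two int16_t values applies integral promotion, so the deduced result type is int rather than int16_t:

#include <cub/cub.cuh>
#include <cuda/std/cstdint>
#include <cuda/std/type_traits>

__global__ void AccumTypeKernel() // hypothetical kernel
{
    ::cuda::std::int16_t array[4] = {1, 2, 3, 4};
    // operator+ promotes int16_t operands to int, so the accumulator
    // (and therefore the return type) is int
    auto sum = cub::ThreadReduce(array, ::cuda::std::plus<>{});
    static_assert(::cuda::std::is_same_v<decltype(sum), int>, "");
    (void)sum;
}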