cub::ThreadReduce
Defined in cub/thread/thread_reduce.cuh
template <typename Input,
          typename ReductionOp,
          typename ValueT = random_access_value_t<Input>,
          typename AccumT = ::cuda::std::__accumulator_t<ReductionOp, ValueT>>
AccumT cub::ThreadReduce(const Input& input, ReductionOp reduction_op)

Reduction over statically-sized array-like types.
The ThreadReduce function computes a reduction of items assigned to a single CUDA thread.

Overview
A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.
Supports array-like types that are statically sized and can be indexed with the [] operator: raw arrays, std::array, std::span, and std::mdspan (C++23).
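For instance, a reduction over a cuda::std::array (the libcu++ counterpart usable in device code) works the same way as over a raw array. A minimal sketch, with the kernel name being illustrative:

#include <cub/cub.cuh>
#include <cuda/std/array>

__global__ void ArrayReduceKernel()
{
    // Statically-sized, []-indexable input: cuda::std::array
    cuda::std::array<int, 4> items = {1, 2, 3, 4};
    int sum = cub::ThreadReduce(items, ::cuda::std::plus<>()); // sum = 10
}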
Overloading

An additional overload performs the reduction over statically-sized array-like types, seeded with the specified prefix:

template <typename Input,
          typename ReductionOp,
          typename PrefixT,
          typename ValueT = ::cuda::std::remove_cvref_t<decltype(::cuda::std::declval<Input>()[0])>,
          typename AccumT = ::cuda::std::__accumulator_t<ReductionOp, ValueT, PrefixT>>
_CCCL_NODISCARD _CCCL_DEVICE _CCCL_FORCEINLINE
AccumT ThreadReduce(const Input& input, ReductionOp reduction_op, PrefixT prefix)
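For example, seeding a sum reduction with a prefix folds the prefix into the aggregate. A minimal sketch, with the kernel name being illustrative:

#include <cub/cub.cuh>

__global__ void PrefixReduceKernel()
{
    int array[4] = {1, 2, 3, 4};
    // The prefix 10 seeds the reduction: 10 + 1 + 2 + 3 + 4
    int sum = cub::ThreadReduce(array, ::cuda::std::plus<>(), 10); // sum = 20
}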
Performance Considerations
The function provides the following optimizations:

- Vectorization/SIMD for:
  - Sum (cuda::std::plus<>) and Multiplication (cuda::std::multiplies<>) on SM70+ for the __half data type
  - Sum (cuda::std::plus<>) and Multiplication (cuda::std::multiplies<>) on SM80+ for the __nv_bfloat16 data type
  - Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM80+ for the __half and __nv_bfloat16 data types
  - Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM90+ for the int16_t and uint16_t data types (Hopper DPX instructions)
- Instruction-Level Parallelism (ILP) by exploiting a ternary tree reduction for:
  - Sum (cuda::std::plus<>), Bitwise AND (cuda::std::bit_and<>), OR (cuda::std::bit_or<>), and XOR (cuda::std::bit_xor<>) on SM50+ for integer data types
  - Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM80+ for the __half2, __nv_bfloat162, __half (after vectorization), and __nv_bfloat16 (after vectorization) data types
  - Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM90+ for integer data types (Hopper DPX instructions)
- Instruction-Level Parallelism (ILP) by exploiting a binary tree reduction for all other cases
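As an illustration of the SIMD-eligible path listed above, the snippet below sums __half values with cuda::std::plus<>, which on SM70+ qualifies for vectorization. A minimal sketch, with the kernel name being illustrative:

#include <cub/cub.cuh>
#include <cuda_fp16.h>

__global__ void HalfSumKernel()
{
    // Sum over __half values; eligible for the SIMD/vectorized path on SM70+
    __half items[4] = {__half(1.0f), __half(2.0f), __half(3.0f), __half(4.0f)};
    auto sum = cub::ThreadReduce(items, ::cuda::std::plus<>()); // sum = 10
}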
Simple Example
The code snippet below illustrates a simple sum reduction over 4 integer values.

#include <cub/cub.cuh>

__global__ void ExampleKernel(...)
{
    int array[4] = {1, 2, 3, 4};
    int sum = cub::ThreadReduce(array, ::cuda::std::plus<>()); // sum = 10
}
- Template Parameters
  - Input – [inferred] The data type to be reduced, having member operator[](int i); must be statically sized (a size() method or a static array)
  - ReductionOp – [inferred] Binary reduction operator type having member T operator()(const T &a, const T &b) (see the sketch after the Returns section)
- Parameters
  - input – [in] Array-like input
  - reduction_op – [in] Binary reduction operator
- Returns
  Aggregate of type cuda::std::__accumulator_t<ReductionOp, ValueT>
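To satisfy the ReductionOp requirement above, a user-defined functor only needs the binary operator() member. A minimal sketch, where the AbsMax functor and kernel name are hypothetical:

#include <cub/cub.cuh>

// Hypothetical user-defined operator satisfying the ReductionOp requirement:
// a binary operator() that combines two values into one
struct AbsMax
{
    __device__ int operator()(const int& a, const int& b) const
    {
        int x = a < 0 ? -a : a;
        int y = b < 0 ? -b : b;
        return x > y ? x : y;
    }
};

__global__ void AbsMaxKernel()
{
    int items[4] = {-7, 2, 5, -3};
    int result = cub::ThreadReduce(items, AbsMax{}); // result = 7
}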