cub::ThreadReduce

Defined in cub/thread/thread_reduce.cuh

template <typename Input,
          typename ReductionOp,
          typename ValueT = random_access_value_t<Input>,
          typename AccumT = ::cuda::std::__accumulator_t<ReductionOp, ValueT>>
AccumT cub::ThreadReduce(const Input &input, ReductionOp reduction_op)

Reduction over statically-sized array-like types.

The ThreadReduce function computes a reduction of items assigned to a single CUDA thread.

Overview

  • A reduction (or fold) uses a binary combining operator to compute a single aggregate from a list of input elements.

  • Supports array-like types that are statically-sized and can be indexed with the [] operator: raw arrays, std::array, std::span, and std::mdspan (C++23), as shown in the sketch below.
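
A minimal sketch of two of these input categories (kernel name and values are illustrative; it assumes cuda::std::array is usable in device code):

#include <cub/cub.cuh>

#include <cuda/std/array>
#include <cuda/std/functional>

__global__ void ArrayLikeInputsKernel()
{
    // Raw array: the extent is known statically from the array type
    int raw[4] = {1, 2, 3, 4};
    int sum_raw = cub::ThreadReduce(raw, ::cuda::std::plus<>()); // sum_raw = 10

    // cuda::std::array: statically sized and indexable with operator[]
    ::cuda::std::array<int, 4> arr{{1, 2, 3, 4}};
    int sum_arr = cub::ThreadReduce(arr, ::cuda::std::plus<>()); // sum_arr = 10
}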

Overloading

Reduction over statically-sized array-like types, seeded with the specified prefix

template <typename Input,
          typename ReductionOp,
          typename PrefixT,
          typename ValueT = ::cuda::std::remove_cvref_t<decltype(::cuda::std::declval<Input>()[0])>,
          typename AccumT = ::cuda::std::__accumulator_t<ReductionOp, ValueT, PrefixT>>
_CCCL_NODISCARD _CCCL_DEVICE _CCCL_FORCEINLINE AccumT
ThreadReduce(const Input& input, ReductionOp reduction_op, PrefixT prefix)
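
A hedged sketch of the prefix-seeded overload (kernel name and values are illustrative). Per the declared default AccumT, the prefix type participates in the accumulator type, so an int array with a double prefix should produce a double aggregate:

#include <cub/cub.cuh>

#include <cuda/std/functional>

__global__ void ExamplePrefixKernel()
{
    int array[4] = {1, 2, 3, 4};

    // Seed the reduction with a prefix: aggregate = 100 + (1 + 2 + 3 + 4)
    int total = cub::ThreadReduce(array, ::cuda::std::plus<>(), 100); // total = 110

    // Assumed promotion via __accumulator_t<ReductionOp, ValueT, PrefixT>:
    // an int array combined with a double prefix yields a double aggregate
    double dtotal = cub::ThreadReduce(array, ::cuda::std::plus<>(), 0.5); // dtotal = 10.5
}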

Performance Considerations

The function provides the following optimizations:

  • Vectorization/SIMD for

    • Sum (cuda::std::plus<>) and Multiplication (cuda::std::multiplies<>) on SM70+ for __half data type

    • Sum (cuda::std::plus<>) and Multiplication (cuda::std::multiplies<>) on SM80+ for __nv_bfloat16 data type

    • Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM80+ for __half and __nv_bfloat16 data types

    • Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM90+ for int16_t and uint16_t data types (Hopper DPX instructions)

  • Instruction-Level Parallelism (ILP) by exploiting a ternary tree reduction for

    • Sum (cuda::std::plus<>), Bitwise AND (cuda::std::bit_and<>), OR (cuda::std::bit_or<>), XOR (cuda::std::bit_xor<>) on SM50+ for integer data types

    • Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM80+ for __half2, __nv_bfloat162, __half (after vectorization), and __nv_bfloat16 (after vectorization) data types

    • Minimum (cuda::minimum<>) and Maximum (cuda::maximum<>) on SM90+ for integer data types (Hopper DPX instructions)

  • Instruction-Level Parallelism (ILP) by exploiting a binary tree reduction for all other cases
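
As an illustration of the vectorization bullets above, a sketch of a minimum reduction over __half that should be eligible for the SIMD path when compiled for SM80+ (kernel name and values are illustrative; cuda::minimum<> is assumed to come from <cuda/functional>):

#include <cub/cub.cuh>

#include <cuda/functional> // ::cuda::minimum
#include <cuda_fp16.h>     // __half, __float2half

__global__ void HalfMinimumKernel()
{
    // Minimum over __half values: on SM80+ this reduction is eligible for
    // the vectorized (__half2) path described above
    __half values[4] = {__float2half(1.0f), __float2half(7.0f),
                        __float2half(2.0f), __float2half(9.0f)};
    __half lo = cub::ThreadReduce(values, ::cuda::minimum<>()); // lo = 1.0
}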

Simple Example

The code snippet below illustrates a simple sum reduction over four integer values.

#include <cub/cub.cuh>

#include <cuda/std/functional> // ::cuda::std::plus

__global__ void ExampleKernel(...)
{
    int array[4] = {1, 2, 3, 4};
    int sum      = cub::ThreadReduce(array, ::cuda::std::plus<>()); // sum = 10
}

Template Parameters
  • Input[inferred] The type of the input to be reduced; it must be statically sized (a built-in array or a type with a size() member) and indexable via member operator[](int i)

  • ReductionOp[inferred] Binary reduction operator type having member T operator()(const T &a, const T &b)
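
To make the ReductionOp requirement concrete, a sketch with a hypothetical user-defined operator; AbsMax is not part of CUB and merely matches the T operator()(const T &a, const T &b) shape:

#include <cub/cub.cuh>

// Hypothetical operator for illustration: returns the larger absolute value.
// It satisfies the documented requirement T operator()(const T &a, const T &b).
struct AbsMax
{
    __device__ int operator()(const int& a, const int& b) const
    {
        int aa = a < 0 ? -a : a;
        int bb = b < 0 ? -b : b;
        return aa > bb ? aa : bb;
    }
};

__global__ void CustomOpKernel()
{
    int array[4] = {3, -8, 5, -2};
    int m        = cub::ThreadReduce(array, AbsMax{}); // m = 8
}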

Parameters
  • input[in] Array-like input

  • reduction_op[in] Binary reduction operator

Returns

Aggregate of type cuda::std::__accumulator_t<ReductionOp, ValueT>; the prefix-seeded overload returns cuda::std::__accumulator_t<ReductionOp, ValueT, PrefixT>