cub::BlockReduceAlgorithm

Defined in cub/block/block_reduce.cuh

enum cub::BlockReduceAlgorithm

BlockReduceAlgorithm enumerates alternative algorithms for parallel reduction across a CUDA thread block.
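The algorithm is selected through the third template parameter of cub::BlockReduce (BLOCK_REDUCE_WARP_REDUCTIONS is the default). A minimal sketch of the selection pattern, assuming a 1D block of 128 threads and one integer input per thread; the kernel and buffer names are illustrative:

  #include <cub/block/block_reduce.cuh>

  __global__ void BlockSumKernel(const int* d_in, int* d_out)
  {
      // Specialize BlockReduce for int, 128 threads, and an explicit algorithm
      using BlockReduce = cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_RAKING>;

      // Shared memory required by the collective
      __shared__ typename BlockReduce::TempStorage temp_storage;

      int thread_data = d_in[blockIdx.x * blockDim.x + threadIdx.x];

      // Block-wide sum; the aggregate is valid only in thread 0
      int aggregate = BlockReduce(temp_storage).Sum(thread_data);

      if (threadIdx.x == 0)
          d_out[blockIdx.x] = aggregate;
  }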

Values:

enumerator BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY

Overview

An efficient “raking” reduction algorithm that only supports commutative reduction operators (true for most operations, e.g., addition).

Execution consists of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Threads in warps other than the first warp place their partial reductions into shared memory.

  2. Upsweep sequential reduction in shared memory. Threads within the first warp continue to accumulate by raking across segments of shared partial reductions.

  3. A warp-synchronous Kogge-Stone style reduction within the raking warp.

Performance Considerations

  • This variant performs less communication than BLOCK_REDUCE_RAKING and is preferable when the reduction operator is commutative. It applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS and can provide higher overall throughput across the GPU when suitably occupied. However, turn-around latency may be higher than that of BLOCK_REDUCE_WARP_REDUCTIONS, making this variant less desirable when the GPU is under-occupied.
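A sketch of when this variant applies: the sum below uses a commutative operator (addition) with four items per thread, exercising the register-phase upsweep; the kernel name, block size, and item count are illustrative assumptions:

  #include <cub/block/block_reduce.cuh>

  __global__ void CommutativeRakingSum(const int* d_in, int* d_out)
  {
      constexpr int ITEMS_PER_THREAD = 4;
      using BlockReduce =
          cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY>;
      __shared__ typename BlockReduce::TempStorage temp_storage;

      // Each thread contributes ITEMS_PER_THREAD inputs, reduced in registers first
      int items[ITEMS_PER_THREAD];
      int base = (blockIdx.x * blockDim.x + threadIdx.x) * ITEMS_PER_THREAD;
      for (int i = 0; i < ITEMS_PER_THREAD; ++i)
          items[i] = d_in[base + i];

      // Addition is commutative, so the cheaper commutative-only variant is safe
      int block_sum = BlockReduce(temp_storage).Sum(items);

      if (threadIdx.x == 0)
          d_out[blockIdx.x] = block_sum;
  }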

enumerator BLOCK_REDUCE_RAKING

Overview

An efficient “raking” reduction algorithm that supports commutative (e.g., addition) and non-commutative (e.g., string concatenation) reduction operators. Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

Execution consists of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.

  2. Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions.

  3. A warp-synchronous Kogge-Stone style reduction within the raking warp.

Performance Considerations

  • This variant performs more communication than BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY and is preferable only when the reduction operator is non-commutative. It applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS and can provide higher overall throughput across the GPU when suitably occupied. However, turn-around latency may be higher than that of BLOCK_REDUCE_WARP_REDUCTIONS, making this variant less desirable when the GPU is under-occupied.
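Because this variant preserves operand order, it can be used with a non-commutative operator. A sketch using composition of affine maps f(x) = a*x + b, which is associative but not commutative; the Affine type, ComposeOp functor, and kernel are illustrative assumptions, not CUB APIs:

  #include <cub/block/block_reduce.cuh>

  struct Affine { float a, b; };   // represents f(x) = a*x + b

  struct ComposeOp
  {
      __device__ Affine operator()(const Affine& f, const Affine& g) const
      {
          // Apply f first, then g: g(f(x)) = (g.a*f.a)*x + (g.a*f.b + g.b).
          // Swapping f and g gives a different map, so operand order matters.
          return Affine{g.a * f.a, g.a * f.b + g.b};
      }
  };

  __global__ void ComposeKernel(const Affine* d_in, Affine* d_out)
  {
      using BlockReduce = cub::BlockReduce<Affine, 128, cub::BLOCK_REDUCE_RAKING>;
      __shared__ typename BlockReduce::TempStorage temp_storage;

      Affine f = d_in[blockIdx.x * blockDim.x + threadIdx.x];

      // Composes the block's maps in thread-rank order; valid in thread 0 only
      Affine composed = BlockReduce(temp_storage).Reduce(f, ComposeOp{});

      if (threadIdx.x == 0)
          d_out[blockIdx.x] = composed;
  }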

enumerator BLOCK_REDUCE_WARP_REDUCTIONS

Overview

A quick “tiled warp-reductions” reduction algorithm that supports commutative (e.g., addition) and non-commutative (e.g., string concatenation) reduction operators.

Execution consists of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.

  2. A shallow but inefficient warp-synchronous Kogge-Stone style reduction within each warp.

  3. A propagation phase where the warp reduction outputs in each warp are updated with the aggregate from each preceding warp.

Performance Considerations

  • This variant applies more reduction operators than BLOCK_REDUCE_RAKING or BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY, which may result in lower overall throughput across the GPU. However, turn-around latency may be lower, making this variant useful when the GPU is under-occupied.
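Since this variant favors turn-around latency, a sketch of a small, latency-sensitive reduction; it also uses the Reduce overload that accepts a num_valid count for partially full blocks (the kernel name, MaxOp functor, and block size are illustrative assumptions):

  #include <cub/block/block_reduce.cuh>

  struct MaxOp
  {
      __device__ float operator()(float x, float y) const { return x > y ? x : y; }
  };

  __global__ void LowLatencyMax(const float* d_in, float* d_out, int num_valid)
  {
      // BLOCK_REDUCE_WARP_REDUCTIONS is also the default algorithm
      using BlockReduce =
          cub::BlockReduce<float, 128, cub::BLOCK_REDUCE_WARP_REDUCTIONS>;
      __shared__ typename BlockReduce::TempStorage temp_storage;

      float thread_data = (threadIdx.x < num_valid) ? d_in[threadIdx.x] : 0.0f;

      // Only the first num_valid inputs participate (num_valid <= block size)
      float block_max = BlockReduce(temp_storage).Reduce(thread_data, MaxOp{}, num_valid);

      if (threadIdx.x == 0)
          d_out[blockIdx.x] = block_max;
  }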