cub::BlockReduceAlgorithm

Defined in cub/block/block_reduce.cuh

enum cub::BlockReduceAlgorithm

BlockReduceAlgorithm enumerates alternative algorithms for parallel reduction across a CUDA thread block.
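The algorithm is selected through the third template parameter of cub::BlockReduce (BLOCK_REDUCE_WARP_REDUCTIONS is the default). A minimal sketch of the selection pattern, assuming a 1D block of 128 threads and one integer input per thread; the kernel and buffer names are illustrative:

  #include <cub/block/block_reduce.cuh>

  __global__ void BlockSumKernel(const int* d_in, int* d_out)
  {
      // Specialize BlockReduce for int, 128 threads, and an explicit algorithm
      using BlockReduce = cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_RAKING>;

      // Shared memory required by the collective
      __shared__ typename BlockReduce::TempStorage temp_storage;

      int thread_data = d_in[blockIdx.x * blockDim.x + threadIdx.x];

      // Block-wide sum; the aggregate is valid only in thread 0
      int aggregate = BlockReduce(temp_storage).Sum(thread_data);

      if (threadIdx.x == 0)
          d_out[blockIdx.x] = aggregate;
  }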

Values:

enumerator BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY

Overview

An efficient “raking” reduction algorithm that only supports commutative reduction operators (true for most operations, e.g., addition).

Execution consists of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Threads in warps other than the first warp place their partial reductions into shared memory.

  2. Upsweep sequential reduction in shared memory. Threads within the first warp continue to accumulate by raking across segments of shared partial reductions.

  3. A warp-synchronous Kogge-Stone style reduction within the raking warp.

Performance Considerations

  • This variant performs less communication than BLOCK_REDUCE_RAKING and is preferable when the reduction operator is commutative. It applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS and can provide higher overall throughput across the GPU when suitably occupied. However, turn-around latency may be higher than that of BLOCK_REDUCE_WARP_REDUCTIONS, making this variant less desirable when the GPU is under-occupied.
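A sketch of when this variant applies: the sum below uses a commutative operator (addition) with four items per thread, exercising the register-phase upsweep; the kernel name, block size, and item count are illustrative assumptions:

  #include <cub/block/block_reduce.cuh>

  __global__ void CommutativeRakingSum(const int* d_in, int* d_out)
  {
      constexpr int ITEMS_PER_THREAD = 4;
      using BlockReduce =
          cub::BlockReduce<int, 128, cub::BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY>;
      __shared__ typename BlockReduce::TempStorage temp_storage;

      // Each thread contributes ITEMS_PER_THREAD inputs, reduced in registers first
      int items[ITEMS_PER_THREAD];
      int base = (blockIdx.x * blockDim.x + threadIdx.x) * ITEMS_PER_THREAD;
      for (int i = 0; i < ITEMS_PER_THREAD; ++i)
          items[i] = d_in[base + i];

      // Addition is commutative, so the cheaper commutative-only variant is safe
      int block_sum = BlockReduce(temp_storage).Sum(items);

      if (threadIdx.x == 0)
          d_out[blockIdx.x] = block_sum;
  }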

enumerator BLOCK_REDUCE_RAKING

Overview

An efficient “raking” reduction algorithm that supports commutative (e.g., addition) and non-commutative (e.g., string concatenation) reduction operators. Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.

Execution consists of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.

  2. Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions.

  3. A warp-synchronous Kogge-Stone style reduction within the raking warp.

Performance Considerations

  • This variant performs more communication than BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY and is preferable only when the reduction operator is non-commutative. It applies fewer reduction operators than BLOCK_REDUCE_WARP_REDUCTIONS and can provide higher overall throughput across the GPU when suitably occupied. However, turn-around latency may be higher than that of BLOCK_REDUCE_WARP_REDUCTIONS, making this variant less desirable when the GPU is under-occupied.
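Because this variant preserves operand order, it can be used with a non-commutative operator. A sketch using composition of affine maps f(x) = a*x + b, which is associative but not commutative; the Affine type, ComposeOp functor, and kernel are illustrative assumptions, not CUB APIs:

  #include <cub/block/block_reduce.cuh>

  struct Affine { float a, b; };   // represents f(x) = a*x + b

  struct ComposeOp
  {
      __device__ Affine operator()(const Affine& f, const Affine& g) const
      {
          // Apply f first, then g: g(f(x)) = (g.a*f.a)*x + (g.a*f.b + g.b).
          // Swapping f and g gives a different map, so operand order matters.
          return Affine{g.a * f.a, g.a * f.b + g.b};
      }
  };

  __global__ void ComposeKernel(const Affine* d_in, Affine* d_out)
  {
      using BlockReduce = cub::BlockReduce<Affine, 128, cub::BLOCK_REDUCE_RAKING>;
      __shared__ typename BlockReduce::TempStorage temp_storage;

      Affine f = d_in[blockIdx.x * blockDim.x + threadIdx.x];

      // Composes the block's maps in thread-rank order; valid in thread 0 only
      Affine composed = BlockReduce(temp_storage).Reduce(f, ComposeOp{});

      if (threadIdx.x == 0)
          d_out[blockIdx.x] = composed;
  }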

enumerator BLOCK_REDUCE_WARP_REDUCTIONS

Overview

A quick “tiled warp-reductions” reduction algorithm that supports commutative (e.g., addition) and non-commutative (e.g., string concatenation) reduction operators.

Execution consists of three phases:
  1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.

  2. A shallow but inefficient warp-synchronous Kogge-Stone style reduction within each warp.

  3. A propagation phase where the warp reduction outputs in each warp are updated with the aggregate from each preceding warp.

Performance Considerations

  • This variant applies more reduction operators than BLOCK_REDUCE_RAKING or BLOCK_REDUCE_RAKING_COMMUTATIVE_ONLY, which may result in lower overall throughput across the GPU. However, turn-around latency may be lower, making this variant useful when the GPU is under-occupied.
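Since this variant favors turn-around latency, a sketch of a small, latency-sensitive reduction; it also uses the Reduce overload that accepts a num_valid count for partially full blocks (the kernel name, MaxOp functor, and block size are illustrative assumptions):

  #include <cub/block/block_reduce.cuh>

  struct MaxOp
  {
      __device__ float operator()(float x, float y) const { return x > y ? x : y; }
  };

  __global__ void LowLatencyMax(const float* d_in, float* d_out, int num_valid)
  {
      // BLOCK_REDUCE_WARP_REDUCTIONS is also the default algorithm
      using BlockReduce =
          cub::BlockReduce<float, 128, cub::BLOCK_REDUCE_WARP_REDUCTIONS>;
      __shared__ typename BlockReduce::TempStorage temp_storage;

      float thread_data = (threadIdx.x < num_valid) ? d_in[threadIdx.x] : 0.0f;

      // Only the first num_valid inputs participate (num_valid <= block size)
      float block_max = BlockReduce(temp_storage).Reduce(thread_data, MaxOp{}, num_valid);

      if (threadIdx.x == 0)
          d_out[blockIdx.x] = block_max;
  }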