cub::BlockScanAlgorithm

enum cub::BlockScanAlgorithm

BlockScanAlgorithm enumerates alternative algorithms for cub::BlockScan to compute a parallel prefix scan across a CUDA thread block.

Values:

enumerator BLOCK_SCAN_RAKING

Overview

An efficient “raking reduce-then-scan” prefix scan algorithm. Execution comprises five phases:

1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
2. Upsweep sequential reduction in shared memory. Threads within a single warp rake across segments of shared partial reductions.
3. A warp-synchronous Kogge-Stone style exclusive scan within the raking warp.
4. Downsweep sequential exclusive scan in shared memory. Threads within a single warp rake across segments of shared partial reductions, seeded with the warp-scan output.
5. Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the raking scan output.
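
The five phases above can be modeled on the host to make the data flow concrete. The following is a minimal sequential sketch, not CUB's implementation: the function name raking_exclusive_sum, its parameters, and the assumption that the block size is a multiple of the warp size (so every raking thread owns an equal segment) are all illustrative. It computes an exclusive prefix sum, with plain vectors standing in for shared memory and registers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a raking reduce-then-scan exclusive sum.
// Loops over `t` and `lane` stand in for threads that would run in parallel.
std::vector<int> raking_exclusive_sum(const std::vector<int>& input,
                                      std::size_t block_threads,
                                      std::size_t items_per_thread,
                                      std::size_t warp_size = 32)
{
    assert(input.size() == block_threads * items_per_thread);
    assert(block_threads % warp_size == 0);  // assumption: equal raking segments

    // Phase 1: each thread reduces its own items, then stores the partial
    // reduction in "shared memory".
    std::vector<int> shared(block_threads, 0);
    for (std::size_t t = 0; t < block_threads; ++t)
        for (std::size_t i = 0; i < items_per_thread; ++i)
            shared[t] += input[t * items_per_thread + i];

    // Phase 2: each thread of the raking warp reduces one contiguous segment
    // of shared partials.
    const std::size_t segment = block_threads / warp_size;
    std::vector<int> lane_total(warp_size, 0);
    for (std::size_t lane = 0; lane < warp_size; ++lane)
        for (std::size_t s = 0; s < segment; ++s)
            lane_total[lane] += shared[lane * segment + s];

    // Phase 3: Kogge-Stone scan across the raking warp. The copy `prev`
    // models every lane reading its neighbor before any lane writes.
    std::vector<int> inclusive = lane_total;
    for (std::size_t offset = 1; offset < warp_size; offset *= 2) {
        const std::vector<int> prev = inclusive;
        for (std::size_t lane = offset; lane < warp_size; ++lane)
            inclusive[lane] = prev[lane] + prev[lane - offset];
    }

    // Phase 4: each raking thread exclusive-scans its segment in shared
    // memory, seeded with the exclusive warp-scan result for its lane.
    for (std::size_t lane = 0; lane < warp_size; ++lane) {
        int running = (lane == 0) ? 0 : inclusive[lane - 1];
        for (std::size_t s = 0; s < segment; ++s) {
            const std::size_t idx = lane * segment + s;
            const int partial = shared[idx];
            shared[idx] = running;
            running += partial;
        }
    }

    // Phase 5: each thread exclusive-scans its own items in registers,
    // seeded with its prefix from shared memory.
    std::vector<int> output(input.size());
    for (std::size_t t = 0; t < block_threads; ++t) {
        int running = shared[t];
        for (std::size_t i = 0; i < items_per_thread; ++i) {
            output[t * items_per_thread + i] = running;
            running += input[t * items_per_thread + i];
        }
    }
    return output;
}
```

Calling raking_exclusive_sum on the values 1 through 16 with 8 threads, 2 items per thread, and a model warp size of 4 yields the same result as a serial exclusive sum. Only one warp performs phases 2 through 4, which is why the scan step count stays logarithmic in the warp size rather than in the block size.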

Performance Considerations

Although this variant may suffer longer turnaround latencies when the GPU is under-occupied, it can often provide higher overall throughput across the GPU when suitably occupied.

enumerator BLOCK_SCAN_RAKING_MEMOIZE

Overview

Similar to cub::BLOCK_SCAN_RAKING, but with fewer shared memory reads at the expense of higher register pressure. Raking threads preserve their “upsweep” segment of values in registers while performing the warp-synchronous scan, so the “downsweep” does not need to re-read them from shared memory.
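
The trade-off can be illustrated with a small host-side model of a single raking thread. This is a sketch under stated assumptions, not CUB's code: SharedMem, rake_segment, and the seed parameter (which stands in for the value the lane would receive from the warp-synchronous scan) are hypothetical names. The read counter shows the memoized path touching shared memory half as often, at the cost of holding the segment in a local buffer that models registers.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for shared memory that counts loads.
struct SharedMem {
    std::vector<int> data;
    std::size_t reads = 0;

    explicit SharedMem(std::vector<int> init) : data(std::move(init)) {}

    int load(std::size_t i) { ++reads; return data[i]; }
    void store(std::size_t i, int v) { data[i] = v; }
};

// Models one raking thread working on the segment [begin, begin + len).
// Returns the segment total produced by the upsweep.
int rake_segment(SharedMem& smem, std::size_t begin, std::size_t len,
                 int seed, bool memoize)
{
    std::vector<int> cached;  // models values kept in registers
    int total = 0;

    // Upsweep: reduce the segment, optionally keeping each value.
    for (std::size_t s = 0; s < len; ++s) {
        const int v = smem.load(begin + s);
        total += v;
        if (memoize) cached.push_back(v);
    }

    // (The warp-synchronous scan of `total` across lanes would run here
    // and produce `seed`.)

    // Downsweep: exclusive scan of the segment, seeded with `seed`.
    int running = seed;
    for (std::size_t s = 0; s < len; ++s) {
        const int v = memoize ? cached[s] : smem.load(begin + s);
        smem.store(begin + s, running);
        running += v;
    }
    return total;
}
```

Both settings of memoize write the same exclusive prefixes back to the segment; only the number of shared memory loads differs (2 × len without memoization, len with it).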

enumerator BLOCK_SCAN_WARP_SCANS

Overview

A quick “tiled warpscans” prefix scan algorithm. Execution comprises four phases:

1. Upsweep sequential reduction in registers (if threads contribute more than one input each). Each thread then places the partial reduction of its item(s) into shared memory.
2. A shallow but inefficient warp-synchronous Kogge-Stone style scan within each warp.
3. A propagation phase where the warp scan outputs in each warp are updated with the aggregate from each preceding warp.
4. Downsweep sequential scan in registers (if threads contribute more than one input), seeded with the propagated warp-scan output.
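
The four phases above can likewise be modeled sequentially on the host. As before, this is a hedged sketch rather than CUB's implementation: the function name warp_scans_exclusive_sum, its parameters, and the requirement that the block size be a multiple of the warp size are assumptions made for illustration. It computes an exclusive prefix sum.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a tiled warp-scans exclusive sum.
// Loops over `t` stand in for threads that would run in parallel.
std::vector<int> warp_scans_exclusive_sum(const std::vector<int>& input,
                                          std::size_t block_threads,
                                          std::size_t items_per_thread,
                                          std::size_t warp_size = 32)
{
    assert(input.size() == block_threads * items_per_thread);
    assert(block_threads % warp_size == 0);  // assumption: only full warps

    // Phase 1: each thread reduces its own items to one partial.
    std::vector<int> partial(block_threads, 0);
    for (std::size_t t = 0; t < block_threads; ++t)
        for (std::size_t i = 0; i < items_per_thread; ++i)
            partial[t] += input[t * items_per_thread + i];

    // Phase 2: Kogge-Stone inclusive scan inside every warp. The lane test
    // keeps each read within the reading thread's own warp.
    std::vector<int> inclusive = partial;
    for (std::size_t offset = 1; offset < warp_size; offset *= 2) {
        const std::vector<int> prev = inclusive;  // read-before-write snapshot
        for (std::size_t t = 0; t < block_threads; ++t)
            if (t % warp_size >= offset)
                inclusive[t] = prev[t] + prev[t - offset];
    }

    // Phase 3: add the aggregates of all preceding warps to each warp's
    // (exclusive) scan output.
    std::vector<int> thread_prefix(block_threads, 0);
    int warp_prefix = 0;
    for (std::size_t base = 0; base < block_threads; base += warp_size) {
        for (std::size_t lane = 0; lane < warp_size; ++lane) {
            const int exclusive = (lane == 0) ? 0 : inclusive[base + lane - 1];
            thread_prefix[base + lane] = warp_prefix + exclusive;
        }
        warp_prefix += inclusive[base + warp_size - 1];  // this warp's aggregate
    }

    // Phase 4: each thread exclusive-scans its own items in registers,
    // seeded with its propagated prefix.
    std::vector<int> output(input.size());
    for (std::size_t t = 0; t < block_threads; ++t) {
        int running = thread_prefix[t];
        for (std::size_t i = 0; i < items_per_thread; ++i) {
            output[t * items_per_thread + i] = running;
            running += input[t * items_per_thread + i];
        }
    }
    return output;
}
```

Unlike the raking model, every warp scans here, so there is no sequential raking pass over shared memory; the cost is that each warp performs a full Kogge-Stone scan, which does more total additions than a sequential reduction over the same values.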

Performance Considerations

Although this variant may suffer lower overall throughput across the GPU due to a heavy reliance on inefficient warpscans, it can often provide lower turnaround latencies when the GPU is under-occupied.