Warp-Wide “Collective” Primitives

CUB warp-level algorithms are specialized for execution by threads in the same CUDA warp. These algorithms may only be invoked by n consecutive threads of a single warp, where 1 <= n <= 32: