Block-Wide “Collective” Primitives

CUB block-level algorithms are specialized for execution by threads in the same CUDA thread block: