Block-Wide “Collective” Primitives

CUB block-level algorithms are specialized for execution by threads in the same CUDA thread block:

cub::BlockAdjacentDifference computes the difference between adjacent items partitioned across a CUDA thread block
cub::BlockDiscontinuity flags discontinuities within an ordered set of items partitioned across a CUDA thread block
cub::BlockExchange rearranges data partitioned across a CUDA thread block
cub::BlockHistogram constructs block-wide histograms from data samples partitioned across a CUDA thread block
cub::BlockLoad loads a linear segment of items from memory into a CUDA thread block
cub::BlockMergeSort sorts items partitioned across a CUDA thread block
cub::BlockRadixSort sorts items partitioned across a CUDA thread block using the radix sorting method
cub::BlockReduce computes a reduction of items partitioned across a CUDA thread block
cub::BlockRunLengthDecode decodes a run-length encoded sequence partitioned across a CUDA thread block
cub::BlockScan computes a prefix scan of items partitioned across a CUDA thread block
cub::BlockShuffle shifts items partitioned across a CUDA thread block
cub::BlockStore stores items partitioned across a CUDA thread block to a linear segment of memory
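
As a rough illustration of how several of these collectives are typically composed, the sketch below chains cub::BlockLoad, cub::BlockScan, and cub::BlockStore inside a single kernel to compute an exclusive prefix sum over one tile. The kernel name TileScanKernel, the helper run_tile_scan, and the tile shape (128 threads, 4 items per thread) are hypothetical choices made for this example, and CUDA error checking is omitted for brevity.

```cuda
#include <cub/cub.cuh>
#include <cstdio>
#include <vector>

// Hypothetical tile shape chosen for this sketch.
constexpr int BLOCK_THREADS    = 128;
constexpr int ITEMS_PER_THREAD = 4;
constexpr int TILE_ITEMS       = BLOCK_THREADS * ITEMS_PER_THREAD;

// One thread block loads a tile, scans it, and stores the result.
__global__ void TileScanKernel(const int* d_in, int* d_scan, int* d_total)
{
    // Specialize each collective for the block size and item type.
    using BlockLoad  = cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                                      cub::BLOCK_LOAD_WARP_TRANSPOSE>;
    using BlockScan  = cub::BlockScan<int, BLOCK_THREADS>;
    using BlockStore = cub::BlockStore<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                                       cub::BLOCK_STORE_WARP_TRANSPOSE>;

    // The collectives run one after another, so their shared memory can alias.
    __shared__ union {
        typename BlockLoad::TempStorage  load;
        typename BlockScan::TempStorage  scan;
        typename BlockStore::TempStorage store;
    } temp;

    int items[ITEMS_PER_THREAD];

    // Load a linear segment into a blocked arrangement across the block.
    BlockLoad(temp.load).Load(d_in, items);
    __syncthreads();  // required before reusing the aliased shared memory

    // Exclusive prefix sum across every item held by the block.
    int total = 0;
    BlockScan(temp.scan).ExclusiveSum(items, items, total);
    __syncthreads();

    // Write the scanned items back to a linear segment of memory.
    BlockStore(temp.store).Store(d_scan, items);
    if (threadIdx.x == 0) *d_total = total;
}

// Hypothetical host helper: scans a tile of ones and verifies the result.
bool run_tile_scan()
{
    std::vector<int> h_in(TILE_ITEMS, 1), h_scan(TILE_ITEMS, -1);
    int h_total = -1;

    int *d_in = nullptr, *d_scan = nullptr, *d_total = nullptr;
    cudaMalloc(&d_in, TILE_ITEMS * sizeof(int));
    cudaMalloc(&d_scan, TILE_ITEMS * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), TILE_ITEMS * sizeof(int), cudaMemcpyHostToDevice);

    TileScanKernel<<<1, BLOCK_THREADS>>>(d_in, d_scan, d_total);

    cudaMemcpy(h_scan.data(), d_scan, TILE_ITEMS * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_scan);
    cudaFree(d_total);

    // An exclusive sum over all ones gives scan[i] == i and total == TILE_ITEMS.
    bool ok = (h_total == TILE_ITEMS);
    for (int i = 0; i < TILE_ITEMS; ++i) ok = ok && (h_scan[i] == i);
    return ok;
}

int main()
{
    std::printf("%s\n", run_tile_scan() ? "ok" : "mismatch");
    return 0;
}
```

Each collective exposes a TempStorage type for the shared memory it needs. Placing them in a union, with a __syncthreads() between phases, is one common way to keep the shared-memory footprint small; giving each collective its own separate storage would also work, at the cost of more shared memory.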