Block-Wide “Collective” Primitives
CUB block-level algorithms are specialized for execution by threads in the same CUDA thread block:
- cub::BlockAdjacentDifference computes the difference between adjacent items partitioned across a CUDA thread block
- cub::BlockDiscontinuity flags discontinuities within an ordered set of items partitioned across a CUDA thread block
- cub::BlockExchange rearranges data partitioned across a CUDA thread block
- cub::BlockHistogram constructs block-wide histograms from data samples partitioned across a CUDA thread block
- cub::BlockLoad loads a linear segment of items from memory into a CUDA thread block
- cub::BlockMergeSort sorts items partitioned across a CUDA thread block
- cub::BlockRadixSort sorts items partitioned across a CUDA thread block using the radix sorting method
- cub::BlockReduce computes a reduction of items partitioned across a CUDA thread block
- cub::BlockRunLengthDecode decodes a run-length encoded sequence partitioned across a CUDA thread block
- cub::BlockScan computes a prefix scan of items partitioned across a CUDA thread block
- cub::BlockShuffle shifts items partitioned across a CUDA thread block
- cub::BlockStore stores items partitioned across a CUDA thread block to a linear segment of memory
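To illustrate how these collectives are typically combined inside a kernel, here is a minimal sketch that loads one tile per block with cub::BlockLoad and sums it with cub::BlockReduce. The kernel name BlockSumKernel, the host helper RunBlockSumExample, and the tile shape (128 threads, 4 items per thread) are hypothetical choices made for this example and do not come from the CUB documentation. Error checking on the CUDA runtime calls is omitted for brevity.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical tile shape, chosen only for illustration.
constexpr int kBlockThreads   = 128;
constexpr int kItemsPerThread = 4;

// Each thread block loads one tile of input, reduces it cooperatively,
// and thread 0 writes the block-wide sum.
__global__ void BlockSumKernel(const int* d_in, int* d_out)
{
    // Specialize the collectives for this block size and data type.
    using BlockLoadT   = cub::BlockLoad<int, kBlockThreads, kItemsPerThread,
                                        cub::BLOCK_LOAD_TRANSPOSE>;
    using BlockReduceT = cub::BlockReduce<int, kBlockThreads>;

    // Shared memory required by the collectives. The two are used at
    // different times, so their storage can overlap in a union.
    __shared__ union {
        BlockLoadT::TempStorage   load;
        BlockReduceT::TempStorage reduce;
    } temp_storage;

    // Per-thread items: together the block holds one tile.
    int items[kItemsPerThread];
    const int* tile = d_in + blockIdx.x * kBlockThreads * kItemsPerThread;
    BlockLoadT(temp_storage.load).Load(tile, items);

    // Barrier before the shared storage is reused by the reduction.
    __syncthreads();

    // The returned aggregate is only valid in thread 0.
    int block_sum = BlockReduceT(temp_storage.reduce).Sum(items);

    if (threadIdx.x == 0)
    {
        d_out[blockIdx.x] = block_sum;
    }
}

// Hypothetical host helper: fills num_blocks tiles with ones, launches the
// kernel, and returns the per-block sums. Runtime errors are not checked.
std::vector<int> RunBlockSumExample(int num_blocks)
{
    const int tile_items = kBlockThreads * kItemsPerThread;
    std::vector<int> h_in(static_cast<size_t>(num_blocks) * tile_items, 1);
    std::vector<int> h_out(num_blocks, 0);

    int* d_in  = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(int));
    cudaMalloc(&d_out, h_out.size() * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    BlockSumKernel<<<num_blocks, kBlockThreads>>>(d_in, d_out);

    // This copy synchronizes with the kernel on the default stream.
    cudaMemcpy(h_out.data(), d_out, h_out.size() * sizeof(int),
               cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling RunBlockSumExample(2) should yield two entries, each equal to kBlockThreads * kItemsPerThread, because every input item is 1. The pattern is the same for the other primitives in the list: specialize the class template for the block size, allocate its TempStorage in shared memory, and have every thread in the block call the collective method.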