Block-Wide “Collective” Primitives
CUB block-level algorithms are specialized for execution by threads in the same CUDA thread block:
cub::BlockAdjacentDifference
computes the difference between adjacent items partitioned across a CUDA thread block
cub::BlockDiscontinuity
flags discontinuities within an ordered set of items partitioned across a CUDA thread block
cub::BlockExchange
rearranges data partitioned across a CUDA thread block
cub::BlockHistogram
constructs block-wide histograms from data samples partitioned across a CUDA thread block
cub::BlockLoad
loads a linear segment of items from memory into a CUDA thread block
cub::BlockMergeSort
sorts items partitioned across a CUDA thread block
cub::BlockRadixSort
sorts items partitioned across a CUDA thread block using a radix sorting method
cub::BlockReduce
computes a reduction of items partitioned across a CUDA thread block
cub::BlockRunLengthDecode
decodes a run-length encoded sequence partitioned across a CUDA thread block
cub::BlockScan
computes a prefix scan of items partitioned across a CUDA thread block
cub::BlockShuffle
shifts items partitioned across a CUDA thread block
cub::BlockStore
stores items partitioned across a CUDA thread block to a linear segment of memory
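
To illustrate how these collectives are typically combined, here is a minimal sketch in which a single thread block loads a tile with cub::BlockLoad and sums it with cub::BlockReduce. The kernel name BlockSumKernel, the host helper BlockSumDemo, and the tile shape (128 threads, 4 items per thread) are assumptions made up for this example, not part of CUB. Error checking is omitted for brevity.

```cuda
#include <cub/cub.cuh>
#include <cstdio>
#include <vector>

// Hypothetical tile shape chosen for illustration only.
constexpr int kBlockThreads   = 128;
constexpr int kItemsPerThread = 4;
constexpr int kTileSize       = kBlockThreads * kItemsPerThread;

// Hypothetical kernel: each thread block sums one tile of kTileSize ints.
__global__ void BlockSumKernel(const int* d_in, int* d_out)
{
    // Specialize the collectives for this block size and item count.
    using BlockLoad   = cub::BlockLoad<int, kBlockThreads, kItemsPerThread,
                                       cub::BLOCK_LOAD_TRANSPOSE>;
    using BlockReduce = cub::BlockReduce<int, kBlockThreads>;

    // The collectives need shared scratch space; a union lets them share it.
    __shared__ union {
        typename BlockLoad::TempStorage   load;
        typename BlockReduce::TempStorage reduce;
    } temp_storage;

    // Each thread ends up with kItemsPerThread consecutive items of the tile.
    int items[kItemsPerThread];
    BlockLoad(temp_storage.load).Load(d_in + blockIdx.x * kTileSize, items);

    // Barrier before the shared storage is reused by the next collective.
    __syncthreads();

    // Every thread participates; the result is only valid in thread 0.
    int block_sum = BlockReduce(temp_storage.reduce).Sum(items);

    if (threadIdx.x == 0)
    {
        d_out[blockIdx.x] = block_sum;
    }
}

// Hypothetical host helper: sums 0 .. kTileSize-1 on the device.
int BlockSumDemo()
{
    std::vector<int> h_in(kTileSize);
    for (int i = 0; i < kTileSize; ++i)
    {
        h_in[i] = i;
    }

    int* d_in  = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, kTileSize * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), kTileSize * sizeof(int), cudaMemcpyHostToDevice);

    BlockSumKernel<<<1, kBlockThreads>>>(d_in, d_out);

    int h_out = -1;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    const int expected = kTileSize * (kTileSize - 1) / 2;
    const int got      = BlockSumDemo();
    std::printf("block sum = %d (expected %d)\n", got, expected);
    return got == expected ? 0 : 1;
}
```

The same pattern (specialize the class template, allocate its TempStorage in shared memory, then have every thread in the block call the member function) applies to the other block-level primitives listed above. The sketch should compile with nvcc against a CUDA Toolkit that bundles CUB.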