Warp-Wide “Collective” Primitives
CUB warp-level algorithms are specialized for execution by threads in the same CUDA warp.
These algorithms may only be invoked by 1 <= n <= 32
consecutive threads in the same warp:
cub::WarpExchange
rearranges data partitioned across a CUDA warp
cub::WarpLoad
loads a linear segment of items from memory into a CUDA warp
cub::WarpMergeSort
sorts items partitioned across a CUDA warp
cub::WarpReduce
computes reduction of items partitioned across a CUDA warp
cub::WarpScan
computes a prefix scan of items partitioned across a CUDA warp
cub::WarpStore
stores items partitioned across a CUDA warp to a linear segment of memory
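
As an illustration of how one of these primitives is typically invoked, the following is a minimal sketch using cub::WarpReduce to sum one value per thread across a single full warp. The names warp_sum_kernel and warp_sum_demo are hypothetical and chosen only for this example. It assumes a single block of 32 threads and omits CUDA error checking for brevity.

```cuda
#include <cub/cub.cuh>
#include <cstdio>

// Hypothetical kernel: one warp of 32 threads sums one int per thread.
__global__ void warp_sum_kernel(const int* in, int* out)
{
    // Specialize WarpReduce for int; the logical warp size defaults to 32.
    using WarpReduce = cub::WarpReduce<int>;

    // Each participating warp needs its own TempStorage in shared memory.
    // This sketch launches exactly one warp, so one instance is enough.
    __shared__ typename WarpReduce::TempStorage temp_storage;

    int thread_value = in[threadIdx.x];

    // The returned aggregate is only valid in lane 0 of the warp.
    int aggregate = WarpReduce(temp_storage).Sum(thread_value);

    if (threadIdx.x == 0) {
        *out = aggregate;
    }
}

// Hypothetical host helper: fills the input with 1..32, runs the kernel,
// and returns the warp-wide sum.
int warp_sum_demo()
{
    const int n = 32;
    int h_in[n];
    for (int i = 0; i < n; ++i) {
        h_in[i] = i + 1;
    }

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    // One block containing exactly one warp.
    warp_sum_kernel<<<1, n>>>(d_in, d_out);

    // This blocking copy also waits for the kernel to finish.
    int h_out = 0;
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    printf("warp sum = %d\n", warp_sum_demo());
    return 0;
}
```

On a working CUDA device this should report a sum of 528 (1 + 2 + ... + 32). The other warp-level classes listed above follow a similar pattern: specialize the class template, provide a TempStorage instance per warp, then call the collective method from every participating thread.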