Warp-Wide “Collective” Primitives
CUB warp-level algorithms are specialized for execution by threads in the same CUDA warp.
These algorithms may only be invoked by 1 <= n <= 32 consecutive threads in the same warp:
cub::WarpExchange rearranges data partitioned across a CUDA warp
cub::WarpLoad loads a linear segment of items from memory into a CUDA warp
cub::WarpMergeSort sorts items partitioned across a CUDA warp
cub::WarpReduce computes reduction of items partitioned across a CUDA warp
cub::WarpScan computes a prefix scan of items partitioned across a CUDA warp
cub::WarpStore stores items partitioned across a CUDA warp to a linear segment of memory
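As an illustration of how one of these primitives is typically invoked, the following is a minimal sketch using cub::WarpReduce with its default logical warp size of 32 threads. The kernel name `warp_sum_kernel`, the host helper `run_warp_sums`, and the constants `kWarpThreads`, `kWarpsPerBlock`, and `kBlockThreads` are hypothetical names chosen for this example. It assumes a single block of 128 threads, so each of the four warps reduces its own 32 inputs.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Hypothetical launch configuration for this sketch.
constexpr int kWarpThreads   = 32;
constexpr int kWarpsPerBlock = 4;
constexpr int kBlockThreads  = kWarpThreads * kWarpsPerBlock;

// Each warp sums the 32 values held by its own threads.
__global__ void warp_sum_kernel(const int* in, int* out)
{
    // Specialize WarpReduce for int; the logical warp size defaults to 32.
    using WarpReduce = cub::WarpReduce<int>;

    // One TempStorage instance per warp in the block.
    __shared__ typename WarpReduce::TempStorage temp_storage[kWarpsPerBlock];

    const int warp_id = threadIdx.x / kWarpThreads;
    const int lane_id = threadIdx.x % kWarpThreads;

    const int value = in[threadIdx.x];

    // Collective call: every thread of the warp participates.
    const int sum = WarpReduce(temp_storage[warp_id]).Sum(value);

    // The aggregate is only valid in lane 0 of each warp.
    if (lane_id == 0)
    {
        out[warp_id] = sum;
    }
}

// Host helper (hypothetical): fills the input with 0..127 and returns one sum per warp.
std::vector<int> run_warp_sums()
{
    std::vector<int> h_in(kBlockThreads);
    for (int i = 0; i < kBlockThreads; ++i)
    {
        h_in[i] = i;
    }

    int* d_in  = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, kBlockThreads * sizeof(int));
    cudaMalloc(&d_out, kWarpsPerBlock * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), kBlockThreads * sizeof(int), cudaMemcpyHostToDevice);

    warp_sum_kernel<<<1, kBlockThreads>>>(d_in, d_out);

    // This blocking copy on the default stream waits for the kernel to finish.
    std::vector<int> h_out(kWarpsPerBlock);
    cudaMemcpy(h_out.data(), d_out, kWarpsPerBlock * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    for (int s : run_warp_sums())
    {
        std::printf("%d\n", s);
    }
    return 0;
}
```

The other warp-level classes follow a similar pattern in this sketch's understanding: specialize the class template, allocate its `TempStorage` per warp, then call the collective member function from every participating thread. A smaller group of threads can be requested through the second template parameter, for example `cub::WarpReduce<int, 16>`.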