cub::WarpStore
Defined in cub/warp/warp_store.cuh
-
template<typename T, int ITEMS_PER_THREAD, WarpStoreAlgorithm ALGORITHM = WARP_STORE_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int LEGACY_PTX_ARCH = 0>
class WarpStore

The WarpStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA warp to a linear segment of memory.
Overview
The WarpStore class provides a single data movement abstraction that can be specialized to implement different cub::WarpStoreAlgorithm strategies. This facilitates different performance policies for different architectures, data types, granularity sizes, etc.
WarpStore can be optionally specialized by different data movement strategies:
cub::WARP_STORE_DIRECT: a blocked arrangement of data is written directly to memory.
cub::WARP_STORE_STRIPED: a striped arrangement of data is written directly to memory.
cub::WARP_STORE_VECTORIZE: a blocked arrangement of data is written directly to memory using CUDA’s built-in vectorized stores as a coalescing optimization.
cub::WARP_STORE_TRANSPOSE: a blocked arrangement is locally transposed into a striped arrangement which is then written to memory.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A Simple Example
The code snippet below illustrates the storing of a “blocked” arrangement of 64 integers across 16 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for WARP_STORE_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.

#include <cub/cub.cuh> // or equivalently <cub/warp/warp_store.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpStore for a virtual warp of 16 threads owning 4 integer items each
    using WarpStoreT = cub::WarpStore<int,
                                      items_per_thread,
                                      cub::WARP_STORE_TRANSPOSE,
                                      warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpStore
    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    WarpStoreT(temp_storage[warp_id]).Store(d_data + warp_id * tile_size, thread_data);
}
Suppose the set of thread_data across the warp threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }. The output d_data will be 0, 1, 2, 3, 4, 5, ....
Template Parameters
T – The type of data to be written.
ITEMS_PER_THREAD – The number of consecutive items partitioned onto each thread.
ALGORITHM – [optional] cub::WarpStoreAlgorithm tuning policy enumeration. default: cub::WARP_STORE_DIRECT.
LOGICAL_WARP_THREADS – [optional] The number of threads per “logical” warp (may be less than the number of hardware warp threads). Default is the warp size of the targeted CUDA compute-capability (e.g., 32 threads for SM86). Must be a power of two.
LEGACY_PTX_ARCH – Unused.
Collective constructors
-
inline WarpStore()
Collective constructor using a private static allocation of shared memory as temporary storage.
-
inline WarpStore(TempStorage &temp_storage)
Collective constructor using the specified memory allocation as temporary storage.
Data movement
-
template<typename OutputIteratorT>
inline void Store(OutputIteratorT block_itr, T (&items)[ITEMS_PER_THREAD])

Store items into a linear segment of memory.
A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet
The code snippet below illustrates the storing of a “blocked” arrangement of 64 integers across 16 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for WARP_STORE_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.

#include <cub/cub.cuh> // or equivalently <cub/warp/warp_store.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpStore for a virtual warp of 16 threads owning 4 integer items each
    using WarpStoreT = cub::WarpStore<int,
                                      items_per_thread,
                                      cub::WARP_STORE_TRANSPOSE,
                                      warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpStore
    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    WarpStoreT(temp_storage[warp_id]).Store(d_data + warp_id * tile_size, thread_data);
}
Suppose the set of thread_data across the warp threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }. The output d_data will be 0, 1, 2, 3, 4, 5, ....

Parameters
block_itr – [out] The thread block’s base output iterator for storing to
items – [in] Data to store
-
template<typename OutputIteratorT>
inline void Store(OutputIteratorT block_itr, T (&items)[ITEMS_PER_THREAD], int valid_items)

Store items into a linear segment of memory, guarded by range.
A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet
The code snippet below illustrates the storing of a “blocked” arrangement of 64 integers across 16 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for WARP_STORE_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.

#include <cub/cub.cuh> // or equivalently <cub/warp/warp_store.cuh>

__global__ void ExampleKernel(int *d_data, int valid_items, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpStore for a virtual warp of 16 threads owning 4 integer items each
    using WarpStoreT = cub::WarpStore<int,
                                      items_per_thread,
                                      cub::WARP_STORE_TRANSPOSE,
                                      warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpStore
    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    WarpStoreT(temp_storage[warp_id]).Store(d_data + warp_id * tile_size,
                                            thread_data,
                                            valid_items);
}
Suppose the set of thread_data across the warp threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] } and valid_items is 5. The output d_data will be 0, 1, 2, 3, 4, ?, ?, ..., with only the first two threads being unmasked to store portions of valid data.

Parameters
block_itr – [out] The thread block’s base output iterator for storing to
items – [in] Data to store
valid_items – [in] Number of valid items to write
-
struct TempStorage : public Uninitialized<_TempStorage>