cub::WarpStore
Defined in cub/warp/warp_store.cuh
-
template<typename T, int ITEMS_PER_THREAD, WarpStoreAlgorithm ALGORITHM = WARP_STORE_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int LEGACY_PTX_ARCH = 0>
class WarpStore

The WarpStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA warp to a linear segment of memory.
Overview
The WarpStore class provides a single data movement abstraction that can be specialized to implement different cub::WarpStoreAlgorithm strategies. This facilitates different performance policies for different architectures, data types, granularity sizes, etc.
WarpStore can be optionally specialized by different data movement strategies:
cub::WARP_STORE_DIRECT: a blocked arrangement of data is written directly to memory.
cub::WARP_STORE_STRIPED: a striped arrangement of data is written directly to memory.
cub::WARP_STORE_VECTORIZE: a blocked arrangement of data is written directly to memory using CUDA’s built-in vectorized stores as a coalescing optimization.
cub::WARP_STORE_TRANSPOSE: a blocked arrangement is locally transposed into a striped arrangement which is then written to memory.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A Simple Example
The code snippet below illustrates the storing of a “blocked” arrangement of 64 integers across 16 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for WARP_STORE_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.

#include <cub/cub.cuh> // or equivalently <cub/warp/warp_store.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpStore for a virtual warp of 16 threads owning 4 integer items each
    using WarpStoreT = cub::WarpStore<int,
                                      items_per_thread,
                                      cub::WARP_STORE_TRANSPOSE,
                                      warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpStore
    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    WarpStoreT(temp_storage[warp_id]).Store(d_data + warp_id * tile_size, thread_data);
}
Suppose the set of thread_data across the warp threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }. The output d_data will be 0, 1, 2, 3, 4, 5, ....
Template Parameters
T – The type of data to be written.
ITEMS_PER_THREAD – The number of consecutive items partitioned onto each thread.
ALGORITHM – [optional] cub::WarpStoreAlgorithm tuning policy enumeration. default: cub::WARP_STORE_DIRECT.
LOGICAL_WARP_THREADS – [optional] The number of threads per “logical” warp (may be less than the number of hardware warp threads). Default is the warp size of the targeted CUDA compute-capability (e.g., 32 threads for SM86). Must be a power of two.
LEGACY_PTX_ARCH – Unused.
Collective constructors
-
inline WarpStore()
Collective constructor using a private static allocation of shared memory as temporary storage.
-
inline WarpStore(TempStorage &temp_storage)
Collective constructor using the specified memory allocation as temporary storage.
Data movement
-
template<typename OutputIteratorT>
inline void Store(OutputIteratorT block_itr, T (&items)[ITEMS_PER_THREAD])

Store items into a linear segment of memory.
A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet
The code snippet below illustrates the storing of a “blocked” arrangement of 64 integers across 16 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for WARP_STORE_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.

#include <cub/cub.cuh> // or equivalently <cub/warp/warp_store.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpStore for a virtual warp of 16 threads owning 4 integer items each
    using WarpStoreT = cub::WarpStore<int,
                                      items_per_thread,
                                      cub::WARP_STORE_TRANSPOSE,
                                      warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpStore
    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    WarpStoreT(temp_storage[warp_id]).Store(d_data + warp_id * tile_size, thread_data);
}
Suppose the set of thread_data across the warp threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }. The output d_data will be 0, 1, 2, 3, 4, 5, ....

Parameters
block_itr – [out] The thread block’s base output iterator for storing to
items – [in] Data to store
-
template<typename OutputIteratorT>
inline void Store(OutputIteratorT block_itr, T (&items)[ITEMS_PER_THREAD], int valid_items)

Store items into a linear segment of memory, guarded by range.
A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet
The code snippet below illustrates the storing of a “blocked” arrangement of 64 integers across 16 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for WARP_STORE_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.

#include <cub/cub.cuh> // or equivalently <cub/warp/warp_store.cuh>

__global__ void ExampleKernel(int *d_data, int valid_items, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpStore for a virtual warp of 16 threads owning 4 integer items each
    using WarpStoreT = cub::WarpStore<int,
                                      items_per_thread,
                                      cub::WARP_STORE_TRANSPOSE,
                                      warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpStore
    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    WarpStoreT(temp_storage[warp_id]).Store(d_data + warp_id * tile_size,
                                            thread_data,
                                            valid_items);
}
Suppose the set of thread_data across the warp threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] } and valid_items is 5. The output d_data will be 0, 1, 2, 3, 4, ?, ?, ..., with only the first two threads being unmasked to store portions of valid data.

Parameters
block_itr – [out] The thread block’s base output iterator for storing to
items – [in] Data to store
valid_items – [in] Number of valid items to write
-
struct TempStorage : public Uninitialized<_TempStorage>