cub::WarpStore

Defined in cub/warp/warp_store.cuh

template<typename T, int ITEMS_PER_THREAD, WarpStoreAlgorithm ALGORITHM = WARP_STORE_DIRECT, int LOGICAL_WARP_THREADS = CUB_PTX_WARP_THREADS, int LEGACY_PTX_ARCH = 0>
class WarpStore

The WarpStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA warp to a linear segment of memory.

Overview

A Simple Example

The code snippet below illustrates the storing of a “blocked” arrangement of 64 integers across 16 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for WARP_STORE_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.

#include <cub/cub.cuh>   // or equivalently <cub/warp/warp_store.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpStore for a virtual warp of 16 threads owning 4 integer items each
    using WarpStoreT = WarpStore<int,
                                 items_per_thread,
                                 cub::WARP_STORE_TRANSPOSE,
                                 warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpStore
    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    WarpStoreT(temp_storage[warp_id]).Store(d_data + warp_id * tile_size, thread_data);

Suppose the set of thread_data across the warp threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }. The output d_data will be 0, 1, 2, 3, 4, 5, ....
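
For reference, a hedged end-to-end sketch of this example is shown below. The kernel is the snippet above with its elided parts filled in (each lane fabricates its own four consecutive values), while the host-side driver (buffer size, launch configuration, verification printf) is an assumption added for illustration rather than part of the documented interface.

#include <cuda_runtime.h>
#include <cub/cub.cuh>
#include <cstdio>

__global__ void ExampleKernel(int *d_data)
{
    constexpr int warp_threads     = 16;
    constexpr int block_threads    = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpStore for a virtual warp of 16 threads owning 4 integer items each
    using WarpStoreT = cub::WarpStore<int,
                                      items_per_thread,
                                      cub::WARP_STORE_TRANSPOSE,
                                      warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size      = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;
    const int lane_id = static_cast<int>(threadIdx.x) % warp_threads;

    // Allocate shared memory for WarpStore (one slot per logical warp)
    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Fabricate a blocked arrangement: lane 0 owns {0,1,2,3}, lane 1 owns {4,5,6,7}, ...
    int thread_data[items_per_thread];
    for (int i = 0; i < items_per_thread; ++i)
    {
        thread_data[i] = lane_id * items_per_thread + i;
    }

    // Store items to linear memory; each logical warp writes its own 64-item tile
    WarpStoreT(temp_storage[warp_id]).Store(d_data + warp_id * tile_size, thread_data);
}

int main()
{
    constexpr int block_threads    = 256;
    constexpr int items_per_thread = 4;
    constexpr int total_items      = block_threads * items_per_thread;  // 1024

    int *d_data = nullptr;
    cudaMalloc(&d_data, total_items * sizeof(int));

    ExampleKernel<<<1, block_threads>>>(d_data);
    cudaDeviceSynchronize();

    int h_data[total_items];
    cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);

    // Each 64-item tile should read 0, 1, 2, ..., 63
    printf("%d %d %d ... %d\n", h_data[0], h_data[1], h_data[2], h_data[63]);

    cudaFree(d_data);
    return 0;
}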

Template Parameters
  • T – The type of data to be written.

  • ITEMS_PER_THREAD – The number of consecutive items partitioned onto each thread.

  • ALGORITHM – [optional] cub::WarpStoreAlgorithm tuning policy enumeration. Default: cub::WARP_STORE_DIRECT.

  • LOGICAL_WARP_THREADS – [optional] The number of threads per “logical” warp (may be less than the number of hardware warp threads). Must be a power of two. Default is the warp size of the targeted CUDA compute capability (e.g., 32 threads for SM86).

  • LEGACY_PTX_ARCH – Unused.

Collective constructors

inline WarpStore()

Collective constructor using a private static allocation of shared memory as temporary storage.

inline WarpStore(TempStorage &temp_storage)

Collective constructor using the specified memory allocation as temporary storage.
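
The examples in this section all pass an explicit TempStorage to the second constructor. As a complement, the hedged sketch below uses the default constructor together with the default template parameters (cub::WARP_STORE_DIRECT over full 32-thread warps); the kernel name and fabricated data are assumptions made for illustration only.

#include <cub/cub.cuh>

__global__ void DefaultCtorKernel(int *d_data)
{
    constexpr int items_per_thread = 4;

    // Defaults: ALGORITHM = cub::WARP_STORE_DIRECT,
    //           LOGICAL_WARP_THREADS = 32 (the full hardware warp)
    using WarpStoreT = cub::WarpStore<int, items_per_thread>;

    constexpr int warp_threads = 32;
    constexpr int tile_size    = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;
    const int lane_id = static_cast<int>(threadIdx.x) % warp_threads;

    // Fabricate a blocked arrangement within each warp
    int thread_data[items_per_thread];
    for (int i = 0; i < items_per_thread; ++i)
    {
        thread_data[i] = lane_id * items_per_thread + i;
    }

    // Default-constructed collective: temporary storage is a private static
    // shared-memory allocation managed by the class itself. WARP_STORE_DIRECT
    // does not stage items through shared memory, so no per-warp slot is needed.
    WarpStoreT().Store(d_data + warp_id * tile_size, thread_data);
}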

Data movement

template<typename OutputIteratorT>
inline void Store(OutputIteratorT block_itr, T (&items)[ITEMS_PER_THREAD])

Store items into a linear segment of memory.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates the storing of a “blocked” arrangement of 64 integers across 16 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for WARP_STORE_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.

#include <cub/cub.cuh>   // or equivalently <cub/warp/warp_store.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpStore for a virtual warp of 16 threads owning 4 integer items each
    using WarpStoreT = WarpStore<int,
                                 items_per_thread,
                                 cub::WARP_STORE_TRANSPOSE,
                                 warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpStore
    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    WarpStoreT(temp_storage[warp_id]).Store(d_data + warp_id * tile_size, thread_data);

Suppose the set of thread_data across the warp threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] }. The output d_data will be 0, 1, 2, 3, 4, 5, ....

Parameters
  • block_itr – [out] The thread block’s base output iterator for storing to

  • items – [in] Data to store
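
To make the temporary-storage reuse requirement above concrete, the hedged sketch below writes two tiles per logical warp through the same temp_storage slot, separating the two Store calls with __syncwarp(); the two-tile output layout and identifiers are assumptions made for illustration.

#include <cub/cub.cuh>

__global__ void DoubleStoreKernel(int *d_data)
{
    constexpr int warp_threads     = 16;
    constexpr int block_threads    = 256;
    constexpr int items_per_thread = 4;

    using WarpStoreT = cub::WarpStore<int,
                                      items_per_thread,
                                      cub::WARP_STORE_TRANSPOSE,
                                      warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size      = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;
    const int lane_id = static_cast<int>(threadIdx.x) % warp_threads;

    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Fabricate two blocked tiles of data per logical warp
    int first_tile[items_per_thread];
    int second_tile[items_per_thread];
    for (int i = 0; i < items_per_thread; ++i)
    {
        first_tile[i]  = lane_id * items_per_thread + i;
        second_tile[i] = tile_size + lane_id * items_per_thread + i;
    }

    // Each logical warp writes two consecutive tiles of the output
    int *warp_out = d_data + warp_id * 2 * tile_size;

    WarpStoreT(temp_storage[warp_id]).Store(warp_out, first_tile);

    // WARP_STORE_TRANSPOSE stages items through temp_storage, so the warp
    // must synchronize before the same storage slot is reused
    __syncwarp();

    WarpStoreT(temp_storage[warp_id]).Store(warp_out + tile_size, second_tile);
}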

template<typename OutputIteratorT>
inline void Store(OutputIteratorT block_itr, T (&items)[ITEMS_PER_THREAD], int valid_items)

Store items into a linear segment of memory, guarded by range.

A subsequent __syncwarp() warp-wide barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates the storing of a “blocked” arrangement of 64 integers across 16 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for WARP_STORE_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.

#include <cub/cub.cuh>   // or equivalently <cub/warp/warp_store.cuh>

__global__ void ExampleKernel(int *d_data, int valid_items, ...)
{
    constexpr int warp_threads = 16;
    constexpr int block_threads = 256;
    constexpr int items_per_thread = 4;

    // Specialize WarpStore for a virtual warp of 16 threads owning 4 integer items each
    using WarpStoreT = WarpStore<int,
                                 items_per_thread,
                                 cub::WARP_STORE_TRANSPOSE,
                                 warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;

    // Allocate shared memory for WarpStore
    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    WarpStoreT(temp_storage[warp_id]).Store(
      d_data + warp_id * tile_size, thread_data, valid_items);

Suppose the set of thread_data across the warp threads is { [0,1,2,3], [4,5,6,7], ..., [60,61,62,63] } and valid_items is 5. The output d_data will be 0, 1, 2, 3, 4, ?, ?, ..., with only the first two threads being unmasked to store portions of valid data.

Parameters
  • block_itr – [out] The thread block’s base output iterator for storing to

  • items – [in] Data to store

  • valid_items – [in] Number of valid items to write
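
The hedged sketch below shows one common way valid_items is derived in practice: clamping the size of the final, possibly partial tile against the total element count. The num_items parameter and the clamping arithmetic are assumptions made for illustration, not part of the documented interface.

#include <cub/cub.cuh>

__global__ void GuardedStoreKernel(int *d_data, int num_items)
{
    constexpr int warp_threads     = 16;
    constexpr int block_threads    = 256;
    constexpr int items_per_thread = 4;

    using WarpStoreT = cub::WarpStore<int,
                                      items_per_thread,
                                      cub::WARP_STORE_TRANSPOSE,
                                      warp_threads>;

    constexpr int warps_in_block = block_threads / warp_threads;
    constexpr int tile_size      = items_per_thread * warp_threads;
    const int warp_id = static_cast<int>(threadIdx.x) / warp_threads;
    const int lane_id = static_cast<int>(threadIdx.x) % warp_threads;

    __shared__ typename WarpStoreT::TempStorage temp_storage[warps_in_block];

    // Fabricate a blocked arrangement of globally increasing values
    int thread_data[items_per_thread];
    for (int i = 0; i < items_per_thread; ++i)
    {
        thread_data[i] = warp_id * tile_size + lane_id * items_per_thread + i;
    }

    // The last tile may be partial: clamp the number of items this warp is
    // allowed to write so no store reaches past num_items
    const int tile_base   = warp_id * tile_size;
    const int remaining   = num_items - tile_base;
    const int valid_items = (remaining < tile_size) ? remaining : tile_size;

    if (valid_items > 0)  // uniform per logical warp, so all lanes participate
    {
        WarpStoreT(temp_storage[warp_id]).Store(d_data + tile_base, thread_data, valid_items);
    }
}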

struct TempStorage : public Uninitialized<_TempStorage>

The operations exposed by WarpStore require a temporary memory allocation of this nested TempStorage type for thread communication. This opaque storage can be allocated directly with the __shared__ keyword, as the examples above do.