cub::BlockStore
Defined in cub/block/block_store.cuh
-
template<typename T, int BLOCK_DIM_X, int ITEMS_PER_THREAD, BlockStoreAlgorithm ALGORITHM = BLOCK_STORE_DIRECT, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int LEGACY_PTX_ARCH = 0>
class BlockStore

The BlockStore class provides collective data movement methods for writing a blocked arrangement of items partitioned across a CUDA thread block to a linear segment of memory.
Overview
The BlockStore class provides a single data movement abstraction that can be specialized to implement different cub::BlockStoreAlgorithm strategies. This facilitates different performance policies for different architectures, data types, granularity sizes, etc.
BlockStore can be optionally specialized by different data movement strategies:
cub::BLOCK_STORE_DIRECT
: A blocked arrangement of data is written directly to memory.
cub::BLOCK_STORE_STRIPED
: A striped arrangement of data is written directly to memory.
cub::BLOCK_STORE_VECTORIZE
: A blocked arrangement of data is written directly to memory using CUDA’s built-in vectorized stores as a coalescing optimization.
cub::BLOCK_STORE_TRANSPOSE
: A blocked arrangement is locally transposed into a striped arrangement which is then written to memory.
cub::BLOCK_STORE_WARP_TRANSPOSE
: A blocked arrangement is locally transposed into a warp-striped arrangement which is then written to memory.
cub::BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED
: A blocked arrangement is locally transposed into a warp-striped arrangement which is then written to memory. To reduce the shared memory requirement, only one warp’s worth of shared memory is provisioned and is subsequently time-sliced among warps.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
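As an illustrative sketch (not part of the header itself), the aliases below specialize the same 128-thread, 4-items-per-thread configuration with several of these policies; the alias names are placeholders:

#include <cub/block/block_store.cuh>

// Same block shape and granularity; only the BlockStoreAlgorithm tuning policy differs.
using DirectStore        = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_DIRECT>;
using VectorizedStore    = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_VECTORIZE>;
using TransposeStore     = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_TRANSPOSE>;
using WarpTransposeStore = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_WARP_TRANSPOSE>;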
A Simple Example
Every thread in the block uses the BlockStore class by first specializing the BlockStore type, then instantiating an instance with parameters for communication, and finally invoking one or more collective member functions.
The code snippet below illustrates the storing of a “blocked” arrangement of 512 integers across 128 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for BLOCK_STORE_WARP_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_store.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    // Specialize BlockStore for a 1D block of 128 threads owning 4 integer items each
    using BlockStore = cub::BlockStore<int, 128, 4, BLOCK_STORE_WARP_TRANSPOSE>;

    // Allocate shared memory for BlockStore
    __shared__ typename BlockStore::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    BlockStore(temp_storage).Store(d_data, thread_data);
Suppose the set of thread_data across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. The output d_data will be 0, 1, 2, 3, 4, 5, ...
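For context, a minimal host-side launch of the kernel above might look like the sketch below; it assumes ExampleKernel takes only d_data (the trailing ... in the snippet is a documentation placeholder) and omits error checking:

// Hypothetical host-side driver (sketch only).
int *d_data = nullptr;
cudaMalloc(&d_data, 512 * sizeof(int));

// One block of 128 threads; each thread stores 4 consecutive items, covering all 512 slots.
ExampleKernel<<<1, 128>>>(d_data);
cudaDeviceSynchronize();

cudaFree(d_data);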
- Template Parameters
T – The type of data to be written.
BLOCK_DIM_X – The thread block length in threads along the X dimension
ITEMS_PER_THREAD – The number of consecutive items partitioned onto each thread.
ALGORITHM – [optional] cub::BlockStoreAlgorithm tuning policy enumeration (default: cub::BLOCK_STORE_DIRECT)
BLOCK_DIM_Y – [optional] The thread block length in threads along the Y dimension (default: 1)
BLOCK_DIM_Z – [optional] The thread block length in threads along the Z dimension (default: 1)
LEGACY_PTX_ARCH – [optional] Unused.
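As a hedged illustration of the optional block-shape parameters, the alias below specializes BlockStore for a hypothetical 32 x 4 (128-thread) 2D block; per the note above, threads are ranked in row-major order:

// 2D thread block: BLOCK_DIM_X = 32, BLOCK_DIM_Y = 4 (128 threads total),
// with each thread contributing ITEMS_PER_THREAD = 4 consecutive items.
using BlockStore2D =
    cub::BlockStore<float, 32, 4, cub::BLOCK_STORE_TRANSPOSE, /*BLOCK_DIM_Y*/ 4>;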
Collective constructors
-
inline BlockStore()
Collective constructor using a private static allocation of shared memory as temporary storage.
-
inline BlockStore(TempStorage &temp_storage)
Collective constructor using the specified memory allocation as temporary storage.
- Parameters
temp_storage[in] – Reference to memory allocation having layout type TempStorage
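To contrast the two constructor forms, the hypothetical kernels below use the default constructor (private static shared-memory allocation) and the explicit TempStorage form, respectively; the data values are placeholders:

#include <cub/block/block_store.cuh>

__global__ void StoreWithDefaultStorage(int *d_out)
{
    using BlockStore = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_WARP_TRANSPOSE>;
    int thread_data[4] = {0, 1, 2, 3};  // placeholder per-thread data

    // Default constructor: BlockStore provisions its own static shared memory.
    BlockStore().Store(d_out, thread_data);
}

__global__ void StoreWithExplicitStorage(int *d_out)
{
    using BlockStore = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_WARP_TRANSPOSE>;
    __shared__ typename BlockStore::TempStorage temp_storage;
    int thread_data[4] = {0, 1, 2, 3};  // placeholder per-thread data

    // Explicit form: the caller owns temp_storage and may reuse it after a barrier.
    BlockStore(temp_storage).Store(d_out, thread_data);
}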
Data movement
-
template<typename OutputIteratorT>
inline void Store(OutputIteratorT block_itr, T (&items)[ITEMS_PER_THREAD])

Store items into a linear segment of memory.
Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
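A minimal sketch of this reuse pattern, assuming a hypothetical kernel that writes two 512-item tiles back to back with the same temp_storage:

#include <cub/block/block_store.cuh>

__global__ void StoreTwoTiles(int *d_data)
{
    using BlockStore = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_WARP_TRANSPOSE>;
    __shared__ typename BlockStore::TempStorage temp_storage;

    int tile_a[4] = {0, 1, 2, 3};  // placeholder per-thread data
    int tile_b[4] = {4, 5, 6, 7};

    // First collective store uses temp_storage.
    BlockStore(temp_storage).Store(d_data, tile_a);

    // Barrier before temp_storage is reused by the second collective store.
    __syncthreads();

    BlockStore(temp_storage).Store(d_data + 512, tile_b);
}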
Snippet
The code snippet below illustrates the storing of a “blocked” arrangement of 512 integers across 128 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for BLOCK_STORE_WARP_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_store.cuh>

__global__ void ExampleKernel(int *d_data, ...)
{
    // Specialize BlockStore for a 1D block of 128 threads owning 4 integer items each
    using BlockStore = cub::BlockStore<int, 128, 4, BLOCK_STORE_WARP_TRANSPOSE>;

    // Allocate shared memory for BlockStore
    __shared__ typename BlockStore::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    BlockStore(temp_storage).Store(d_data, thread_data);
Suppose the set of thread_data across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] }. The output d_data will be 0, 1, 2, 3, 4, 5, ...

- Parameters
block_itr[out] – The thread block’s base output iterator for storing to
items[in] – Data to store
-
template<typename OutputIteratorT>
inline void Store(OutputIteratorT block_itr, T (&items)[ITEMS_PER_THREAD], int valid_items)

Store items into a linear segment of memory, guarded by range.
Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread i owns the i-th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates the guarded storing of a “blocked” arrangement of 512 integers across 128 threads (where each thread owns 4 consecutive items) into a linear segment of memory. The store is specialized for BLOCK_STORE_WARP_TRANSPOSE, meaning items are locally reordered among threads so that memory references will be efficiently coalesced using a warp-striped access pattern.

#include <cub/cub.cuh>   // or equivalently <cub/block/block_store.cuh>

__global__ void ExampleKernel(int *d_data, int valid_items, ...)
{
    // Specialize BlockStore for a 1D block of 128 threads owning 4 integer items each
    using BlockStore = cub::BlockStore<int, 128, 4, BLOCK_STORE_WARP_TRANSPOSE>;

    // Allocate shared memory for BlockStore
    __shared__ typename BlockStore::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Store items to linear memory
    BlockStore(temp_storage).Store(d_data, thread_data, valid_items);
Suppose the set of thread_data across the block of threads is { [0,1,2,3], [4,5,6,7], ..., [508,509,510,511] } and valid_items is 5. The output d_data will be 0, 1, 2, 3, 4, ?, ?, ?, ..., with only the first two threads being unmasked to store portions of valid data.

- Parameters
block_itr[out] – The thread block’s base output iterator for storing to
items[in] – Data to store
valid_items[in] – Number of valid items to write
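As a hedged sketch of using this guarded overload for a partial final tile, the hypothetical kernel below derives valid_items from a total element count that need not be a multiple of the tile size:

#include <cub/block/block_store.cuh>

__global__ void StoreGuardedTiles(int *d_out, int num_items)
{
    constexpr int BLOCK_THREADS    = 128;
    constexpr int ITEMS_PER_THREAD = 4;
    constexpr int TILE_SIZE        = BLOCK_THREADS * ITEMS_PER_THREAD;

    using BlockStore =
        cub::BlockStore<int, BLOCK_THREADS, ITEMS_PER_THREAD, cub::BLOCK_STORE_WARP_TRANSPOSE>;
    __shared__ typename BlockStore::TempStorage temp_storage;

    // Each block writes one tile; the last tile may be partial.
    int tile_offset = blockIdx.x * TILE_SIZE;
    int valid_items = num_items - tile_offset;
    if (valid_items > TILE_SIZE)
    {
        valid_items = TILE_SIZE;
    }

    int thread_data[ITEMS_PER_THREAD] = {};  // placeholder; a real kernel would compute or load these

    // Only the first valid_items outputs of this tile are written.
    BlockStore(temp_storage).Store(d_out + tile_offset, thread_data, valid_items);
}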
-
struct TempStorage : public Uninitialized<_TempStorage>
The operations exposed by BlockStore require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.
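A sketch of the union-based reuse mentioned above, pairing BlockStore’s TempStorage with that of a companion cub::BlockLoad so the two collectives share one shared-memory allocation (separated by a barrier):

#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>

__global__ void CopyTile(const int *d_in, int *d_out)
{
    using BlockLoad  = cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_WARP_TRANSPOSE>;
    using BlockStore = cub::BlockStore<int, 128, 4, cub::BLOCK_STORE_WARP_TRANSPOSE>;

    // The two temporary storages are never live at the same time,
    // so a union lets them occupy the same shared memory.
    __shared__ union
    {
        typename BlockLoad::TempStorage  load;
        typename BlockStore::TempStorage store;
    } temp_storage;

    int thread_data[4];
    BlockLoad(temp_storage.load).Load(d_in, thread_data);

    // Barrier before the storage is repurposed by BlockStore.
    __syncthreads();

    BlockStore(temp_storage.store).Store(d_out, thread_data);
}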