cub::StoreDirectBlockedVectorized

Defined in cub/block/block_store.cuh

template<typename T, int ITEMS_PER_THREAD> void cub::StoreDirectBlockedVectorized(int linear_tid, T *block_ptr, T (&items)[ITEMS_PER_THREAD])

Store a blocked arrangement of items across a thread block into a linear segment of items.

Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread_i owns the i^th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
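The blocked arrangement can be modelled on the host to see which output positions each thread writes. The sketch below is not CUB's implementation and performs no vectorization. The helper names row_major_tid, model_store_blocked and model_block_store are hypothetical, and the loop over threads stands in for threads that would run concurrently on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: row-major linear thread id for a 2D thread block.
inline int row_major_tid(int tid_x, int tid_y, int block_dim_x) {
    return tid_y * block_dim_x + tid_x;
}

// Host-side model of one thread's share of a blocked store: thread
// `linear_tid` writes its ITEMS_PER_THREAD items to the contiguous range
// that starts at linear_tid * ITEMS_PER_THREAD.
template <typename T, int ITEMS_PER_THREAD>
void model_store_blocked(int linear_tid, T* block_ptr,
                         const T (&items)[ITEMS_PER_THREAD]) {
    assert(linear_tid >= 0);
    const std::size_t base =
        static_cast<std::size_t>(linear_tid) * ITEMS_PER_THREAD;
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        block_ptr[base + i] = items[i];
    }
}

// Runs the model once per thread of a block_dim_x-by-block_dim_y block.
// Item i of thread t is tagged with the value t * 100 + i so the owner of
// every output slot is visible in the result.
template <int ITEMS_PER_THREAD>
std::vector<int> model_block_store(int block_dim_x, int block_dim_y) {
    const std::size_t total =
        static_cast<std::size_t>(block_dim_x) * block_dim_y * ITEMS_PER_THREAD;
    std::vector<int> out(total, -1);
    for (int y = 0; y < block_dim_y; ++y) {
        for (int x = 0; x < block_dim_x; ++x) {
            const int tid = row_major_tid(x, y, block_dim_x);
            int items[ITEMS_PER_THREAD];
            for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
                items[i] = tid * 100 + i;
            }
            model_store_blocked(tid, out.data(), items);
        }
    }
    return out;
}
```

For a 2x2 block with four items per thread, model_block_store<4>(2, 2) fills 16 slots: thread 0 owns slots 0 to 3, thread 1 owns slots 4 to 7, and so on in row-major thread order.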

The output address (block_ptr) must be quad-item aligned. The starting address returned by cudaMalloc() satisfies this by default.
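A caller who offsets into an allocation may want to check the alignment requirement before choosing the vectorized store. The sketch below reads "quad-item aligned" as "the address is a multiple of 4 * sizeof(T)"; that reading is an assumption, and is_quad_item_aligned is a hypothetical helper, not part of CUB.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true when `ptr` sits on a boundary of four
// items of type T, i.e. its address is a multiple of 4 * sizeof(T).
// Assumption: this is what "quad-item aligned" means in the text above.
template <typename T>
bool is_quad_item_aligned(const T* ptr) {
    const std::uintptr_t address = reinterpret_cast<std::uintptr_t>(ptr);
    return address % (4 * sizeof(T)) == 0;
}
```

Under this reading, a float pointer taken at the start of a 16-byte-aligned buffer passes the check, the same pointer advanced by one element does not, and advancing by four elements passes again.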

The following conditions will prevent vectorization and storing will fall back to cub::BLOCK_STORE_DIRECT:

ITEMS_PER_THREAD is odd

The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)

Template Parameters

T – [inferred] The data type to store.
ITEMS_PER_THREAD – [inferred] The number of consecutive items partitioned onto each thread.

Parameters

linear_tid – [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + threadIdx.x for 2D thread blocks)
block_ptr – [in] The thread block's base output pointer for storing to
items – [in] Data to store