cub::LoadDirectBlockedVectorized

template<typename T, int ItemsPerThread> void cub::LoadDirectBlockedVectorized( int linear_tid, T *block_src_ptr, T (&dst_items)[ItemsPerThread] )

Load a linear segment of items into a blocked arrangement across the thread block.

Added in version 2.2.0: First appears in CUDA Toolkit 12.3.

Assumes a blocked arrangement of (block-threads * items-per-thread) items across the thread block, where thread_i owns the i^th range of items-per-thread contiguous items. For multi-dimensional thread blocks, a row-major thread ordering is assumed.
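The blocked arrangement described above can be illustrated with a small host-side model. This is only a sketch in plain C++, not the CUB implementation: ModelLoadDirectBlocked is a hypothetical helper that copies the items a given thread would own, with no vectorization and no device code.

```cpp
#include <cstddef>

// Hypothetical host-side model, not the CUB implementation: copies the
// ItemsPerThread contiguous items that thread `linear_tid` owns in a
// blocked arrangement.
template <typename T, int ItemsPerThread>
void ModelLoadDirectBlocked(int linear_tid,
                            const T* block_src_ptr,
                            T (&dst_items)[ItemsPerThread])
{
    // Thread i owns items [i * ItemsPerThread, (i + 1) * ItemsPerThread).
    const T* thread_src =
        block_src_ptr + static_cast<std::ptrdiff_t>(linear_tid) * ItemsPerThread;

    for (int item = 0; item < ItemsPerThread; ++item)
    {
        dst_items[item] = thread_src[item];
    }
}
```

For example, with two threads, four items per thread, and a source segment holding 0 through 7, the model gives thread 0 the items 0, 1, 2, 3 and thread 1 the items 4, 5, 6, 7.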

The input pointer (block_src_ptr) must be quad-item aligned.
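A caller may want to check this requirement before launching a kernel. The sketch below assumes that "quad-item aligned" means the address is a multiple of 4 * sizeof(T); that reading is an assumption rather than something stated here, and IsQuadItemAligned is a hypothetical helper, not part of CUB.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of CUB. Assumes "quad-item aligned" means
// the address is a multiple of four items, i.e. 4 * sizeof(T) bytes.
template <typename T>
bool IsQuadItemAligned(const T* ptr)
{
    const std::uintptr_t address = reinterpret_cast<std::uintptr_t>(ptr);
    return address % (4 * sizeof(T)) == 0;
}
```

Under this assumption, a pointer into an int buffer that starts on a 16-byte boundary is quad-item aligned at element offsets 0, 4, 8, and so on, but not at offsets 1, 2, or 3.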

The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT:

ItemsPerThread is odd
The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2)
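The two conditions above can be approximated at compile time. The following is a rough sketch only: MayVectorize is a hypothetical helper, it uses std::is_arithmetic as a stand-in for "built-in primitive", and it does not recognize CUDA vector types such as int2 or float2, which would need the CUDA headers.

```cpp
#include <type_traits>

// Hypothetical compile-time approximation, not part of CUB. Returns false
// when ItemsPerThread is odd or when T is not a built-in arithmetic type.
// CUDA vector types (int2, float2, ...) are NOT recognized by this sketch.
template <typename T, int ItemsPerThread>
constexpr bool MayVectorize()
{
    return (ItemsPerThread % 2 == 0) && std::is_arithmetic<T>::value;
}

// Example checks under the assumptions stated above.
static_assert(MayVectorize<int, 4>(), "even item count, primitive type");
static_assert(!MayVectorize<int, 3>(), "odd item count falls back");
```

A predicate like this could be used to document or assert the intended configuration at the call site; it does not change which code path the library actually selects.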

Template Parameters:

T – [inferred] The data type to load.
ItemsPerThread – [inferred] The number of consecutive items partitioned onto each thread.

Parameters:

linear_tid – [in] A suitable 1D thread-identifier for the calling thread (e.g., (threadIdx.y * blockDim.x) + threadIdx.x for 2D thread blocks)
block_src_ptr – [in] The thread block’s base pointer to load from
dst_items – [out] Destination array to load data into