cub::BlockLoadAlgorithm
Defined in cub/block/block_load.cuh
-
enum cub::BlockLoadAlgorithm
cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.
Values:
-
enumerator BLOCK_LOAD_DIRECT
Overview
A blocked arrangement of data is read directly from memory.
Performance Considerations
The utilization of memory transactions (coalescing) decreases as the access stride between threads increases (i.e., the number items per thread).
-
enumerator BLOCK_LOAD_STRIPED
Overview
A striped arrangement of data is read directly from memory.
Performance Considerations
The utilization of memory transactions (coalescing) doesn’t depend on the number of items per thread.
-
enumerator BLOCK_LOAD_VECTORIZE
Overview
A blocked arrangement of data is read from memory using CUDA’s built-in vectorized loads as a coalescing optimization. For example,
ld.global.v4.s32
instructions will be generated whenT = int
andITEMS_PER_THREAD % 4 == 0
.Performance Considerations
The utilization of memory transactions (coalescing) remains high until the the access stride between threads (i.e., the number items per thread) exceeds the maximum vector load width (typically 4 items or 64B, whichever is lower).
The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT:
ITEMS_PER_THREAD
is oddThe
RandomAccessIterator
is not a simple pointer typeThe block input offset is not quadword-aligned
The data type
T
is not a built-in primitive or CUDA vector type (e.g.,short
,int2
,double
,float2
, etc.)
-
enumerator BLOCK_LOAD_TRANSPOSE
Overview
A striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.
Performance Considerations
The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
The local reordering incurs slightly longer latencies and throughput than the direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
-
enumerator BLOCK_LOAD_WARP_TRANSPOSE
Overview
A warp-striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.
Usage Considerations
BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
The local reordering incurs slightly larger latencies than the direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
Provisions more shared storage, but incurs smaller latencies than the BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED alternative.
-
enumerator BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED
Overview
Like
BLOCK_LOAD_WARP_TRANSPOSE
, a warp-striped arrangement of data is read directly from memory and then is locally transposed into a blocked arrangement. To reduce the shared memory requirement, only one warp’s worth of shared memory is provisioned and is subsequently time-sliced among warps.Usage Considerations
BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.
Provisions less shared memory temporary storage, but incurs larger latencies than the BLOCK_LOAD_WARP_TRANSPOSE alternative.
-
enumerator BLOCK_LOAD_DIRECT