BlockStoreAlgorithm
#
-
enum cub::BlockStoreAlgorithm#
cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory.
Values:
-
enumerator BLOCK_STORE_DIRECT#
Overview#
A blocked arrangement of data is written directly to memory.
Performance Considerations#
The utilization of memory transactions (coalescing) decreases as the access stride between threads increases (i.e., the number items per thread).
-
enumerator BLOCK_STORE_STRIPED#
Overview#
A striped arrangement of data is written directly to memory.
Performance Considerations#
The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
-
enumerator BLOCK_STORE_VECTORIZE#
Overview#
A blocked arrangement of data is written directly to memory using CUDA’s built-in vectorized stores as a coalescing optimization. For example,
st.global.v4.s32
instructions will be generated whenT = int
andITEMS_PER_THREAD % 4 == 0
.Performance Considerations#
The utilization of memory transactions (coalescing) remains high until the the access stride between threads (i.e., the number items per thread) exceeds the maximum vector store width (typically 4 items or 64B, whichever is lower).
The following conditions will prevent vectorization and writing will fall back to cub::BLOCK_STORE_DIRECT:
ITEMS_PER_THREAD
is oddThe
OutputIteratorT
is not a simple pointer typeThe block output offset is not quadword-aligned
The data type
T
is not a built-in primitive or CUDA vector type (e.g.,short
,int2
,double
,float2
, etc.)
-
enumerator BLOCK_STORE_TRANSPOSE#
Overview#
A blocked arrangement is locally transposed and then efficiently written to memory as a striped arrangement.
Performance Considerations#
The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
The local reordering incurs slightly longer latencies and throughput than the direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
-
enumerator BLOCK_STORE_WARP_TRANSPOSE#
Overview#
A blocked arrangement is locally transposed and then efficiently written to memory as a warp-striped arrangement.
Usage Considerations#
BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations#
The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
The local reordering incurs slightly longer latencies and throughput than the direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
-
enumerator BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED#
Overview#
A blocked arrangement is locally transposed and then efficiently written to memory as a warp-striped arrangement. To reduce the shared memory requirement, only one warp’s worth of shared memory is provisioned and is subsequently time-sliced among warps.
Usage Considerations#
BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations#
The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
Provisions less shared memory temporary storage, but incurs larger latencies than the BLOCK_STORE_WARP_TRANSPOSE alternative.
-
enumerator BLOCK_STORE_DIRECT#