cub::BlockStoreAlgorithm
Defined in cub/block/block_store.cuh
-
enum cub::BlockStoreAlgorithm
cub::BlockStoreAlgorithm enumerates alternative algorithms for cub::BlockStore to write a blocked arrangement of items across a CUDA thread block to a linear segment of memory.
Values:
-
enumerator BLOCK_STORE_DIRECT
Overview
A blocked arrangement of data is written directly to memory.
Performance Considerations
The utilization of memory transactions (coalescing) decreases as the access stride between threads increases (i.e., the number items per thread).
-
enumerator BLOCK_STORE_STRIPED
Overview
A striped arrangement of data is written directly to memory.
Performance Considerations
The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
-
enumerator BLOCK_STORE_VECTORIZE
Overview
A blocked arrangement of data is written directly to memory using CUDA’s built-in vectorized stores as a coalescing optimization. For example,
st.global.v4.s32
instructions will be generated whenT = int
andITEMS_PER_THREAD % 4 == 0
.Performance Considerations
The utilization of memory transactions (coalescing) remains high until the the access stride between threads (i.e., the number items per thread) exceeds the maximum vector store width (typically 4 items or 64B, whichever is lower).
The following conditions will prevent vectorization and writing will fall back to cub::BLOCK_STORE_DIRECT:
ITEMS_PER_THREAD
is oddThe
OutputIteratorT
is not a simple pointer typeThe block output offset is not quadword-aligned
The data type
T
is not a built-in primitive or CUDA vector type (e.g.,short
,int2
,double
,float2
, etc.)
-
enumerator BLOCK_STORE_TRANSPOSE
Overview
A blocked arrangement is locally transposed and then efficiently written to memory as a striped arrangement.
Performance Considerations
The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
The local reordering incurs slightly longer latencies and throughput than the direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
-
enumerator BLOCK_STORE_WARP_TRANSPOSE
Overview
A blocked arrangement is locally transposed and then efficiently written to memory as a warp-striped arrangement.
Usage Considerations
BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
The local reordering incurs slightly longer latencies and throughput than the direct cub::BLOCK_STORE_DIRECT and cub::BLOCK_STORE_VECTORIZE alternatives.
-
enumerator BLOCK_STORE_WARP_TRANSPOSE_TIMESLICED
Overview
A blocked arrangement is locally transposed and then efficiently written to memory as a warp-striped arrangement. To reduce the shared memory requirement, only one warp’s worth of shared memory is provisioned and is subsequently time-sliced among warps.
Usage Considerations
BLOCK_THREADS must be a multiple of WARP_THREADS
Performance Considerations
The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
Provisions less shared memory temporary storage, but incurs larger latencies than the BLOCK_STORE_WARP_TRANSPOSE alternative.
-
enumerator BLOCK_STORE_DIRECT