cub::WarpStoreAlgorithm
Defined in cub/warp/warp_store.cuh
-
enum cub::WarpStoreAlgorithm
cub::WarpStoreAlgorithm
enumerates alternative algorithms forcub::WarpStore
to write a blocked arrangement of items across a CUDA warp to a linear segment of memory.Values:
-
enumerator WARP_STORE_DIRECT
Overview
A blocked arrangement of data is written directly to memory.
Performance Considerations
The utilization of memory transactions (coalescing) decreases as the access stride between threads increases (i.e., the number items per thread).
-
enumerator WARP_STORE_STRIPED
Overview
A striped arrangement of data is written directly to memory.
Performance Considerations
The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
-
enumerator WARP_STORE_VECTORIZE
Overview
A blocked arrangement of data is written directly to memory using CUDA’s built-in vectorized stores as a coalescing optimization. For example,
st.global.v4.s32
instructions will be generated whenT = int
andITEMS_PER_THREAD % 4 == 0
.Performance Considerations
The utilization of memory transactions (coalescing) remains high until the the access stride between threads (i.e., the number items per thread) exceeds the maximum vector store width (typically 4 items or 64B, whichever is lower).
The following conditions will prevent vectorization and writing will fall back to
cub::WARP_STORE_DIRECT
:ITEMS_PER_THREAD
is oddThe
OutputIteratorT
is not a simple pointer typeThe block output offset is not quadword-aligned
The data type
T
is not a built-in primitive or CUDA vector type (e.g.,short
,int2
,double
,float2
, etc.)
-
enumerator WARP_STORE_TRANSPOSE
Overview
A blocked arrangement is locally transposed and then efficiently written to memory as a striped arrangement.
Performance Considerations
The utilization of memory transactions (coalescing) remains high regardless of items written per thread.
The local reordering incurs slightly longer latencies and throughput than the direct
cub::WARP_STORE_DIRECT
andcub::WARP_STORE_VECTORIZE
alternatives.
-
enumerator WARP_STORE_DIRECT