cub::WarpLoadAlgorithm

enum cub::WarpLoadAlgorithm

cub::WarpLoadAlgorithm enumerates alternative algorithms for cub::WarpLoad to read a linear segment of data from memory into a CUDA warp.

Values:
enumerator WARP_LOAD_DIRECT

Overview

A blocked arrangement of data is read directly from memory.

Performance Considerations

The utilization of memory transactions (coalescing) decreases as the access stride between threads (i.e., the number of items per thread) increases.
enumerator WARP_LOAD_STRIPED

Overview

A striped arrangement of data is read directly from memory.

Performance Considerations

The utilization of memory transactions (coalescing) does not depend on the number of items per thread.
enumerator WARP_LOAD_VECTORIZE

Overview

A blocked arrangement of data is read from memory using CUDA's built-in vectorized loads as a coalescing optimization. For example, ld.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations

The utilization of memory transactions (coalescing) remains high until the access stride between threads (i.e., the number of items per thread) exceeds the maximum vector load width (typically 4 items or 64B, whichever is lower).

The following conditions will prevent vectorization, causing loading to fall back to cub::WARP_LOAD_DIRECT:

- ITEMS_PER_THREAD is odd
- The InputIteratorT is not a simple pointer type
- The block input offset is not quadword-aligned
- The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
enumerator WARP_LOAD_TRANSPOSE

Overview

A striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.

Performance Considerations

The utilization of memory transactions (coalescing) remains high regardless of the number of items loaded per thread.

The local reordering incurs slightly higher latency and lower throughput than the direct cub::WARP_LOAD_DIRECT and cub::WARP_LOAD_VECTORIZE alternatives.
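A minimal usage sketch of this algorithm through the cub::WarpLoad interface follows. The kernel name, input pointer d_in, and the 256-thread block configuration are illustrative assumptions, not part of the enum's specification:

```cuda
#include <cub/warp/warp_load.cuh>

// Sketch: each warp of the block loads its own tile of 32 * 4 ints with
// WARP_LOAD_TRANSPOSE, keeping memory accesses coalesced while delivering
// a blocked arrangement to registers.
__global__ void ExampleKernel(int *d_in) // d_in: assumed device input
{
    constexpr int warp_threads     = 32;
    constexpr int block_threads    = 256;
    constexpr int items_per_thread = 4;

    using WarpLoadT = cub::WarpLoad<int, items_per_thread,
                                    cub::WARP_LOAD_TRANSPOSE, warp_threads>;

    constexpr int warps_per_block = block_threads / warp_threads;
    constexpr int tile_size       = warp_threads * items_per_thread;

    // WARP_LOAD_TRANSPOSE stages data through shared memory, so each
    // warp needs its own TempStorage.
    __shared__ typename WarpLoadT::TempStorage temp_storage[warps_per_block];

    const int warp_id = threadIdx.x / warp_threads;
    int thread_data[items_per_thread];
    WarpLoadT(temp_storage[warp_id])
        .Load(d_in + warp_id * tile_size, thread_data);

    // thread_data now holds a blocked arrangement: lane 0 has items 0..3
    // of its warp's tile, lane 1 has items 4..7, and so on.
}
```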