WarpStoreAlgorithm#

enum cub::WarpStoreAlgorithm#

cub::WarpStoreAlgorithm enumerates alternative algorithms for cub::WarpStore to write a blocked arrangement of items across a CUDA warp to a linear segment of memory.

Values:

enumerator WARP_STORE_DIRECT#

Overview#

A blocked arrangement of data is written directly to memory.

Performance Considerations#

The utilization of memory transactions (coalescing) decreases as the access stride between threads increases (i.e., the number items per thread).

enumerator WARP_STORE_STRIPED#

Overview#

A striped arrangement of data is written directly to memory.

Performance Considerations#

The utilization of memory transactions (coalescing) remains high regardless of items written per thread.

enumerator WARP_STORE_VECTORIZE#

Overview#

A blocked arrangement of data is written directly to memory using CUDA’s built-in vectorized stores as a coalescing optimization. For example, st.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations#

  • The utilization of memory transactions (coalescing) remains high until the the access stride between threads (i.e., the number items per thread) exceeds the maximum vector store width (typically 4 items or 64B, whichever is lower).

  • The following conditions will prevent vectorization and writing will fall back to cub::WARP_STORE_DIRECT:

    • ITEMS_PER_THREAD is odd

    • The OutputIteratorT is not a simple pointer type

    • The block output offset is not quadword-aligned

    • The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)

enumerator WARP_STORE_TRANSPOSE#

Overview#

A blocked arrangement is locally transposed and then efficiently written to memory as a striped arrangement.

Performance Considerations#

  • The utilization of memory transactions (coalescing) remains high regardless of items written per thread.

  • The local reordering incurs slightly longer latencies and throughput than the direct cub::WARP_STORE_DIRECT and cub::WARP_STORE_VECTORIZE alternatives.