cub::WarpStoreAlgorithm

Defined in /home/runner/work/cccl/cccl/cub/cub/warp/warp_store.cuh

enum cub::WarpStoreAlgorithm

cub::WarpStoreAlgorithm enumerates alternative algorithms for cub::WarpStore to write a blocked arrangement of items across a CUDA warp to a linear segment of memory.

Values:

enumerator WARP_STORE_DIRECT

Overview

A blocked arrangement of data is written directly to memory.

Performance Considerations

The utilization of memory transactions (coalescing) decreases as the access stride between threads increases (i.e., the number items per thread).

enumerator WARP_STORE_STRIPED

Overview

A striped arrangement of data is written directly to memory.

Performance Considerations

The utilization of memory transactions (coalescing) remains high regardless of items written per thread.

enumerator WARP_STORE_VECTORIZE

Overview

A blocked arrangement of data is written directly to memory using CUDA’s built-in vectorized stores as a coalescing optimization. For example, st.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations

  • The utilization of memory transactions (coalescing) remains high until the the access stride between threads (i.e., the number items per thread) exceeds the maximum vector store width (typically 4 items or 64B, whichever is lower).

  • The following conditions will prevent vectorization and writing will fall back to cub::WARP_STORE_DIRECT:

    • ITEMS_PER_THREAD is odd

    • The OutputIteratorT is not a simple pointer type

    • The block output offset is not quadword-aligned

    • The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)

enumerator WARP_STORE_TRANSPOSE

Overview

A blocked arrangement is locally transposed and then efficiently written to memory as a striped arrangement.

Performance Considerations

  • The utilization of memory transactions (coalescing) remains high regardless of items written per thread.

  • The local reordering incurs slightly longer latencies and throughput than the direct cub::WARP_STORE_DIRECT and cub::WARP_STORE_VECTORIZE alternatives.