cub::WarpLoadAlgorithm

Defined in cub/warp/warp_load.cuh

enum cub::WarpLoadAlgorithm

cub::WarpLoadAlgorithm enumerates alternative algorithms for cub::WarpLoad to read a linear segment of data from memory into a CUDA warp.

Values:

enumerator WARP_LOAD_DIRECT

Overview

A blocked arrangement of data is read directly from memory.

Performance Considerations

The utilization of memory transactions (coalescing) decreases as the access stride between threads (i.e., the number of items per thread) increases.
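To make the stride concrete, the following host-side sketch models which linear offset each lane reads in a blocked arrangement. The helpers blocked_offset and blocked_lane_stride are hypothetical names introduced for illustration; they are not part of CUB and only reproduce the index arithmetic, not the device code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CUB function): linear offset of the element
// that `lane` holds as its `item`-th value in a blocked arrangement.
// Each lane owns a contiguous run of `items_per_thread` elements.
std::size_t blocked_offset(std::size_t lane, std::size_t item,
                           std::size_t items_per_thread)
{
    assert(item < items_per_thread);
    return lane * items_per_thread + item;
}

// Distance, in elements, between the addresses that two neighbouring lanes
// touch when they read the same item index. It equals items_per_thread,
// which is why coalescing degrades as items_per_thread grows.
std::size_t blocked_lane_stride(std::size_t items_per_thread)
{
    assert(items_per_thread > 0);
    return blocked_offset(1, 0, items_per_thread) -
           blocked_offset(0, 0, items_per_thread);
}
```

With 4 items per thread, lane 0 reads offsets 0 to 3 and lane 1 reads offsets 4 to 7, so neighbouring lanes are 4 elements apart on every load.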

enumerator WARP_LOAD_STRIPED

Overview

A striped arrangement of data is read directly from memory.

Performance Considerations

The utilization of memory transactions (coalescing) doesn’t depend on the number of items per thread.
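The independence from items per thread can be seen in a small host-side model of the striped index mapping. The names striped_offset, striped_lane_stride, and warp_threads are assumptions made for this sketch and do not appear in CUB.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a CUB function): linear offset of the element
// that `lane` holds as its `item`-th value in a striped arrangement.
// Consecutive items of one lane are `warp_threads` elements apart.
std::size_t striped_offset(std::size_t lane, std::size_t item,
                           std::size_t warp_threads)
{
    assert(lane < warp_threads);
    return item * warp_threads + lane;
}

// Distance, in elements, between the addresses that two neighbouring lanes
// touch when they read the same item index. It is always 1, whatever the
// number of items per thread.
std::size_t striped_lane_stride(std::size_t warp_threads)
{
    assert(warp_threads > 1);
    return striped_offset(1, 0, warp_threads) -
           striped_offset(0, 0, warp_threads);
}
```

For a 32-thread warp, lane 5 reads offsets 5, 37, 69, and so on, while on each individual load the 32 lanes touch 32 adjacent elements.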

enumerator WARP_LOAD_VECTORIZE

Overview

A blocked arrangement of data is read from memory using CUDA’s built-in vectorized loads as a coalescing optimization. For example, ld.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations

  • The utilization of memory transactions (coalescing) remains high until the access stride between threads (i.e., the number of items per thread) exceeds the maximum vector load width (typically 4 items or 64B, whichever is lower).

  • Any of the following conditions prevents vectorization, in which case loading falls back to cub::WARP_LOAD_DIRECT:

    • ITEMS_PER_THREAD is odd

    • The InputIteratorT is not a simple pointer type

    • The block input offset is not quadword-aligned

    • The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
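The fallback conditions listed above can be summarized as a single predicate. The sketch below is a host-side model only: LoadRequest and may_vectorize are hypothetical names, the type and iterator checks are reduced to boolean flags, and "quadword-aligned" is assumed to mean 16-byte alignment. It does not reflect how CUB performs the check internally.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical description of a load request (not a CUB type).
struct LoadRequest
{
    std::size_t    items_per_thread;                 // ITEMS_PER_THREAD
    bool           iterator_is_raw_pointer;          // InputIteratorT is a simple pointer
    std::uintptr_t input_address;                    // address of the first item to load
    bool           type_is_primitive_or_cuda_vector; // T is e.g. short, int2, double, float2
};

// Returns true when none of the documented fallback conditions applies.
bool may_vectorize(const LoadRequest& r)
{
    // Assumption: a quadword is 16 bytes.
    const std::uintptr_t quadword_bytes = 16;

    return r.items_per_thread % 2 == 0              // ITEMS_PER_THREAD is not odd
        && r.iterator_is_raw_pointer                // simple pointer type
        && r.input_address % quadword_bytes == 0    // quadword-aligned input
        && r.type_is_primitive_or_cuda_vector;      // supported data type
}
```

Under these assumptions, a request with 4 items per thread, a raw pointer, a 16-byte-aligned address, and T = int passes the predicate, while changing any one of those properties (for example, 3 items per thread) selects the direct path.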

enumerator WARP_LOAD_TRANSPOSE

Overview

A striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.

Performance Considerations

  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.

  • The local reordering incurs slightly higher latency and lower throughput than the direct cub::WARP_LOAD_DIRECT and cub::WARP_LOAD_VECTORIZE alternatives.
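The two phases described above, a striped read followed by a local transpose, can be modelled on the host as follows. Registers, load_striped, and striped_to_blocked are hypothetical names used only for this sketch; on the device the exchange is performed cooperatively by the lanes of the warp, and this model reproduces the index arithmetic rather than that mechanism.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of per-lane registers: regs[lane][item].
using Registers = std::vector<std::vector<int>>;

// Phase 1: striped read. On each item index, neighbouring lanes touch
// neighbouring elements, so every load is coalesced.
Registers load_striped(const std::vector<int>& input,
                       std::size_t warp_threads,
                       std::size_t items_per_thread)
{
    assert(input.size() >= warp_threads * items_per_thread);
    Registers regs(warp_threads, std::vector<int>(items_per_thread));
    for (std::size_t item = 0; item < items_per_thread; ++item)
        for (std::size_t lane = 0; lane < warp_threads; ++lane)
            regs[lane][item] = input[item * warp_threads + lane];
    return regs;
}

// Phase 2: local transpose. Each lane ends up with the contiguous run
// of elements it would own in a blocked arrangement.
Registers striped_to_blocked(const Registers& striped)
{
    const std::size_t warp_threads = striped.size();
    const std::size_t items = warp_threads ? striped[0].size() : 0;
    Registers blocked(warp_threads, std::vector<int>(items));
    for (std::size_t lane = 0; lane < warp_threads; ++lane)
        for (std::size_t item = 0; item < items; ++item)
        {
            // Linear offset this lane needs in the blocked layout.
            const std::size_t offset = lane * items + item;
            // In the striped layout that offset is held by another lane.
            blocked[lane][item] = striped[offset % warp_threads][offset / warp_threads];
        }
    return blocked;
}
```

For an 8-element input 0..7 with 4 lanes and 2 items per thread, the striped phase gives lane 1 the values {1, 5}, and after the transpose lane 1 holds {2, 3}, which matches the blocked layout.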