BlockLoadAlgorithm#

enum cub::BlockLoadAlgorithm#

cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.

Values:

enumerator BLOCK_LOAD_DIRECT#

Overview#

A blocked arrangement of data is read directly from memory.

Performance Considerations#

The utilization of memory transactions (coalescing) decreases as the access stride between threads (i.e., the number of items per thread) increases.
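To make the stride argument concrete, here is a small host-side model (plain Python, not CUB code) of the blocked arrangement. The tiny `BLOCK_THREADS` and `ITEMS_PER_THREAD` values are illustrative, far smaller than a real thread block:

```python
# Model of the blocked arrangement produced by BLOCK_LOAD_DIRECT:
# thread t owns a contiguous run of ITEMS_PER_THREAD items.
BLOCK_THREADS = 4       # illustrative sizes only
ITEMS_PER_THREAD = 3

def blocked_indices(thread, items_per_thread=ITEMS_PER_THREAD):
    """Global item indices read by `thread` in a blocked arrangement."""
    base = thread * items_per_thread
    return [base + i for i in range(items_per_thread)]

# Thread 0 reads items 0..2, thread 1 reads items 3..5, and so on.
# In any single load step i, the threads touch indices
# {t * ITEMS_PER_THREAD + i}: a stride of ITEMS_PER_THREAD items between
# adjacent threads, so the larger ITEMS_PER_THREAD, the fewer of those
# accesses fall within the same memory transaction.
step0 = [blocked_indices(t)[0] for t in range(BLOCK_THREADS)]
```

With these sizes, `step0` is `[0, 3, 6, 9]`: adjacent threads are three items apart on every step.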

enumerator BLOCK_LOAD_STRIPED#

Overview#

A striped arrangement of data is read directly from memory.

Performance Considerations#

The utilization of memory transactions (coalescing) doesn’t depend on the number of items per thread.
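The same kind of host-side model (plain Python, not CUB code) shows why the striped arrangement coalesces independently of the items-per-thread count:

```python
# Model of the striped arrangement produced by BLOCK_LOAD_STRIPED:
# thread t's i-th item sits at index i * BLOCK_THREADS + t.
BLOCK_THREADS = 4       # illustrative sizes only
ITEMS_PER_THREAD = 3

def striped_indices(thread, block_threads=BLOCK_THREADS,
                    items_per_thread=ITEMS_PER_THREAD):
    """Global item indices read by `thread` in a striped arrangement."""
    return [i * block_threads + thread for i in range(items_per_thread)]

# In every load step i, adjacent threads read adjacent items
# {i * BLOCK_THREADS + t}: a unit stride between threads regardless of
# ITEMS_PER_THREAD, which is why coalescing does not degrade as the
# items-per-thread count grows.
step0 = [striped_indices(t)[0] for t in range(BLOCK_THREADS)]
```

Here `step0` is `[0, 1, 2, 3]`, and each later step is just as contiguous.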

enumerator BLOCK_LOAD_VECTORIZE#

Overview#

A blocked arrangement of data is read from memory using CUDA’s built-in vectorized loads as a coalescing optimization. For example, ld.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations#

  • The utilization of memory transactions (coalescing) remains high until the access stride between threads (i.e., the number of items per thread) exceeds the maximum vector load width (typically 4 items or 64B, whichever is lower).

  • The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT:

    • ITEMS_PER_THREAD is odd

    • The RandomAccessIterator is not a simple pointer type

    • The block input offset is not quadword-aligned

    • The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
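The fallback conditions above can be summarized as a single predicate. The following is a hypothetical host-side check (not CUB's actual implementation) that mirrors them; the 4-item vector width and 16-byte (quadword) alignment are assumptions taken from the typical case described above:

```python
# Hypothetical predicate mirroring the documented fallback conditions
# for BLOCK_LOAD_VECTORIZE; any False result means loading falls back
# to the BLOCK_LOAD_DIRECT path.
def can_vectorize(items_per_thread, byte_offset, is_pointer, is_builtin,
                  align_bytes=16):
    if items_per_thread % 2 != 0:       # ITEMS_PER_THREAD is odd
        return False
    if not is_pointer:                  # iterator is not a simple pointer
        return False
    if byte_offset % align_bytes != 0:  # input offset not quadword-aligned
        return False
    if not is_builtin:                  # T not a primitive/CUDA vector type
        return False
    return True
```

For example, an aligned `int*` with `ITEMS_PER_THREAD == 4` vectorizes, while the same configuration with an 8-byte input offset does not.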

enumerator BLOCK_LOAD_TRANSPOSE#

Overview#

A striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.

Performance Considerations#

  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.

  • The local reordering incurs slightly higher latency and lower throughput than the direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
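The load-then-transpose step can be modeled on the host (plain Python, not CUB code; the "shared memory" here is just a list standing in for the block's shared storage):

```python
# Model of BLOCK_LOAD_TRANSPOSE: each thread loads a striped slice
# (coalesced), the slices are exchanged through shared memory, and each
# thread ends up holding its blocked run.
BLOCK_THREADS = 4       # illustrative sizes only
ITEMS_PER_THREAD = 3
data = list(range(BLOCK_THREADS * ITEMS_PER_THREAD))

# Step 1: striped, coalesced reads from "global memory".
striped = [[data[i * BLOCK_THREADS + t] for i in range(ITEMS_PER_THREAD)]
           for t in range(BLOCK_THREADS)]

# Step 2: scatter each item to shared memory at its global rank...
shared = [0] * len(data)
for t in range(BLOCK_THREADS):
    for i in range(ITEMS_PER_THREAD):
        shared[i * BLOCK_THREADS + t] = striped[t][i]

# Step 3: ...then gather back out in blocked order.
blocked = [shared[t * ITEMS_PER_THREAD:(t + 1) * ITEMS_PER_THREAD]
           for t in range(BLOCK_THREADS)]
```

The extra scatter/gather round trip through shared memory is exactly the source of the added latency noted above.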

enumerator BLOCK_LOAD_WARP_TRANSPOSE#

Overview#

A warp-striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.

Usage Considerations#

  • BLOCK_THREADS must be a multiple of WARP_THREADS

Performance Considerations#

  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.

  • The local reordering incurs slightly larger latencies than the direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.

  • Provisions more shared storage, but incurs smaller latencies than the BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED alternative.
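The warp-striped arrangement differs from the plain striped one in that each warp strides only within its own tile. A small host-side model (plain Python, not CUB code; a real warp has 32 threads, the sizes here are illustrative):

```python
# Model of the warp-striped arrangement read by BLOCK_LOAD_WARP_TRANSPOSE:
# each warp owns a contiguous tile of WARP_THREADS * ITEMS_PER_THREAD
# items, and within the tile consecutive lanes read consecutive items.
WARP_THREADS = 4        # illustrative; real warps have 32 threads
BLOCK_THREADS = 8       # must be a multiple of WARP_THREADS
ITEMS_PER_THREAD = 2

def warp_striped_indices(thread):
    """Global item indices read by `thread` in a warp-striped arrangement."""
    warp, lane = divmod(thread, WARP_THREADS)
    warp_offset = warp * WARP_THREADS * ITEMS_PER_THREAD
    return [warp_offset + i * WARP_THREADS + lane
            for i in range(ITEMS_PER_THREAD)]
```

Each load step is unit-stride within a warp, so coalescing holds regardless of `ITEMS_PER_THREAD`, and the subsequent transpose to a blocked arrangement only needs to exchange data among the lanes of one warp.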

enumerator BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED#

Overview#

Like BLOCK_LOAD_WARP_TRANSPOSE, a warp-striped arrangement of data is read directly from memory and then locally transposed into a blocked arrangement. To reduce the shared memory requirement, only one warp’s worth of shared memory is provisioned and is subsequently time-sliced among warps.

Usage Considerations#

  • BLOCK_THREADS must be a multiple of WARP_THREADS

Performance Considerations#

  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.

  • Provisions less shared memory for temporary storage, but incurs larger latencies than the BLOCK_LOAD_WARP_TRANSPOSE alternative.
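The space-for-latency trade can be sketched with the same kind of host-side model (plain Python, not CUB code): only one warp's worth of "shared memory" is allocated, and the warps take turns using it.

```python
# Model of the time-sliced variant: shared storage shrinks from
# BLOCK_THREADS * ITEMS_PER_THREAD items to WARP_THREADS * ITEMS_PER_THREAD
# items, at the cost of one serialized exchange round per warp.
WARP_THREADS = 4        # illustrative sizes only
NUM_WARPS = 2
ITEMS_PER_THREAD = 2
data = list(range(WARP_THREADS * NUM_WARPS * ITEMS_PER_THREAD))

shared = [0] * (WARP_THREADS * ITEMS_PER_THREAD)  # one warp's worth only
blocked = {}
rounds = 0
for warp in range(NUM_WARPS):                     # warps time-slice `shared`
    base = warp * WARP_THREADS * ITEMS_PER_THREAD
    # Warp-striped scatter into the single shared buffer...
    for lane in range(WARP_THREADS):
        for i in range(ITEMS_PER_THREAD):
            shared[i * WARP_THREADS + lane] = data[base + i * WARP_THREADS + lane]
    # ...and blocked gather back out before the next warp reuses it.
    for lane in range(WARP_THREADS):
        t = warp * WARP_THREADS + lane
        blocked[t] = shared[lane * ITEMS_PER_THREAD:(lane + 1) * ITEMS_PER_THREAD]
    rounds += 1
```

Every thread still ends up with its blocked run, but the exchange now takes `NUM_WARPS` sequential rounds instead of one, which is the extra latency noted above.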