cub::BlockLoadAlgorithm

Defined in cub/block/block_load.cuh

enum cub::BlockLoadAlgorithm

cub::BlockLoadAlgorithm enumerates alternative algorithms for cub::BlockLoad to read a linear segment of data from memory into a blocked arrangement across a CUDA thread block.
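The algorithm is supplied as a template parameter when specializing cub::BlockLoad. A minimal usage sketch (the kernel name, block size of 128 threads, and 4 items per thread are illustrative choices, not part of the API):

```cuda
#include <cub/block/block_load.cuh>

// Illustrative kernel: each 128-thread block loads 512 consecutive ints
// into a blocked arrangement (4 consecutive items per thread).
__global__ void LoadKernel(int *d_in)
{
    // Specialize BlockLoad for the data type, block size, items per
    // thread, and the chosen BlockLoadAlgorithm
    using BlockLoadT = cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_WARP_TRANSPOSE>;

    // Allocate the shared memory the algorithm may require
    __shared__ typename BlockLoadT::TempStorage temp_storage;

    int thread_data[4];
    BlockLoadT(temp_storage).Load(d_in + blockIdx.x * 512, thread_data);

    // thread_data[] now holds this thread's 4 consecutive items
}
```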

Values:

enumerator BLOCK_LOAD_DIRECT

Overview

A blocked arrangement of data is read directly from memory.

Performance Considerations

The utilization of memory transactions (coalescing) decreases as the access stride between threads (i.e., the number of items per thread) increases.

enumerator BLOCK_LOAD_STRIPED

Overview

A striped arrangement of data is read directly from memory.

Performance Considerations

The utilization of memory transactions (coalescing) doesn’t depend on the number of items per thread.
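The distinction between the two arrangements is the per-thread access stride, which can be seen in CUB's underlying direct-load utilities. A sketch (the 128-thread block size is illustrative):

```cuda
#include <cub/block/block_load.cuh>

__global__ void DirectVsStriped(int *d_in)
{
    int items[4];

    // Blocked: thread t reads items [t*4, t*4+4). The stride between
    // threads equals ITEMS_PER_THREAD, so coalescing degrades as the
    // number of items per thread grows.
    cub::LoadDirectBlocked(threadIdx.x, d_in, items);

    // Striped: thread t reads items {t, t+128, t+256, t+384}. Adjacent
    // threads always touch adjacent words, so coalescing is unaffected
    // by the number of items per thread.
    cub::LoadDirectStriped<128>(threadIdx.x, d_in, items);
}
```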

enumerator BLOCK_LOAD_VECTORIZE

Overview

A blocked arrangement of data is read from memory using CUDA’s built-in vectorized loads as a coalescing optimization. For example, ld.global.v4.s32 instructions will be generated when T = int and ITEMS_PER_THREAD % 4 == 0.

Performance Considerations

  • The utilization of memory transactions (coalescing) remains high until the access stride between threads (i.e., the number of items per thread) exceeds the maximum vector load width (typically 4 items or 64B, whichever is lower).

  • The following conditions will prevent vectorization and loading will fall back to cub::BLOCK_LOAD_DIRECT:

    • ITEMS_PER_THREAD is odd

    • The RandomAccessIterator is not a simple pointer type

    • The block input offset is not quadword-aligned

    • The data type T is not a built-in primitive or CUDA vector type (e.g., short, int2, double, float2, etc.)
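A configuration intended to satisfy those conditions might look like the following sketch (pointer alignment is the caller's responsibility; the kernel name and sizes are illustrative):

```cuda
#include <cub/block/block_load.cuh>

// d_in is a raw pointer (not a fancy iterator), satisfying one condition
__global__ void VectorizedLoad(int *d_in)
{
    // T = int is a built-in primitive and ITEMS_PER_THREAD % 4 == 0, so
    // ld.global.v4.s32 can be generated -- provided each block's input
    // offset stays quadword-aligned (here 128 threads * 4 ints = 512
    // ints per block, which preserves 16-byte alignment of d_in)
    using BlockLoadT = cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_VECTORIZE>;
    __shared__ typename BlockLoadT::TempStorage temp_storage;

    int thread_data[4];
    BlockLoadT(temp_storage).Load(d_in + blockIdx.x * 512, thread_data);
}
```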

enumerator BLOCK_LOAD_TRANSPOSE

Overview

A striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.

Performance Considerations

  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.

  • The local reordering incurs slightly higher latencies and lower throughput than the direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.
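Because the striped reads are coalesced regardless of problem shape, this algorithm also pairs naturally with BlockLoad's guarded Load() overload for partial tiles. A sketch (block size, tile size, and the -1 fill value are illustrative):

```cuda
#include <cub/block/block_load.cuh>

__global__ void GuardedTransposeLoad(int *d_in, int num_items)
{
    using BlockLoadT = cub::BlockLoad<int, 128, 4, cub::BLOCK_LOAD_TRANSPOSE>;
    __shared__ typename BlockLoadT::TempStorage temp_storage;

    int thread_data[4];
    int block_offset = blockIdx.x * 512;                // 128 threads * 4 items
    int valid_items  = min(512, num_items - block_offset);

    // Guarded load: out-of-range items are filled with the default (-1)
    BlockLoadT(temp_storage).Load(d_in + block_offset, thread_data,
                                  valid_items, -1);
}
```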

enumerator BLOCK_LOAD_WARP_TRANSPOSE

Overview

A warp-striped arrangement of data is read efficiently from memory and then locally transposed into a blocked arrangement.

Usage Considerations

  • BLOCK_THREADS must be a multiple of WARP_THREADS

Performance Considerations

  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.

  • The local reordering incurs slightly higher latencies than the direct cub::BLOCK_LOAD_DIRECT and cub::BLOCK_LOAD_VECTORIZE alternatives.

  • Provisions more shared memory storage, but incurs lower latencies than the BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED alternative.
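The block-size constraint can be enforced at compile time. A sketch (the template parameterization and static_assert are illustrative; 32 stands in for WARP_THREADS):

```cuda
#include <cub/block/block_load.cuh>

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void WarpTransposeLoad(int *d_in)
{
    // The algorithm requires a whole number of warps per block
    static_assert(BLOCK_THREADS % 32 == 0,
                  "BLOCK_THREADS must be a multiple of WARP_THREADS");

    using BlockLoadT = cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD,
                                      cub::BLOCK_LOAD_WARP_TRANSPOSE>;
    __shared__ typename BlockLoadT::TempStorage temp_storage;

    int thread_data[ITEMS_PER_THREAD];
    BlockLoadT(temp_storage).Load(
        d_in + blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD, thread_data);
}
```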

enumerator BLOCK_LOAD_WARP_TRANSPOSE_TIMESLICED

Overview

Like BLOCK_LOAD_WARP_TRANSPOSE, a warp-striped arrangement of data is read directly from memory and then locally transposed into a blocked arrangement. To reduce the shared memory requirement, only one warp’s worth of shared memory is provisioned and is time-sliced among the warps.

Usage Considerations

  • BLOCK_THREADS must be a multiple of WARP_THREADS

Performance Considerations

  • The utilization of memory transactions (coalescing) remains high regardless of items loaded per thread.

  • Provisions less shared memory temporary storage, but incurs higher latencies than the BLOCK_LOAD_WARP_TRANSPOSE alternative.