Device-Wide Primitives#
Almost all of CUB’s device-wide APIs come in two flavors:
the traditional two-phase style that requires calling the API twice and managing temporary storage explicitly,
and the newer single-phase style where temporary storage is obtained from a memory resource in the execution environment.
Some APIs that do not require any temporary storage may have a traditional single-phase overload in addition to a newer environment one.
Two-Phase API (explicit temporary storage management)#
Traditional two-phase APIs can be recognized by their first two parameters: void* d_temp_storage and size_t& temp_storage_bytes.
They follow a two-phase usage pattern consisting of three steps:

1. Query phase: The algorithm is called a first time with d_temp_storage = nullptr to determine the required temporary storage size. The needed size in bytes is written to the parameter temp_storage_bytes.
2. Temporary storage allocation: The user is responsible for allocating device memory of at least temp_storage_bytes bytes.
3. Execution phase: The algorithm is called a second time with d_temp_storage pointing to the allocated device memory, performing the actual operation.
In principle, the query phase and execution phase must call the same CUB API. This means in detail:
- Template arguments: The query call must use the same template arguments as the execution call, so they share the same template instantiation.
- Argument values: Of the function parameters, only d_temp_storage, temp_storage_bytes, and problem-size-related arguments (like the number of elements, number of segments, segment sizes, etc.) may be read during the query phase. No other parameters (like input/output iterators, initial values, etc.) are accessed during the query phase, so their values may be indeterminate. During the query phase, the API returns before launching any kernels or touching user storage.
- Current device: The computed temporary storage size is valid only when the execution phase runs on the same current CUDA device as the query. Re-run the query if the current device changes between phases.
Example pattern:
// Request and allocate temporary storage
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
CubDebugExit(DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items));
CubDebugExit(g_allocator.DeviceAllocate(&d_temp_storage, temp_storage_bytes));
// Run
CubDebugExit(DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items));
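Note that g_allocator in the snippet above is a helper from CUB's example harness, not part of the library API. For reference, a minimal self-contained sketch of the same pattern using plain CUDA runtime allocation might look like this (the function name is illustrative; error handling is abbreviated):

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

cudaError_t SumWithExplicitStorage(const int* d_in, int* d_out, int num_items)
{
    // Query phase: with d_temp_storage == nullptr, only the required size
    // is computed and written to temp_storage_bytes; no kernel is launched
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;
    cudaError_t err = cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                                             d_in, d_out, num_items);
    if (err != cudaSuccess) return err;

    // Allocate at least temp_storage_bytes bytes of device memory
    err = cudaMalloc(&d_temp_storage, temp_storage_bytes);
    if (err != cudaSuccess) return err;

    // Execution phase: same template instantiation, same current device
    err = cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes,
                                 d_in, d_out, num_items);

    cudaFree(d_temp_storage);
    return err;
}
```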
Environment API (single phase)#
Environment-based overloads are available for all CUB device-wide algorithms. They remove the query/execute split and the manual handling of temporary storage: instead, temporary storage is automatically requested from a memory resource queried from the execution environment argument. Beyond a memory resource, the environment supports further properties, such as a stream or an execution requirement.
Key properties of the environment argument:
- It is a defaulted parameter and appears as the last argument.
- Streams like cudaStream_t or cuda::stream_ref can be passed as environments directly, or added to the environment.
- You can select the memory resource (CCCL-provided or custom) used for internal allocations.
- Supported algorithms accept determinism requirements (for example, cuda::execution::determinism::gpu_to_gpu).
- Multiple properties compose into a single centralized argument by wrapping them into a cuda::execution::env object.
Example pattern:
// Setup device, stream, memory resource, determinism
auto device = cuda::devices[device_ordinal];
auto stream = cuda::stream{device};
auto memory_resource = cuda::device_default_memory_pool(device);
auto determinism = cuda::execution::require(cuda::execution::determinism::run_to_run);
// Create environment
auto env = cuda::std::execution::env{cuda::stream_ref{stream}, memory_resource, determinism};
// Run
CubDebugExit(cub::DeviceReduce::Sum(d_in.data(), d_out.data(), num_items, env));
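When only a stream needs to be specified, it can be passed directly as the environment argument without building an env object first. A minimal sketch, assuming a CCCL version that provides the environment overloads (the wrapper function is illustrative):

```cuda
#include <cub/cub.cuh>
#include <cuda/stream_ref>
#include <cuda_runtime.h>

// Run the reduction on a specific stream by passing the stream itself as
// the environment; temporary storage is obtained from the environment's
// (default) memory resource, so no query phase is needed.
void SumOnStream(const int* d_in, int* d_out, int num_items, cudaStream_t stream)
{
    cub::DeviceReduce::Sum(d_in, d_out, num_items, cuda::stream_ref{stream});
}
```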
API overview#
The following lists the groups of CUB device-wide algorithms, with links to their respective documentation.
CUB device-level single-problem parallel algorithms:
- cub::DeviceAdjacentDifference computes the difference between adjacent elements residing within device-accessible memory
- cub::DeviceFor provides device-wide, parallel operations for iterating over data residing within device-accessible memory
- cub::DeviceHistogram constructs histograms from data samples residing within device-accessible memory
- cub::DevicePartition partitions data residing within device-accessible memory
- cub::DeviceMerge merges two sorted sequences in device-accessible memory into a single one
- cub::DeviceMergeSort sorts items residing within device-accessible memory
- cub::DeviceRadixSort sorts items residing within device-accessible memory using the radix sorting method
- cub::DeviceReduce computes reductions of items residing within device-accessible memory
- cub::DeviceRunLengthEncode demarcates "runs" of same-valued items within a sequence residing within device-accessible memory
- cub::DeviceScan computes a prefix scan across a sequence of data items residing within device-accessible memory
- cub::DeviceSelect compacts data residing within device-accessible memory
- cub::DeviceTransform transforms elements from multiple input sequences into an output sequence
- cub::DeviceTopK finds the largest (or smallest) K items from an unordered list residing within device-accessible memory
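As a concrete instance of the two-phase pattern with one of the algorithms above, an exclusive prefix sum with cub::DeviceScan might be sketched as follows (error handling omitted for brevity; the wrapper function name is illustrative):

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Exclusive prefix sum over num_items ints:
// d_out[i] = d_in[0] + ... + d_in[i - 1], with d_out[0] = 0
void ExclusiveSum(const int* d_in, int* d_out, int num_items)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // Query phase: determine the temporary storage requirement
    cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Execution phase: run the scan
    cub::DeviceScan::ExclusiveSum(d_temp_storage, temp_storage_bytes,
                                  d_in, d_out, num_items);
    cudaFree(d_temp_storage);
}
```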
CUB device-level segmented-problem (batched) parallel algorithms:
- cub::DeviceSegmentedSort computes batched sorts across non-overlapping sequences of data residing within device-accessible memory
- cub::DeviceSegmentedRadixSort computes batched radix sorts across non-overlapping sequences of data residing within device-accessible memory
- cub::DeviceSegmentedReduce computes reductions across multiple sequences of data residing within device-accessible memory
- cub::DeviceCopy provides device-wide, parallel operations for batched copying of data residing within device-accessible memory
- cub::DeviceMemcpy provides device-wide, parallel operations for batched copying of data residing within device-accessible memory
- cub::DeviceFind provides vectorized binary search algorithms
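The segmented algorithms describe their batches via offset iterators. As an illustration, a per-segment sum with cub::DeviceSegmentedReduce might be sketched like this, where segment i spans [d_offsets[i], d_offsets[i + 1]) in the input (error handling omitted; the wrapper function is illustrative):

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// Reduce each of num_segments non-overlapping segments of d_in to one sum.
// d_offsets has num_segments + 1 entries; d_out has num_segments entries.
void SegmentedSums(const int* d_in, int* d_out,
                   int num_segments, const int* d_offsets)
{
    void*  d_temp_storage     = nullptr;
    size_t temp_storage_bytes = 0;

    // Query phase: compute the temporary storage requirement
    cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, num_segments,
                                    d_offsets, d_offsets + 1);
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // Execution phase: one reduction per segment
    cub::DeviceSegmentedReduce::Sum(d_temp_storage, temp_storage_bytes,
                                    d_in, d_out, num_segments,
                                    d_offsets, d_offsets + 1);
    cudaFree(d_temp_storage);
}
```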