Device-Wide Primitives#
Determining Temporary Storage Requirements#
Two-Phase API (Traditional)
Most CUB device-wide algorithms follow a two-phase usage pattern:
Query Phase: Call the algorithm with
d_temp_storage = nullptrto determine the required temporary storage sizeExecution Phase: Allocate storage and call the algorithm again to perform the actual operation
What arguments are needed during the query phase?
Template instantiation: The query call must use the same template arguments as the execution call.
Argument access: Aside from
d_temp_storage,temp_storage_bytes, and the problem-size arguments, no parameters are accessed during the query phase, so their values may be indeterminate. The dispatch layer returns before launching kernels or touching user storage.Current device: The computed temporary storage size is valid only when the execution call runs on the same current CUDA device as the query. Re-run the query if the current device changes between phases.
Example pattern:
// Request and allocate temporary storage
void* d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
CubDebugExit(DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items));
CubDebugExit(g_allocator.DeviceAllocate(&d_temp_storage, temp_storage_bytes));
// Run
CubDebugExit(DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items));
Single-Phase API (Environment-Based)
Environment-based overloads are rolling out across CUB device-wide primitives. They remove the manual query/execute split by obtaining the temporary storage from a memory resource queried from the execution environment argument.
Key properties of the environment argument:
It is defaulted and appears as the last argument.
Streams can be specified with
cuda::get_streamproperties.You can select the memory resource (CCCL-provided or custom) used for internal allocations.
Supported algorithms accept determinism requirements (for example,
cuda::execution::determinism::gpu_to_gpu).Multiple properties compose into a single centralized argument.
Example (centralized control via a single environment argument):
#include <cub/device/device_reduce.cuh>
#include <cuda/std/execution>
#include <cuda/stream_ref>
#include <cuda/__memory_resource/get_memory_resource.h>
#include <cuda/__execution/determinism.h>
// Build an execution environment with stream, memory resource, and determinism
cudaStream_t stream = /* ... */;
auto stream_env = cuda::std::execution::prop{cuda::get_stream_t{}, cuda::stream_ref{stream}};
auto mr = /* CCCL-provided or user-defined device_memory_resource */;
auto mr_env = cuda::std::execution::prop{cuda::mr::__get_memory_resource_t{}, mr};
auto det_env = cuda::execution::require(cuda::execution::determinism::gpu_to_gpu);
auto env = cuda::std::execution::env{stream_env, mr_env, det_env};
// Single-phase API (no explicit temp storage, environment last and defaulted)
cub::DeviceReduce::Reduce(d_in, d_out, num_items, cuda::std::plus<>{}, init, env);
The remainder of this page focuses on the traditional two-phase pattern; see individual algorithm documentation for the availability and specifics of single-phase overloads.
CUB device-level single-problem parallel algorithms:
cub::DeviceAdjacentDifferencecomputes the difference between adjacent elements residing within device-accessible memorycub::DeviceForprovides device-wide, parallel operations for iterating over data residing within device-accessible memorycub::DeviceHistogramconstructs histograms from data samples residing within device-accessible memorycub::DevicePartitionpartitions data residing within device-accessible memorycub::DeviceMergemerges two sorted sequences in device-accessible memory into a single onecub::DeviceMergeSortsorts items residing within device-accessible memorycub::DeviceRadixSortsorts items residing within device-accessible memory using radix sorting methodcub::DeviceReducecomputes reduction of items residing within device-accessible memorycub::DeviceRunLengthEncodedemarcating “runs” of same-valued items within a sequence residing within device-accessible memorycub::DeviceScancomputes a prefix scan across a sequence of data items residing within device-accessible memorycub::DeviceSelectcompacts data residing within device-accessible memorycub::DeviceTopKfinds the largest (or smallest) K items from an unordered list residing within device-accessible memory
CUB device-level segmented-problem (batched) parallel algorithms:
cub::DeviceSegmentedSortcomputes batched sort across non-overlapping sequences of data residing within device-accessible memorycub::DeviceSegmentedRadixSortcomputes batched radix sort across non-overlapping sequences of data residing within device-accessible memorycub::DeviceSegmentedReducecomputes reductions across multiple sequences of data residing within device-accessible memorycub::DeviceCopyprovides device-wide, parallel operations for batched copying of data residing within device-accessible memorycub::DeviceMemcpyprovides device-wide, parallel operations for batched copying of data residing within device-accessible memory