CUB Determinism
Several CUB device algorithms let you request a reproducibility guarantee for a call. The concepts
behind the three guarantees — not_guaranteed, run_to_run, and gpu_to_gpu — and the meaning
of reproducibility are described in the CCCL determinism overview. This
page documents how to request a guarantee for a CUB algorithm and which algorithms support which
guarantees.
Requesting a guarantee
A determinism guarantee is passed to a device algorithm through its execution environment using
cuda::execution::require. The example below requests run-to-run reproducibility for
cub::DeviceReduce::Sum:
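// Four input values on the device; the single-element output receives their sum.
auto input  = thrust::device_vector<float>{0.0f, 1.0f, 2.0f, 3.0f};
auto output = thrust::device_vector<float>(1);
// Build an execution environment that requires run-to-run determinism.
auto env = cuda::execution::require(cuda::execution::determinism::run_to_run);
// Pass the environment as the last argument of the algorithm call.
auto error = cub::DeviceReduce::Sum(input.begin(), output.begin(), input.size(), env);
if (error != cudaSuccess)
{
  std::cerr << "cub::DeviceReduce::Sum failed with status: " << error << '\n';
}
// Expected result: 0 + 1 + 2 + 3 = 6.
thrust::device_vector<float> expected{6.0f};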
The general rules for requesting a guarantee are described in the CCCL determinism overview.
Each CUB algorithm has its own default guarantee, applied when none is requested, and its own type and operator constraints for each guarantee, summarized below.
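To see why a floating-point reduction such as the Sum call above may need a reproducibility guarantee, recall that floating-point addition is not associative, so the order in which values are combined can change the bits of the result. The host-only sketch below illustrates this under stated assumptions; it does not describe how CUB implements any guarantee. It assumes IEEE 754 binary32 float arithmetic with no fast-math reordering, and the helper names (bits_of, bitwise_equal, sum_forward, sum_backward) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: view a float's storage as a 32-bit pattern.
std::uint32_t bits_of(float x)
{
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
  std::uint32_t b;
  std::memcpy(&b, &x, sizeof b);
  return b;
}

// Hypothetical helper: true only if both buffers have the same length and
// every element has exactly the same bit pattern.
bool bitwise_equal(const std::vector<float>& a, const std::vector<float>& b)
{
  return a.size() == b.size()
      && (a.empty() || std::memcmp(a.data(), b.data(), a.size() * sizeof(float)) == 0);
}

// Accumulate from the first element to the last.
float sum_forward(const std::vector<float>& v)
{
  float acc = 0.0f;
  for (float x : v)
  {
    acc += x;
  }
  return acc;
}

// Accumulate from the last element to the first.
float sum_backward(const std::vector<float>& v)
{
  float acc = 0.0f;
  for (auto it = v.rbegin(); it != v.rend(); ++it)
  {
    acc += *it;
  }
  return acc;
}
```

With the input {1.0e8f, -1.0e8f, 1.0f}, sum_forward returns 1.0f while sum_backward returns 0.0f under those assumptions, because 1.0f is absorbed when it is added to -1.0e8f. bitwise_equal is also stricter than operator==: it distinguishes 0.0f from -0.0f, which compare equal by value.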
Support matrix
| Algorithm |  |  |  | Default |
|---|---|---|---|---|
|  | Yes | Yes | Yes (partial) |  |
|  | Yes | Yes (partial) | Yes (partial) |  |
|  | Yes | Yes | No |  |
Note
The set of algorithms that accept determinism requirements, and the type/operator constraints for each guarantee, are expanding over time. The matrix above reflects the current implementation.
Algorithm-specific determinism models
The three guarantees describe the scope of reproducibility and fit most algorithms, where a reproducible result means a bitwise-identical output. A few algorithms still use the same three levels but extend the model with additional, algorithm-specific controls, documented on their own pages:
cub::DeviceTopK: determinism applies to set membership (which K items are selected) rather than to a bitwise-identical buffer, and it adds tie-breaking (cuda::execution::tie_break) and output-ordering (cuda::execution::output_ordering) controls.
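The difference between reproducible set membership and a bitwise-identical buffer can be illustrated on the host. The sketch below is neither the cub::DeviceTopK API nor its implementation: top_k_reference, same_membership, the Item alias, and the "prefer the smaller index" tie rule are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// (key, original index); the index gives tied keys a distinct identity.
using Item = std::pair<int, std::size_t>;

// Hypothetical host-side reference: select the K largest keys, breaking ties
// by preferring the smaller original index. The result is ordered by that rule.
std::vector<Item> top_k_reference(const std::vector<int>& keys, std::size_t k)
{
  std::vector<Item> items;
  items.reserve(keys.size());
  for (std::size_t i = 0; i < keys.size(); ++i)
  {
    items.emplace_back(keys[i], i);
  }
  k = std::min(k, items.size());
  std::partial_sort(
    items.begin(), items.begin() + static_cast<std::ptrdiff_t>(k), items.end(),
    [](const Item& a, const Item& b) {
      return a.first != b.first ? a.first > b.first : a.second < b.second;
    });
  items.resize(k);
  return items;
}

// Hypothetical helper: true if both selections contain the same items,
// regardless of the order in which they appear in the buffer.
bool same_membership(std::vector<Item> a, std::vector<Item> b)
{
  std::sort(a.begin(), a.end());
  std::sort(b.begin(), b.end());
  return a == b;
}
```

For keys {5, 3, 5, 1, 5} and K = 2, three items tie on the key 5. Under this sketch's tie rule the selection is the items at indices 0 and 2. The buffer {(5, 2), (5, 0)} has the same membership but is a different buffer, while {(5, 0), (5, 4)} is a different selection altogether.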