Execution model

CUDA C++ aims to provide parallel forward progress [intro.progress.9] for all device threads of execution, facilitating the parallelization of pre-existing C++ applications.

[intro.progress]
  • [intro.progress.7]: For a thread of execution providing concurrent forward progress guarantees, the implementation ensures that the thread will eventually make progress for as long as it has not terminated.

    [Note 5: This applies regardless of whether or not other threads of execution (if any) have been or are making progress. To eventually fulfill this requirement means that this will happen in an unspecified but finite amount of time. — end note]

  • [intro.progress.9]: For a thread of execution providing parallel forward progress guarantees, the implementation is not required to ensure that the thread will eventually make progress if it has not yet executed any execution step; once this thread has executed a step, it provides concurrent forward progress guarantees.

    [Note 6: This does not specify a requirement for when to start this thread of execution, which will typically be specified by the entity that creates this thread of execution. For example, a thread of execution that provides concurrent forward progress guarantees and executes tasks from a set of tasks in an arbitrary order, one after the other, satisfies the requirements of parallel forward progress for these tasks. — end note]

The CUDA C++ programming language is an extension of the C++ programming language. This section documents CUDA C++'s modifications and extensions to the [intro.progress] section of the current working draft of ISO/IEC 14882, the C++ Standard. Modified clauses are called out explicitly and their differences are shown in bold; all other clauses are additions.

Host threads

The forward progress guarantees provided by the threads of execution that the host implementation creates to execute main, std::thread, and std::jthread are implementation-defined by the host implementation [intro.progress]. General-purpose host implementations should provide concurrent forward progress [intro.progress.7].

If the host implementation provides concurrent forward progress [intro.progress.7], then CUDA C++ provides parallel forward progress [intro.progress.9] for device threads.

Device threads

Once a device thread makes progress:

  • If it is part of a Cooperative Grid, all device threads in its grid shall eventually make progress.

  • Otherwise, all device threads in its thread-block cluster shall eventually make progress.

    [Note: Threads in other thread-block clusters are not guaranteed to eventually make progress. — end note]

    [Note: This implies that all device threads within its thread block shall eventually make progress. — end note]
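
The following sketch is our illustration of the two bullets above, not part of the normative text. In an ordinary launch, each thread block forms its own implicit thread-block cluster, so a block that spins waiting on another block may never be released; launching the same grid as a Cooperative Grid (for example, via cudaLaunchCooperativeKernel, on devices that support cooperative launch) extends the guarantee to the entire grid.

// Hedged sketch (ours, not normative): cross-block synchronization.
#include <cuda/atomic>

__device__ int gate = 0;

__global__ void cross_block() {
    cuda::atomic_ref<int, cuda::thread_scope_device> atom(gate);
    if (blockIdx.x == 0) {
        // Spins until block 1 stores to `gate`.
        while (atom.load(cuda::memory_order_relaxed) == 0) {}
    } else if (blockIdx.x == 1) {
        atom.store(1, cuda::memory_order_relaxed);
    }
}

int main() {
    // Ordinary launch: blocks 0 and 1 are in different (implicit) thread-block
    // clusters, so "no thread makes progress" is an allowed outcome:
    //   cross_block<<<2, 1>>>();
    // Cooperative launch (assumes the device supports cooperative launch):
    // the grid is a Cooperative Grid, so once any device thread makes
    // progress, all of them eventually do, and the spin-loop must be released.
    cudaLaunchCooperativeKernel((void*)cross_block, dim3(2), dim3(1),
                                nullptr /* no kernel parameters */, 0, 0);
    return (int)cudaDeviceSynchronize();
}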

Modify [intro.progress.1] as follows (modifications in bold):

The implementation may assume that any host thread will eventually do one of the following:

  1. terminate,

  2. invoke the function std::this_thread::yield ([thread.thread.this]),

  3. make a call to a library I/O function,

  4. perform an access through a volatile glvalue,

  5. perform a synchronization operation or an atomic operation, or

  6. continue execution of a trivial infinite loop ([stmt.iter.general]).

The implementation may assume that any device thread will eventually do one of the following:

  1. terminate,

  2. make a call to a library I/O function,

  3. perform an access through a volatile glvalue except if the designated object has automatic storage duration, or

  4. perform a synchronization operation or an atomic read operation except if the designated object has automatic storage duration.

[Note: Some of the current limitations of device threads relative to host threads are known implementation defects that may be fixed over time. Examples include the undefined behavior that arises when device threads eventually perform only volatile or atomic operations on objects with automatic storage duration. Other limitations of device threads relative to host threads, however, are intentional choices: they enable performance optimizations that would not be possible if device threads followed the C++ Standard strictly. For example, providing forward progress to programs that eventually perform only atomic writes or fences would degrade overall performance for little practical benefit. — end note]

Examples of forward progress guarantee differences between host and device threads due to modifications to [intro.progress.1].

The following examples refer to the itemized sub-clauses of the implementation assumptions for host and device threads above using “host.threads.<id>” and “device.threads.<id>”, respectively.

// Example: Execution.Model.Device.0
// Outcome: the grid eventually terminates per device.threads.4, because the
// atomic object does not have automatic storage duration.
#include <cuda/atomic>

__global__ void ex0(cuda::atomic_ref<int, cuda::thread_scope_device> atom) {
    if (threadIdx.x == 0) {
        while (atom.load(cuda::memory_order_relaxed) == 0);
    } else if (threadIdx.x == 1) {
        atom.store(1, cuda::memory_order_relaxed);
    }
}

// Example: Execution.Model.Device.1
// Allowed outcome: no thread makes progress, because device threads do not
// support host.threads.2.
__global__ void ex1() {
    while (true) cuda::std::this_thread::yield();
}

// Example: Execution.Model.Device.2
// Allowed outcome: no thread makes progress, because device threads do not
// support host.threads.4 for objects with automatic storage duration
// (see the exception in device.threads.3).
__global__ void ex2() {
    volatile bool True = true;
    while (True);
}

// Example: Execution.Model.Device.3
// Allowed outcome: no thread makes progress, because device threads do not
// support host.threads.5 for objects with automatic storage duration
// (see the exception in device.threads.4).
#include <cuda/atomic>

__global__ void ex3() {
    cuda::atomic<bool, cuda::thread_scope_thread> True = true;
    while (True.load());
}

// Example: Execution.Model.Device.4
// Allowed outcome: no thread makes progress, because device threads do not
// support host.threads.6.
__global__ void ex4() {
    while (true) { /* empty */ }
}
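
The note above also mentions device threads that eventually perform only atomic writes or fences. The following additional example is ours, with hypothetical numbering continuing the list above: device.threads.4 names atomic read operations only, so a loop that performs only atomic stores does not satisfy the implementation's assumptions.

// Example (ours, hypothetical numbering): Execution.Model.Device.5
// Allowed outcome: no thread makes progress, because the device thread
// eventually performs only atomic writes; device.threads.4 covers atomic
// *read* operations only, so the behavior is undefined and no progress is
// an allowed outcome.
#include <cuda/atomic>

__global__ void ex5(cuda::atomic_ref<int, cuda::thread_scope_device> atom) {
    while (true) atom.store(1, cuda::memory_order_relaxed);
}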

CUDA APIs

A CUDA API call shall eventually either return or ensure at least one device thread makes progress.

CUDA query functions (e.g., cudaStreamQuery, cudaEventQuery) shall not indefinitely return cudaErrorNotReady without a device thread making progress.

[Note: The device thread need not be “related” to the API call; e.g., an API operating on one stream or process may ensure progress of a device thread on another stream or process. — end note]

[Note: A simple but not sufficient method to test a program for CUDA API forward progress conformance is to run it with the environment variables CUDA_DEVICE_MAX_CONNECTIONS=1 and CUDA_LAUNCH_BLOCKING=1 set, and then check that the program still terminates. If it does not, the program has a bug. This method is not sufficient because it does not catch all forward progress bugs, but it does catch many of them. — end note]

Examples of CUDA API forward progress guarantees.

// Example: Execution.Model.API.1
// Outcome: if no other device threads (e.g., from other processes) are making
// progress, this program terminates and returns cudaSuccess.
// Rationale: CUDA guarantees that if the device is empty:
// - `cudaDeviceSynchronize` eventually ensures that at least one device thread
//   makes progress, which implies that the `hello_world` grid and one of its
//   device threads eventually start.
// - All threads in the thread block eventually start (because once a device
//   thread makes progress, all other threads in its thread-block cluster
//   eventually make progress).
// - Once all threads in the thread block arrive at the `__syncthreads` barrier,
//   all waiting threads are unblocked.
// - Therefore, all device threads eventually exit the `hello_world` grid.
// - And `cudaDeviceSynchronize` eventually unblocks.
__global__ void hello_world() { __syncthreads(); }
int main() {
    hello_world<<<1,2>>>();
    return (int)cudaDeviceSynchronize();
}

// Example: Execution.Model.API.2
// Allowed outcome: eventually, no thread makes progress.
// Rationale: `cudaDeviceSynchronize` below is only called if a device thread
// eventually makes progress and sets the flag. However, CUDA only guarantees
// that the `producer` device thread eventually starts if a synchronization
// API is called. Therefore, the host thread may never be unblocked from the
// flag spin-loop.
#include <cuda/atomic>

cuda::atomic<int, cuda::thread_scope_system> flag = 0;
__global__ void producer() { flag.store(1); }
int main() {
    cudaHostRegister(&flag, sizeof(flag));
    producer<<<1,1>>>();
    while (flag.load() == 0);
    return (int)cudaDeviceSynchronize();
}

// Example: Execution.Model.API.3
// Allowed outcome: eventually, no thread makes progress.
// Rationale: same as Execution.Model.API.2, with the addition that a single
// CUDA query API call does not guarantee that the device thread eventually
// starts; only repeated CUDA query API calls do (see Execution.Model.API.4).
#include <cuda/atomic>

cuda::atomic<int, cuda::thread_scope_system> flag = 0;
__global__ void producer() { flag.store(1); }
int main() {
    cudaHostRegister(&flag, sizeof(flag));
    producer<<<1,1>>>();
    (void)cudaStreamQuery(0);
    while (flag.load() == 0);
    return (int)cudaDeviceSynchronize();
}

// Example: Execution.Model.API.4
// Outcome: terminates.
// Rationale: same as Execution.Model.API.3, but this example repeatedly calls
// a CUDA query API within the flag spin-loop, which guarantees that the
// device thread eventually makes progress.
#include <cuda/atomic>

cuda::atomic<int, cuda::thread_scope_system> flag = 0;
__global__ void producer() { flag.store(1); }
int main() {
    cudaHostRegister(&flag, sizeof(flag));
    producer<<<1,1>>>();
    while (flag.load() == 0) {
        (void)cudaStreamQuery(0);
    }
    return (int)cudaDeviceSynchronize();
}

Dependencies

A device thread shall not start until all its dependencies have completed.

[Note: Dependencies that prevent device threads from starting to make progress can be created, for example, via CUDA Stream Commands. These may include dependencies on the completion of, among others, CUDA Events and CUDA Kernels. — end note]

Examples of CUDA API forward progress guarantees due to dependencies.

// Example: Execution.Model.Stream.0
// Allowed outcome: eventually, no thread makes progress.
// Rationale: while CUDA guarantees that one device thread makes progress,
// there is no dependency between `first` and `second`, so CUDA does not
// guarantee which thread that is. It could always pick the device thread
// from `second`, which then never unblocks from the spin-loop.
// That is, `second` may starve `first`.
#include <cuda/atomic>

cuda::atomic<int, cuda::thread_scope_system> flag = 0;
__global__ void first() { flag.store(1, cuda::memory_order_relaxed); }
__global__ void second() { while (flag.load(cuda::memory_order_relaxed) == 0) {} }
int main() {
    cudaHostRegister(&flag, sizeof(flag));
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    first<<<1,1,0,s0>>>();
    second<<<1,1,0,s1>>>();
    return (int)cudaDeviceSynchronize();
}

// Example: Execution.Model.Stream.1
// Outcome: terminates.
// Rationale: same as Execution.Model.Stream.0, but this example has a stream
// dependency between `first` and `second`, which requires CUDA to run the
// grids in order.
#include <cuda/atomic>

cuda::atomic<int, cuda::thread_scope_system> flag = 0;
__global__ void first() { flag.store(1, cuda::memory_order_relaxed); }
__global__ void second() { while (flag.load(cuda::memory_order_relaxed) == 0) {} }
int main() {
    cudaHostRegister(&flag, sizeof(flag));
    cudaStream_t s0;
    cudaStreamCreate(&s0);
    first<<<1,1,0,s0>>>();
    second<<<1,1,0,s0>>>();
    return (int)cudaDeviceSynchronize();
}
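
The note under Dependencies also mentions CUDA Events. The following example is our addition (the numbering is ours, not from the original set): it reworks Execution.Model.Stream.0 so that `second` waits, via `cudaStreamWaitEvent`, on an event recorded after `first`, which restores the cross-stream ordering and hence termination.

// Example (ours, hypothetical numbering): an event dependency across streams.
// Outcome: terminates.
// Rationale: `cudaStreamWaitEvent` makes all work submitted to `s1` after the
// wait depend on the completion of the event `e`, which is recorded after
// `first` on `s0`. `second` therefore cannot start before `first` completes.
#include <cuda/atomic>

cuda::atomic<int, cuda::thread_scope_system> flag = 0;
__global__ void first() { flag.store(1, cuda::memory_order_relaxed); }
__global__ void second() { while (flag.load(cuda::memory_order_relaxed) == 0) {} }
int main() {
    cudaHostRegister(&flag, sizeof(flag));
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    cudaEvent_t e;
    cudaEventCreate(&e);
    first<<<1,1,0,s0>>>();
    cudaEventRecord(e, s0);        // completion of `first` on s0
    cudaStreamWaitEvent(s1, e, 0); // s1 work may not start before `e` completes
    second<<<1,1,0,s1>>>();
    return (int)cudaDeviceSynchronize();
}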