`cuda::atomic_ref`

Defined in header <cuda/atomic>:

template <typename T, cuda::thread_scope Scope = cuda::thread_scope_system>
class cuda::atomic_ref;

The class template cuda::atomic_ref is an extended form of cuda::std::atomic_ref that takes an additional cuda::thread_scope argument, defaulted to cuda::std::thread_scope_system.

It has the same interface and semantics as cuda::std::atomic_ref, with the following additional operations. This class additionally deviates from the standard by being backported to C++11.

cuda::atomic_ref::fetch_min	Atomically find the minimum of the stored value and a provided value.
cuda::atomic_ref::fetch_max	Atomically find the maximum of the stored value and a provided value.

Limitations

cuda::atomic_ref<T> and cuda::std::atomic_ref<T> may only be instantiated with a T that are either 4 or 8 bytes.

No object or subobject of an object referenced by an atomic_ref shall be concurrently referenced by any other atomic_ref that has a different Scope.

For cuda::atomic_ref<T> and cuda::std::atomic_ref<T> the type T must satisfy the following:

sizeof(T) <= 8.
T must not have “padding bits”, i.e., T’s object representation must not have bits that do not participate in it’s value representation.

Concurrency Restrictions

See memory model documentation for general restrictions on atomicity.

With CUDA Compute Capability 6 (Pascal), an object of type atomic_ref may not be used:

with a reference to an object with a automatic storage duration in a GPU thread, or
if is_always_lock_free() is false.

For CUDA Compute Capability prior to 6 (Pascal), objects of type cuda::atomic_ref or cuda::std::atomic_ref may not be used.

Implementation-Defined Behavior

For each type T and cuda::thread_scope S, the value of cuda::atomic_ref<T, S>::is_always_lock_free() and cuda::std::atomic_ref<T>::is_always_lock_free() is as follows:

Type `T`	cuda::thread_scope `S`	`cuda::atomic_ref<T, S>::is_always_lock_free()`
Any valid type	Any thread scope	`sizeof(T) <= 8`

Types of T, where sizeof(T) < 4, are not natively supported by the underlying hardware. For these types atomic operations are emulated and will be drastically slower. Contention with contiguous memory in the current 4 byte boundary will be exacerbated. In these situations it is advisable to perform a hierarchical reduction to non-adjacent memory first.

Example

#include <cuda/atomic>

__global__ void example_kernel(int *gmem, int *pinned_mem) {
  // This atomic is suitable for all threads in the system.
  cuda::atomic_ref<int, cuda::thread_scope_system> a(*pinned_mem);

  // This atomic has the same type as the previous one (`a`).
  cuda::atomic_ref<int> b(*pinned_mem);

  // This atomic is suitable for all threads on the current processor (e.g. GPU).
  cuda::atomic_ref<int, cuda::thread_scope_device> c(*gmem);

  __shared__ int shared_v;
  // This atomic is suitable for threads in the same thread block.
  cuda::atomic_ref<int, cuda::thread_scope_block> d(shared_v);
}

See it on Godbolt

cuda::atomic_ref

Limitations

Concurrency Restrictions

Implementation-Defined Behavior

Example

`cuda::atomic_ref`