cuda::atomic_ref#
Defined in header <cuda/atomic>:
template <typename T, cuda::thread_scope Scope = cuda::thread_scope_system>
class cuda::atomic_ref;
The class template cuda::atomic_ref is an extended form of cuda::std::atomic_ref
that takes an additional cuda::thread_scope argument, defaulted to
cuda::thread_scope_system.
It has the same interface and semantics as cuda::std::atomic_ref, with the following additional operations. This class additionally deviates from the standard by being backported to C++11.
| fetch_min | Atomically find the minimum of the stored value and a provided value. |
| fetch_max | Atomically find the maximum of the stored value and a provided value. |
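For example, a minimal sketch using these operations to fold per-thread values into a device-wide running minimum and maximum (the kernel name and parameters are illustrative; min_out and max_out are assumed to be pre-initialized to INT_MAX and INT_MIN, respectively):

#include <cuda/atomic>

__global__ void min_max_kernel(const int* in, int* min_out, int* max_out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  cuda::atomic_ref<int, cuda::thread_scope_device> mn(*min_out);
  cuda::atomic_ref<int, cuda::thread_scope_device> mx(*max_out);
  mn.fetch_min(in[i], cuda::std::memory_order_relaxed);  // stored value becomes min(old, in[i])
  mx.fetch_max(in[i], cuda::std::memory_order_relaxed);  // stored value becomes max(old, in[i])
}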
Limitations#
cuda::atomic_ref<T> and cuda::std::atomic_ref<T> may only be instantiated when T satisfies sizeof(T) <= 8, or sizeof(T) == 16 on architectures that meet the requirements below.
- The operations available to T when sizeof(T) == 16 depend on the architecture (a sketch follows this list):
  - On SM70 and later: load and store are supported.
  - On SM90 and later: fetch_* and synchronization operations are supported, implemented via atomic compare-and-swap (CAS).
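A minimal sketch of the 16-byte case, assuming an SM70-or-later target (Pair is a hypothetical type chosen to satisfy the size, alignment, and padding requirements listed further below):

#include <cuda/atomic>

struct alignas(16) Pair { long long first; long long second; };  // 16 bytes, no padding

__global__ void swap_halves(Pair* p) {
  cuda::atomic_ref<Pair, cuda::thread_scope_device> ref(*p);
  Pair v = ref.load(cuda::std::memory_order_relaxed);             // SM70+: load/store
  ref.store(Pair{v.second, v.first}, cuda::std::memory_order_relaxed);
}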
 
No object or subobject of an object referenced by an atomic_ref shall be concurrently referenced by any other
atomic_ref that has a different Scope.
- For cuda::atomic_ref<T> and cuda::std::atomic_ref<T>, the type T must satisfy the following (compile-time checks are sketched after this list):
  - sizeof(T) <= 16.
  - The referenced object must be aligned to its size: alignof(T) == sizeof(T).
  - T must not have “padding bits”, i.e., T’s object representation must not have bits that do not participate in its value representation.
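These requirements can be checked at compile time. A sketch, assuming the C++17 trait has_unique_object_representations is available through <cuda/std/type_traits> (Payload is a hypothetical type):

#include <cuda/atomic>
#include <cuda/std/type_traits>

struct alignas(8) Payload { int a; int b; };  // 8 bytes, no padding bits

static_assert(sizeof(Payload) <= 16, "size requirement");
static_assert(alignof(Payload) == sizeof(Payload), "alignment requirement");
// "No padding bits": every bit of the object representation participates
// in the value representation.
static_assert(cuda::std::has_unique_object_representations<Payload>::value,
              "padding requirement");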
 
Concurrency Restrictions#
See the memory model documentation for general restrictions on atomicity.
- With CUDA Compute Capability 6 (Pascal), an object of type atomic_ref may not be used (illustrated below):
  - with a reference to an object with automatic storage duration in a GPU thread, or
  - if is_always_lock_free() is false.
 
With CUDA Compute Capability prior to 6 (Pascal), objects of type cuda::atomic_ref or
cuda::std::atomic_ref may not be used at all.
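To illustrate the first restriction, a hypothetical kernel showing which references are permitted on Pascal:

#include <cuda/atomic>

__global__ void pascal_example(int* gmem) {
  int local = 1;
  cuda::atomic_ref<int> ok(*gmem);  // fine: *gmem does not have automatic storage duration
  ok.fetch_add(local, cuda::std::memory_order_relaxed);
  // cuda::atomic_ref<int> bad(local);  // not valid on Pascal: `local` has
  //                                    // automatic storage duration in a GPU thread
}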
Implementation-Defined Behavior#
For each type T and cuda::thread_scope S, the value of
cuda::atomic_ref<T, S>::is_always_lock_free() and cuda::std::atomic_ref<T>::is_always_lock_free() is as follows:
| Type | Scope | is_always_lock_free() |
| Any valid type | Any thread scope | true |
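This can be verified at compile time; a sketch, assuming is_always_lock_free is also exposed as a static constexpr member as in the std::atomic_ref interface:

#include <cuda/atomic>

static_assert(cuda::atomic_ref<int>::is_always_lock_free, "");
static_assert(cuda::atomic_ref<double, cuda::thread_scope_block>::is_always_lock_free, "");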
Types T where sizeof(T) < 4 are not natively supported by the underlying hardware. For these types, atomic
operations are emulated and will be drastically slower. Contention between objects that share the same 4-byte boundary
exacerbates this further. In these situations it is advisable to first perform a hierarchical reduction into non-adjacent memory, as sketched below.
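A sketch of that pattern, shown with int for simplicity (the same structure applies to smaller types): each block reduces into shared memory first, so only one device-scope atomic is issued per block instead of one per thread.

#include <cuda/atomic>

__global__ void block_reduce_then_atomic(const int* in, int* out, int n) {
  __shared__ int block_sum;
  if (threadIdx.x == 0) block_sum = 0;
  __syncthreads();

  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int v = (i < n) ? in[i] : 0;

  // Contend only with threads in this block, on block-local shared memory.
  cuda::atomic_ref<int, cuda::thread_scope_block> local(block_sum);
  local.fetch_add(v, cuda::std::memory_order_relaxed);
  __syncthreads();

  // One device-scope atomic per block.
  if (threadIdx.x == 0) {
    cuda::atomic_ref<int, cuda::thread_scope_device> global(*out);
    global.fetch_add(block_sum, cuda::std::memory_order_relaxed);
  }
}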
Example#
#include <cuda/atomic>
__global__ void example_kernel(int *gmem, int *pinned_mem) {
  // This atomic is suitable for all threads in the system.
  cuda::atomic_ref<int, cuda::thread_scope_system> a(*pinned_mem);
  // This atomic has the same type as the previous one (`a`).
  cuda::atomic_ref<int> b(*pinned_mem);
  // This atomic is suitable for all threads on the current processor (e.g. GPU).
  cuda::atomic_ref<int, cuda::thread_scope_device> c(*gmem);
  __shared__ int shared_v;
  // This atomic is suitable for threads in the same thread block.
  cuda::atomic_ref<int, cuda::thread_scope_block> d(shared_v);
}
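Narrower scopes permit cheaper synchronization, so prefer the narrowest cuda::thread_scope that still covers every thread that may concurrently access the object.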