cuda::atomic_ref
#
Defined in header <cuda/atomic>
:
template <typename T, cuda::thread_scope Scope = cuda::thread_scope_system>
class cuda::atomic_ref;
The class template cuda::atomic_ref
is an extended form of cuda::std::atomic_ref
that takes an additional cuda::thread_scope argument, defaulted to
cuda::std::thread_scope_system
.
It has the same interface and semantics as cuda::std::atomic_ref, with the following additional operations. This class additionally deviates from the standard by being backported to C++11.
Atomically find the minimum of the stored value and a provided value. |
|
Atomically find the maximum of the stored value and a provided value. |
Limitations#
cuda::atomic_ref<T>
and cuda::std::atomic_ref<T>
may only be instantiated with a T that are either 4 or 8 bytes.
No object or subobject of an object referenced by an atomic_ref
shall be concurrently referenced by any other
atomic_ref
that has a different Scope
.
- For
cuda::atomic_ref<T>
andcuda::std::atomic_ref<T>
the typeT
must satisfy the following: sizeof(T) <= 8
.T
must not have “padding bits”, i.e., T’s object representation must not have bits that do not participate in it’s value representation.
Concurrency Restrictions#
See memory model documentation for general restrictions on atomicity.
- With CUDA Compute Capability 6 (Pascal), an object of type
atomic_ref
may not be used: with a reference to an object with a automatic storage duration in a GPU thread, or
if
is_always_lock_free()
isfalse
.
For CUDA Compute Capability prior to 6 (Pascal), objects of type cuda::atomic_ref
or
cuda::std::atomic_ref may not be used.
Implementation-Defined Behavior#
For each type T
and cuda::thread_scope S
, the value of
cuda::atomic_ref<T, S>::is_always_lock_free()
and cuda::std::atomic_ref<T>::is_always_lock_free()
is as follows:
Type |
|
|
Any valid type |
Any thread scope |
|
Types of T
, where sizeof(T) < 4
, are not natively supported by the underlying hardware. For these types atomic
operations are emulated and will be drastically slower. Contention with contiguous memory in the current 4 byte boundary
will be exacerbated. In these situations it is advisable to perform a hierarchical reduction to non-adjacent memory first.
Example#
#include <cuda/atomic>
__global__ void example_kernel(int *gmem, int *pinned_mem) {
// This atomic is suitable for all threads in the system.
cuda::atomic_ref<int, cuda::thread_scope_system> a(*pinned_mem);
// This atomic has the same type as the previous one (`a`).
cuda::atomic_ref<int> b(*pinned_mem);
// This atomic is suitable for all threads on the current processor (e.g. GPU).
cuda::atomic_ref<int, cuda::thread_scope_device> c(*gmem);
__shared__ int shared_v;
// This atomic is suitable for threads in the same thread block.
cuda::atomic_ref<int, cuda::thread_scope_block> d(shared_v);
}