cuda::barrier
Defined in header <cuda/barrier>
:
template <cuda::thread_scope Scope,
typename CompletionFunction = /* unspecified */>
class cuda::barrier;
The class template cuda::barrier
is an extended form of cuda::std::barrier
that takes an additional cuda::thread_scope
argument. It has the same interface and semantics as cuda::std::barrier
, with the following additional operations.
Barrier Operations
cuda::barrier::init | Initialize a cuda::barrier . (friend function) |
cuda::device::barrier_native_handle | Get the native handle to a cuda::barrier . (function template) |
NVCC __shared__
Initialization Warnings
When using libcu++ with NVCC, a __shared__
cuda::barrier
will lead to the following warning because __shared__
variables are not initialized:
warning: dynamic initialization is not supported for a function-scope static
__shared__ variable within a __device__/__global__ function
It can be silenced using #pragma diag_suppress static_var_with_dynamic_init
.
To properly initialize a __shared__
cuda::barrier
, use the cuda::barrier::init
friend function.
Concurrency Restrictions
An object of type cuda::barrier
or cuda::std::barrier
shall not be accessed concurrently by CPU and GPU threads unless:
- it is in unified memory and the
concurrentManagedAccess
property is 1, or - it is in CPU memory and the
hostNativeAtomicSupported
property is 1.
Note, for objects of scopes other than cuda::thread_scope_system
this is a data-race, and thefore also prohibited regardless of memory characteristics.
Under CUDA Compute Capability 8 (Ampere) or above, when an object of type cuda::barrier<thread_scope_block>
is placed in __shared__
memory, the member function arrive
performs a reduction of the arrival count among coalesced threads followed by the arrival operation in one thread. Programs shall ensure that this transformation would not introduce errors, for example relative to the requirements of thread.barrier.class paragraph 12 of ISO/IEC IS 14882 (the C++ Standard).
Under CUDA Compute Capability 6 (Pascal) or prior, an object of type cuda::barrier
or cuda::std::barrier
may not be used.
Implementation-Defined Behavior
For each cuda::thread_scope
S
and CompletionFunction
F
, the value of cuda::barrier<S, F>::max()
is as follows:
cuda::thread_scope S | CompletionFunction F | barrier<S, F>::max() |
---|---|---|
cuda::thread_scope_block | Default or user-provided | (1 << 20) - 1 |
Not cuda::thread_scope_block | Default | cuda::std::numeric_limits<cuda::std::int32_t>::max() |
Not cuda::thread_scope_block | User-provided | cuda::std::numeric_limits<cuda::std::ptrdiff_t>::max() |
Example
#include <cuda/barrier>
__global__ void example_kernel() {
// This barrier is suitable for all threads in the system.
cuda::barrier<cuda::thread_scope_system> a(10);
// This barrier has the same type as the previous one (`a`).
cuda::std::barrier<> b(10);
// This barrier is suitable for all threads on the current processor (e.g. GPU).
cuda::barrier<cuda::thread_scope_device> c(10);
// This barrier is suitable for all threads in the same thread block.
cuda::barrier<cuda::thread_scope_block> d(10);
}