Warp Match All
Defined in <cuda/warp>
header.
namespace cuda::device {
template <typename T>
[[nodiscard]] __device__ bool
warp_match_all(const T& data, lane_mask = lane_mask::all_active());
} // namespace cuda::device
The functionality provides a generalized and safe alternative to CUDA warp match all intrinsic __match_all_sync
.
The function allows bitwise comparison of any data size, including raw arrays, pointers, and structs.
Parameters
data
: data to compare.lane_mask
: mask of the active lanes.
Return value
true
if all lanes in thelane_mask
have the same value fordata
.false
otherwise.
Preconditions
The functionality is only supported on
SM >= 70
.lane_mask
must be a subset of the active mask and be non-zero.
Performance considerations
The function calls the PTX instruction
match.sync
\(ceil\left(\frac{sizeof(data)}{4}\right)\) times.The function is slightly faster when called with a mask of all active lanes (overload function) even if all lanes participates in the call.
The function is slower when called with a non-fully active warp.
References
Example
#include <cuda/std/array>
#include <cuda/std/cassert>
#include <cuda/warp>
struct MyStruct {
double x;
int y;
};
__global__ void warp_match_kernel() {
assert(cuda::device::warp_match_all(2));
assert(cuda::device::warp_match_all(2, cuda::device::lane_mask::all()));
assert(cuda::device::warp_match_all(MyStruct{1.0, 3}));
assert(!cuda::device::warp_match_all(threadIdx.x));
}
int main() {
warp_match_kernel<<<1, 32>>>();
cudaDeviceSynchronize();
return 0;
}