cuda::device::warp_match_any#
Defined in <cuda/warp> header.
namespace cuda::device {
template <typename T>
[[nodiscard]] __device__ lane_mask
warp_match_any(const T& data, lane_mask = lane_mask::all());
} // namespace cuda::device
The functionality provides a generalized and safe alternative to CUDA warp match any intrinsic __match_any_sync.
The function allows bitwise comparison of any data size, including raw arrays, pointers, and structs.
Note
The underlying CUDA intrinsic does not provide memory ordering.
Parameters
data: data to compare.lane_mask: mask of the active lanes.
Return value
A
lane_maskrepresenting the non-exited lanes inlane_maskthat have the same bitwise value fordataas the calling lane.
Constraints
Tshall be trivially copyable, see cuda::is_trivially_copyable.When
__builtin_clear_paddingis not supported,Tshall have no padding bits, that is,T’s value representation shall be identical to its object representation.
Preconditions
The functionality is only supported on
SM >= 70.lane_maskmust be non-zero.
Undefined Behavior
lane_maskmust represent a subset of the active lanes.All non-exited lanes specified by
lane_maskmust execute the function with the samelane_maskvalue.
Performance considerations
The function calls the PTX instruction
match.sync\(ceil\left(\frac{sizeof(data)}{4}\right)\) times.The function is faster when called with a mask representing all active lanes in a warp (default value of the second parameter
lane_mask).The function uses
__ballot_syncwhenTisbool.
References
Example#
#include <cuda/std/array>
#include <cuda/std/cassert>
#include <cuda/warp>
struct MyStruct {
double x; // 8 bytes
int y; // 4 bytes
}; // 4 bytes of padding
__global__ void warp_match_kernel() {
{
auto mask = cuda::device::warp_match_any(threadIdx.x / 4);
auto expected = cuda::device::lane_mask{0b1111 << ((threadIdx.x / 4) * 4)};
assert(mask == expected);
}
{
auto mask = cuda::device::warp_match_any(2);
auto expected = cuda::device::lane_mask{0xFFFFFFFF};
assert(mask == expected);
}
{
// compile error, except when __builtin_clear_padding is supported
auto mask = cuda::device::warp_match_any(MyStruct{1.0, 3});
auto expected = cuda::device::lane_mask{0xFFFFFFFF};
assert(mask == expected);
}
}
int main() {
warp_match_kernel<<<1, 32>>>();
cudaDeviceSynchronize();
return 0;
}