<cuda/std/bit>#
cuda::std::bit_cast#
cuda::std::bit_cast extends the standard std::bit_cast to also recognize CUDA extended floating-point scalar and vector types as trivially copyable.
Limitations
The function can be used in constexpr contexts only when the source and destination types are trivially copyable.
The function cannot be used in constexpr contexts with MSVC <= 19.25 and GCC <= 10.
CUDA Performance Considerations#
For an unsigned integer with N bits, where N <= 32, the <bit> functions translate into the following SASS instructions. For some functions, the result is annotated with a compile-time assumption that restricts its range and enables further optimizations.
bit_width() translates into a single FLO SASS instruction. The result is assumed to be in the range [0, N].
bit_ceil() translates into ADD, FLO, SHL, IMINMAX SASS instructions. The result is assumed to be greater than or equal to the input.
bit_floor() translates into FLO, SHL SASS instructions. The result is assumed to be less than or equal to the input.
byteswap() translates into a single PRMT SASS instruction.
popcount() translates into a single POPC SASS instruction. The result is assumed to be in the range [0, N].
has_single_bit() translates into POPC + ISETP SASS instructions.
rotl()/rotr() translate into a single SHF (funnel shift) SASS instruction.
countl_zero() translates into FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
countl_one() translates into LOP3, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
countr_zero() translates into BREV, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
countr_one() translates into LOP3, BREV, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
Additional Notes#
All functions are marked [[nodiscard]] and noexcept.
All functions support 128-bit integer types.
bit_ceil() checks for overflow in debug mode.
rotl()/rotr() check for an invalid count value (INT_MIN) in debug mode.
Note
When the inputs are nominally run-time values that the compiler can nevertheless resolve at compile time, e.g. the index of a loop with a fixed number of iterations, using these functions may not be optimal.
Note
GCC <= 8 uses a slow path with more instructions, even in CUDA code.