<cuda/std/bit>
CUDA Performance Considerations
Given an unsigned integer with N bits and N <= 32, the <bit> functions translate into the following SASS instructions. For some functions, the result is decorated with a compile-time assumption that restricts its range and allows further optimizations.
bit_width() translates into a single FLO SASS instruction. The result is assumed to be in the range [0, N].
bit_ceil() translates into ADD, FLO, SHL, IMINMAX SASS instructions. The result is assumed to be greater than or equal to the input.
bit_floor() translates into FLO, SHL SASS instructions. The result is assumed to be less than or equal to the input.
byteswap() translates into a single PRMT SASS instruction.
popcount() translates into a single POPC SASS instruction. The result is assumed to be in the range [0, N].
has_single_bit() translates into POPC + ISETP SASS instructions.
rotl()/rotr() translate into a single SHF (funnel shift) SASS instruction.
countl_zero() translates into FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
countl_one() translates into LOP3, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
countr_zero() translates into BREV, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
countr_one() translates into LOP3, BREV, FLO, IMINMAX SASS instructions. The result is assumed to be in the range [0, N].
Additional Notes
All functions are marked [[nodiscard]] and noexcept.
All functions support 128-bit integer types.
bit_ceil() checks for overflow in debug mode.
rotl()/rotr() check for an invalid count value (INT_MIN) in debug mode.
Note
When the inputs are run-time values that the compiler can resolve at compile time, e.g. the index of a loop with a fixed number of iterations, using these functions may not be optimal.
Note
GCC <= 8 uses a slow path with more instructions, even in CUDA code.