static_for
Defined in <cuda/utility>
header.
namespace cuda {
template <auto Size, typename Operator, typename... TArgs>
__host__ __device__
constexpr void static_for(Operator op, TArgs&&... args)
noexcept(noexcept(op(/*integral_constant*/, args...)))
template <auto Start, auto End, auto Step = 1, typename Operator, typename... TArgs>
__host__ __device__
constexpr void static_for(Operator op, TArgs&&... args)
noexcept(noexcept(op(/*integral_constant*/, args...)))
template <typename T, T Size, typename Operator, typename... TArgs>
__host__ __device__
constexpr void static_for(Operator op, TArgs&&... args)
noexcept(noexcept(op(/*integral_constant*/, args...)))
template <typename T, T Start, T End, T Step = 1, typename Operator, typename... TArgs>
__host__ __device__
constexpr void static_for(Operator op, TArgs&&... args)
noexcept(noexcept(op(/*integral_constant*/, args...)))
} // namespace cuda
for
loop with compile-time indices.static_for
is available in two forms:Executes
op
for each value in the range[0, Size)
.Executes
op
for each value in the range[Start, End)
with stepStep
.
Parameters
Size
: the number of iterations.Start
,End
,Step
: the start, end, and step of the range. Note thatEnd
andStep
are converted to the type ofStart
.T
: type of the loop index.op
: the function to execute.args
: additional arguments to pass toop
.
op
is a callable object that accepts an integral_constant
of the same type of Size
or Start
.
Performance considerations
The functions are useful as metaprogramming utility and when a loop requires full unrolling, independently of the compiler’s constrains, optimization level, and heuristics. In addition, the index is a compile-time constant, which can be used in a constant expression and further optimize the code.
Conversely,
static_for
is more expensive to compile compared to#pragma unroll
. Additionally, the preprocessor directive interacts with the compiler, which tunes the loop unrolling based on register usage, binary size, and instruction cache.
Example
#include <cuda/utility>
#include <cstdio>
__global__ void kernel() {
cuda::static_for<5>([](auto i){ static_assert(i >= 0 && i < 5); });
cuda::static_for<5>([](auto i){ printf("%d, ", i()); }); // 0, 1, 2, 3, 4,
printf("\n");
cuda::static_for<short, 5>([](auto i){ printf("%d, ", i()); }); // 0, 1, 2, 3, 4,
printf("\n");
cuda::static_for<-5, 7, 3>([](auto i){ printf("%d, ", i()); }); // -5, -2, 1, 4,
printf("\n");
cuda::static_for<5>([](auto i){
if constexpr (i > 0) {
cuda::static_for<i()>([](auto j){ printf("%d, ", j()); });
printf("\n");
}
});
// 0,
// 0, 1,
// 0, 1, 2,
// 0, 1, 2, 3,
cuda::static_for<5>([](auto i, int a, int b, int c){}, 1, 2, 3); // 1, 2, 3 optional arguments
}
int main() {
kernel<<<1, 1>>>();
cudaDeviceSynchronize();
}