Tuning NVCC for your CUDA kernel

In this section, we walk through optimizing the runtime of a CUDA reduction kernel by tuning NVCC compiler controls with CompileIQ.

The example code can be found in our repo here.

CUDA Reduction Example

This example uses a self-contained CUDA reduction kernel (reduction.cu) that sums 64M integers using shared memory and warp shuffle. CompileIQ tunes the NVCC compiler controls via --apply-controls to minimize runtime.

What you’ll need:

  • A Python environment with CompileIQ installed

  • CUDA Toolkit (CTK) 13.3+

  • A GPU (Blackwell sm_100 by default, adjustable via --arch)

Building the objective function

The objective function receives a hex-encoded compiler configuration from CompileIQ, writes it to a temporary file, compiles the kernel with --apply-controls, runs it, and returns the execution time:

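import tempfile
from pathlib import Path

from compileiq.types import INVALID_SCORE

def objective(config_blob: str) -> float:
    # `arch` and `build_and_run` come from the enclosing script scope.
    # Decode the hex-encoded config and write it to a temporary file for nvcc.
    with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
        f.write(bytes.fromhex(config_blob))
        config_path = Path(f.name)
    try:
        result = build_and_run(arch, config_path)
        return result["mean_ms"] if result["success"] else INVALID_SCORE
    finally:
        # Always remove the temporary config, even if the build or run fails.
        config_path.unlink(missing_ok=True)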

The build_and_run helper compiles the kernel in a single nvcc invocation and then benchmarks the resulting binary:

subprocess.run([nvcc, "-arch=sm_100", "-O3", "-std=c++17",
                "--apply-controls", str(config_file),
                "reduction.cu", "-o", str(exe)], ...)

As described in our Safety Section, the objective enforces timeouts, checks correctness (the program output must contain "Test passed"), and returns INVALID_SCORE on any failure.
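To make these safety checks concrete, here is a minimal sketch of the kind of logic a helper like build_and_run could contain. It is not the code from the example repo: the names parse_run_output and run_checked, the 60-second default timeout, and the assumption that the benchmark prints a line such as `mean_ms: 1.234` are all hypothetical. Only the "Test passed" check and the success/mean_ms result keys come from the text above.

```python
import re
import subprocess
import sys

# Hypothetical output format: the benchmark is assumed to print "mean_ms: <float>".
MEAN_MS_PATTERN = re.compile(r"mean_ms:\s*([0-9]*\.?[0-9]+)")


def parse_run_output(stdout: str) -> dict:
    """Turn benchmark stdout into the {"success", "mean_ms"} dict used by the objective."""
    match = MEAN_MS_PATTERN.search(stdout)
    # Correctness check: the run only counts if the kernel reports "Test passed".
    if "Test passed" not in stdout or match is None:
        return {"success": False, "mean_ms": None}
    return {"success": True, "mean_ms": float(match.group(1))}


def run_checked(cmd: list, timeout_s: float = 60.0) -> dict:
    """Run a command with a timeout; any failure becomes an unsuccessful result."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable binary must not stall the search.
        return {"success": False, "mean_ms": None}
    if proc.returncode != 0:
        return {"success": False, "mean_ms": None}
    return parse_run_output(proc.stdout)


if __name__ == "__main__":
    # Stand-in command that mimics the assumed output of the compiled kernel.
    fake_kernel = [sys.executable, "-c", "print('Test passed'); print('mean_ms: 2.25')"]
    print(run_checked(fake_kernel))
```

Funnelling every failure mode (timeout, non-zero exit code, missing "Test passed", unparsable timing) into the same unsuccessful result keeps the objective simple: it only has to map `success == False` to INVALID_SCORE.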

Running the example

# Run optimization
python optimize_reduction.py --arch sm_100

# Benchmark the saved config
python optimize_reduction.py --benchmark-only \
    --nvcc-options "--apply-controls reduction_best_config.bin"
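The flags used in these commands suggest a command-line interface along the following lines. This is a hypothetical sketch, not the actual optimize_reduction.py: the parse_args name, the defaults, and the use of shlex.split to turn the quoted --nvcc-options string into separate nvcc arguments are all assumptions made for illustration.

```python
import argparse
import shlex


def parse_args(argv=None):
    """Parse the three flags shown in the commands above (hypothetical layout)."""
    parser = argparse.ArgumentParser(description="Tune or benchmark the reduction kernel")
    parser.add_argument("--arch", default="sm_100", help="target GPU architecture")
    parser.add_argument("--benchmark-only", action="store_true",
                        help="skip the search and only time the kernel")
    parser.add_argument("--nvcc-options", default="",
                        help="extra nvcc options, passed as one quoted string")
    args = parser.parse_args(argv)
    # Split the quoted string into individual entries suitable for subprocess.
    args.nvcc_options = shlex.split(args.nvcc_options)
    return args
```

Passing the extra options as one quoted string keeps them from being mistaken for the script's own flags, and splitting them with shlex respects any quoting inside the string.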

Comparison with the PTXAS example

This NVCC example differs from the PTXAS spill example in two key ways:

  • Full compilation: NVCC compiles .cu source to a runnable binary, so each evaluation is slower but measures actual runtime performance.

  • Runtime metric: The objective minimizes execution time rather than register spills.

Both examples follow the same pattern: define an objective, fetch a compiler search space, and run a CompileIQ search.

NOTE: An ACF can contain controls that target a specific compiler. To use a PTXAS ACF with NVCC, forward it to PTXAS with -Xptxas:

      nvcc -Xptxas="--apply-controls=best_config.bin" kernel.cu
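When the compile command is assembled in Python, the two ways of passing an ACF can be captured in one small helper. The sketch below is illustrative only: the nvcc_command function and its acf_target parameter are hypothetical names, while the flags themselves are the ones shown in this section.

```python
def nvcc_command(source, exe, acf, arch="sm_100", acf_target="nvcc"):
    """Build an nvcc argument list that applies an ACF to NVCC or forwards it to PTXAS."""
    cmd = ["nvcc", f"-arch={arch}", "-O3", "-std=c++17"]
    if acf_target == "nvcc":
        # NVCC ACF: applied by nvcc itself.
        cmd += ["--apply-controls", str(acf)]
    elif acf_target == "ptxas":
        # PTXAS ACF: forwarded to ptxas. No shell quoting is needed in a list.
        cmd.append(f"-Xptxas=--apply-controls={acf}")
    else:
        raise ValueError(f"unknown acf_target: {acf_target!r}")
    cmd += [str(source), "-o", str(exe)]
    return cmd
```

For example, `nvcc_command("kernel.cu", "kernel", "best_config.bin", acf_target="ptxas")` yields an argument list whose ACF entry is `-Xptxas=--apply-controls=best_config.bin`, matching the shell command above.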