Tuning NVCC for your CUDA kernel#
In this section, we walk through optimizing the runtime of a CUDA reduction kernel by tuning NVCC compiler controls with CompileIQ.
The example code can be found in our repo here.
CUDA Reduction Example#
This example uses a self-contained CUDA reduction kernel (reduction.cu) that sums 64M integers using shared memory and warp shuffle. CompileIQ tunes the NVCC compiler controls via --apply-controls to minimize runtime.
What you’ll need:
A Python environment with CompileIQ installed
CUDA Toolkit (CTK) 13.3+
A GPU (Blackwell sm_100 by default, adjustable via --arch)
Building the objective function#
The objective function receives a hex-encoded compiler configuration from CompileIQ, writes it to a temporary file, compiles the kernel with --apply-controls, runs it, and returns the execution time:
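import tempfile
from pathlib import Path

from compileiq.types import INVALID_SCORE

def objective(config_blob: str) -> float:
    # Decode the hex-encoded configuration and write it to a temporary file.
    with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
        f.write(bytes.fromhex(config_blob))
        config_path = Path(f.name)
    try:
        # arch and build_and_run are defined elsewhere in the script.
        result = build_and_run(arch, config_path)
        return result["mean_ms"] if result["success"] else INVALID_SCORE
    finally:
        # Always remove the temporary config file, even on failure.
        config_path.unlink(missing_ok=True)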
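The temp-file handling in the objective can be exercised without a GPU by swapping in a stub. The sketch below is only an illustration: `fake_build_and_run`, `make_objective`, and the `INVALID_SCORE` stand-in value are hypothetical placeholders, not part of CompileIQ or the example script.

```python
import tempfile
from pathlib import Path

# Stand-in for compileiq.types.INVALID_SCORE; the real value may differ.
INVALID_SCORE = float("inf")

seen_paths = []  # records every temp file handed to the stub


def fake_build_and_run(arch, config_path):
    """Hypothetical stub: 'succeeds' only if the config file is non-empty."""
    seen_paths.append(config_path)
    size = config_path.stat().st_size
    return {"success": size > 0, "mean_ms": float(size)}


def make_objective(arch, build_and_run):
    """Bind arch and the build helper, returning an objective(config_blob)."""
    def objective(config_blob: str) -> float:
        with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as f:
            f.write(bytes.fromhex(config_blob))
            config_path = Path(f.name)
        try:
            result = build_and_run(arch, config_path)
            return result["mean_ms"] if result["success"] else INVALID_SCORE
        finally:
            config_path.unlink(missing_ok=True)
    return objective


objective = make_objective("sm_100", fake_build_and_run)
score = objective("deadbeef")   # 4 bytes written, so the stub reports 4.0
empty_score = objective("")     # empty blob: the stub fails, INVALID_SCORE
```

Passing `arch` and the build helper in explicitly, rather than reading them from an enclosing scope, is one way to keep the objective testable in isolation.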
The build_and_run helper compiles in a single step and benchmarks:
subprocess.run([nvcc, "-arch=sm_100", "-O3", "-std=c++17",
                "--apply-controls", str(config_file),
                "reduction.cu", "-o", str(exe)], ...)
As described in our Safety Section, the objective enforces timeouts, checks correctness ("Test passed" validation), and returns INVALID_SCORE on any failure.
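The safety checks described above could be sketched roughly as follows. This is not the example's actual `build_and_run`; the helper name `run_checked`, its return dictionary, and the default timeout are assumptions made for illustration. It relies only on the standard `subprocess` module, with the Python interpreter standing in for the compiled kernel.

```python
import subprocess
import sys


def run_checked(cmd, timeout_s=30.0, marker="Test passed"):
    """Run cmd; report success only if it exits 0 in time and prints marker."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return {"success": False, "reason": "timeout"}
    except OSError as exc:
        return {"success": False, "reason": f"launch failed: {exc}"}
    if proc.returncode != 0:
        return {"success": False, "reason": f"exit code {proc.returncode}"}
    if marker not in proc.stdout:
        return {"success": False, "reason": "correctness marker missing"}
    return {"success": True, "stdout": proc.stdout}


# The Python interpreter stands in for the compiled reduction binary here.
ok = run_checked([sys.executable, "-c", "print('Test passed')"])
bad = run_checked([sys.executable, "-c", "print('wrong sum')"])
slow = run_checked([sys.executable, "-c", "import time; time.sleep(5)"],
                   timeout_s=0.5)
```

A caller would map any result with `success` set to `False` to INVALID_SCORE, so a hung or miscompiled configuration cannot win the search.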
Configuring the search#
The search space is fetched automatically based on the detected CUDA version:
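import re

from compileiq.search_spaces.compilers import NvccSearchSpace
# Search, SearchConfiguration, and ProblemType also come from compileiq
# (their import lines are omitted here).

# version_output is assumed to hold the text printed by `nvcc --version`.
cuda_version = re.search(r"release (\d+\.\d+),", version_output).group(1)
search_space = NvccSearchSpace(version=cuda_version)

config = SearchConfiguration(
    problem_type=ProblemType.MIN,  # minimize runtime
    generations=10,
    pool_size=15,
)
tuner = Search(
    objective_function=objective,
    search_space=search_space,
    search_config=config,
)
results = tuner.start(num_workers=1)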
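The version detection uses a regular expression that raises `AttributeError` if the expected text is missing. A slightly more defensive variant is sketched below; the function name `parse_cuda_version` is hypothetical, and the sample string only imitates the shape of `nvcc --version` output with made-up numbers.

```python
import re
from typing import Optional


def parse_cuda_version(version_output: str) -> Optional[str]:
    """Return 'major.minor' from `nvcc --version` text, or None if absent."""
    match = re.search(r"release (\d+\.\d+),", version_output)
    return match.group(1) if match else None


# Illustrative text in the shape nvcc prints; the numbers are made up.
sample = "Cuda compilation tools, release 13.3, V13.3.0\n"
version = parse_cuda_version(sample)
missing = parse_cuda_version("nvcc: command not found")
```

Returning `None` lets the caller print a clear error message instead of failing with a traceback when NVCC is not on the path.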
The script automatically runs a baseline (no compiler controls), then reports the speedup achieved by the best configuration found.
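As a rough illustration of the reported figure, the sketch below assumes speedup is defined as baseline time divided by tuned time. The function name, the timings, and the report format are invented for the example; the script's actual output is not shown here.

```python
def speedup(baseline_ms: float, tuned_ms: float) -> float:
    """Speedup of the tuned build over the baseline (>1.0 means faster)."""
    if baseline_ms <= 0 or tuned_ms <= 0:
        raise ValueError("timings must be positive")
    return baseline_ms / tuned_ms


# Made-up timings, purely for illustration.
ratio = speedup(baseline_ms=2.0, tuned_ms=1.6)
report = f"speedup: {ratio:.2f}x"
```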
Running the example#
# Run optimization
python optimize_reduction.py --arch sm_100
# Benchmark the saved config
python optimize_reduction.py --benchmark-only \
    --nvcc-options "--apply-controls reduction_best_config.bin"
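A quoted `--nvcc-options` string has to be split into separate arguments before it reaches the compiler. One plausible way to do that is sketched below with `shlex.split`; the helper `nvcc_command` and the output name `reduction` are assumptions, and the example script may assemble its command differently.

```python
import shlex


def nvcc_command(arch, source, exe, nvcc_options=""):
    """Assemble an nvcc argv list, appending user-supplied extra options."""
    cmd = ["nvcc", f"-arch={arch}", "-O3", "-std=c++17"]
    cmd += shlex.split(nvcc_options)  # honors shell-style quoting
    cmd += [source, "-o", exe]
    return cmd


cmd = nvcc_command("sm_100", "reduction.cu", "reduction",
                   "--apply-controls reduction_best_config.bin")
```

Building the command as a list and passing it to `subprocess.run` without `shell=True` avoids a second round of shell quoting.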
Comparison with the PTXAS example#
This NVCC example differs from the PTXAS spill example in two key ways:
Full compilation: NVCC compiles the .cu source to a runnable binary, so each evaluation is slower but measures actual runtime performance.
Runtime metric: The objective minimizes execution time rather than register spills.
Both examples follow the same pattern: define an objective, fetch a compiler search space, and run a CompileIQ search.
NOTE: ACFs can contain controls for specific compilers. When using a PTXAS ACF with NVCC, pass it directly to PTXAS via:
nvcc -Xptxas="--apply-controls=best_config.bin" kernel.cu
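When the same command is launched from Python as an argument list, the shell quotes are dropped because no shell is involved. The sketch below shows the equivalent list form; the helper name `ptxas_acf_command` is hypothetical.

```python
import shlex


def ptxas_acf_command(acf_path, source):
    """argv for nvcc that forwards a PTXAS ACF through -Xptxas."""
    return ["nvcc", f"-Xptxas=--apply-controls={acf_path}", source]


cmd = ptxas_acf_command("best_config.bin", "kernel.cu")

# The shell form from the note tokenizes to the same argv.
shell_form = 'nvcc -Xptxas="--apply-controls=best_config.bin" kernel.cu'
same = shlex.split(shell_form) == cmd
```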