This release is all about MoE (Mixture of Experts) — if you’re building or optimizing MoE models, v1.21.0 has a lot for you.
General Improvements
The big headline here: we’ve eliminated the CUDA driver API dependency, so you can now build cuDNN Frontend without linking directly against the CUDA driver. This simplifies builds, especially in containerized or cross-compilation environments.
Open-Source Kernels
A whole batch of new GEMM fusion kernels designed for MoE workloads:
Grouped GEMM + GLU
Unified API supporting both dense and discrete MoE layouts. Optional bias fusion included, so you can fold the bias add into the GEMM without an extra kernel launch.
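To make the dense layout concrete, here is a minimal NumPy sketch of what this fusion computes mathematically. This is not the cuDNN Frontend API — the function name, the [E, K, 2N] weight layout, and the GLU convention (first half gated by sigmoid of the second half) are illustrative assumptions:

```python
import numpy as np

def glu(y):
    # Split the GEMM output in half along the last axis:
    # GLU(a, b) = a * sigmoid(b)  (convention assumed for illustration)
    a, b = np.split(y, 2, axis=-1)
    return a * (1.0 / (1.0 + np.exp(-b)))

def grouped_gemm_glu_dense(x, w, bias, group_sizes):
    # x: [total_tokens, K], tokens already sorted by expert
    # w: [E, K, 2N] -- the "dense" layout: all expert weights in one tensor
    # bias: [E, 2N] -- folded into the GEMM epilogue, mirroring the bias fusion
    outs, start = [], 0
    for e, rows in enumerate(group_sizes):
        y = x[start:start + rows] @ w[e] + bias[e]  # per-expert GEMM + bias
        outs.append(glu(y))                          # fused activation
        start += rows
    return np.concatenate(outs, axis=0)
```

The fused kernel performs all of this in one launch; the sketch just spells out the reference semantics expert by expert.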
Grouped GEMM + dGLU
The backward-pass counterpart. Supports the same dense/discrete layouts, so the forward and backward passes can share a single layout across the training loop.
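The dGLU epilogue this kernel fuses is just the chain rule applied to GLU(a, b) = a · sigmoid(b). A small NumPy sketch of that gradient, with names chosen here for illustration:

```python
import numpy as np

def dglu(dout, a, b):
    # Gradient of GLU(a, b) = a * sigmoid(b) w.r.t. its two input halves.
    s = 1.0 / (1.0 + np.exp(-b))
    da = dout * s                    # d/da [a * s] = s
    db = dout * a * s * (1.0 - s)    # sigmoid'(b) = s * (1 - s)
    return np.concatenate([da, db], axis=-1)
```

In the fused kernel this gradient is applied to the incoming dY before (or as part of) the backward GEMM, avoiding a separate elementwise pass.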
Discrete Grouped GEMM + SwiGLU
Per-expert-pointer variant for MoE workloads where each expert has its own weight tensor at a different memory address. This avoids the overhead of gathering weights into contiguous memory.
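The discrete layout can be pictured as a list of independently allocated weight tensors rather than one stacked array. A NumPy sketch, assuming the common SwiGLU convention silu(a) · b (the actual convention and API are the library's, not shown here):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def discrete_grouped_gemm_swiglu(x, expert_weights, group_sizes):
    # expert_weights: a Python list of per-expert arrays, each living at its
    # own address -- the "discrete" layout. No gather into contiguous memory.
    outs, start = [], 0
    for rows, w in zip(group_sizes, expert_weights):
        y = x[start:start + rows] @ w      # GEMM against this expert's weights
        a, b = np.split(y, 2, axis=-1)
        outs.append(silu(a) * b)           # fused SwiGLU epilogue (assumed convention)
        start += rows
    return np.concatenate(outs, axis=0)
```

The kernel consumes an array of per-expert pointers directly, which is what saves the gather step the dense layout would otherwise require.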
Discrete Grouped GEMM + dSwiGLU
Backward variant using the dSwiGLU/dGeGLU epilogue. Same per-expert-pointer pattern as the forward pass.
Grouped GEMM + dSwiGLU
Fuses the dSwiGLU activation gradient into the backward GEMM for FC1 layers.
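For reference, the dSwiGLU gradient used by the backward variants above follows from SwiGLU(a, b) = silu(a) · b. A NumPy sketch of the math (function name and conventions are illustrative assumptions, not the library API):

```python
import numpy as np

def dswiglu(dout, a, b):
    # Gradient of SwiGLU(a, b) = silu(a) * b w.r.t. its two input halves.
    s = 1.0 / (1.0 + np.exp(-a))
    silu_a = a * s
    dsilu = s + a * s * (1.0 - s)   # d/da silu(a) = sigmoid(a) + a*sigmoid'(a)
    da = dout * b * dsilu
    db = dout * silu_a
    return np.concatenate([da, db], axis=-1)
```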
Grouped GEMM + Quant
Output quantization for MoE FC2/dFC1 workloads — quantize the GEMM output in-place, no extra kernel needed.
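A fused quantization epilogue of this kind amounts to scaling, rounding, and saturating the GEMM output before it is stored. A minimal NumPy sketch assuming int8 output with a per-tensor scale (the library may support other dtypes and scaling granularities):

```python
import numpy as np

def gemm_quant(x, w, scale):
    # GEMM followed by a fused int8 quantization epilogue:
    # q = clip(round(y / scale), -128, 127), stored directly as int8,
    # so the full-precision output never needs to be written out.
    y = x @ w
    return np.clip(np.rint(y / scale), -128, 127).astype(np.int8)
```

Folding this into the epilogue saves both the extra kernel launch and a round trip of the high-precision output through global memory.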
Full changelog: v1.21.0 on GitHub