This release is all about MoE (Mixture of Experts) — if you’re building or optimizing MoE models, v1.21.0 has a lot for you.

General Improvements

The big headline here: we’ve eliminated the CUDA driver API dependency, so you can now build cuDNN Frontend without linking directly against the CUDA driver. This simplifies builds, especially in containerized or cross-compilation environments.

Open-Source Kernels

A whole batch of new GEMM fusion kernels designed for MoE workloads:

Grouped GEMM + GLU

Unified API supporting both dense MoE layouts (all experts' weights stacked in one contiguous tensor) and discrete layouts (each expert's weights at its own address). Optional bias fusion is included, so you can fold the bias add into the GEMM without an extra kernel launch.
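As a rough sketch of the computation (this is not the cuDNN Frontend API; all names here are illustrative, and plain Python lists stand in for device tensors), a grouped GEMM + GLU per expert group looks like:

```python
# Conceptual sketch only, NOT the cuDNN Frontend API: what a grouped
# GEMM + GLU fusion computes per expert group.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matmul(a, b):
    # a: m x k, b: k x n, both as lists of rows
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gemm_glu(x, w, bias=None):
    # One expert: GEMM to 2*d columns, optional fused bias add, then the
    # GLU epilogue out[:, j] = h[:, j] * sigmoid(h[:, d + j]).
    h = matmul(x, w)
    if bias is not None:
        h = [[v + bias[j] for j, v in enumerate(row)] for row in h]
    d = len(h[0]) // 2
    return [[row[j] * sigmoid(row[d + j]) for j in range(d)] for row in h]

def grouped_gemm_glu(xs, ws, biases=None):
    # "Grouped": one GEMM + GLU per expert, with per-group token counts.
    biases = biases if biases is not None else [None] * len(ws)
    return [gemm_glu(x, w, b) for x, w, b in zip(xs, ws, biases)]
```

The point of the fusion is that the bias add and the GLU epilogue happen in the same kernel as the GEMM; the sketch separates them only for readability.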

Grouped GEMM + dGLU

The backward-pass counterpart. It supports the same dense and discrete layouts, so the forward and backward passes pair up cleanly in training loops.
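Mathematically, the dGLU epilogue is just the elementwise derivative of GLU applied to the backward GEMM's output. A sketch of that derivative (not the library's API) for GLU(a, b) = a * sigmoid(b):

```python
import math

def dglu(a, b, g):
    # Backward of GLU(a, b) = a * sigmoid(b), elementwise.
    # g is the upstream gradient dL/dout; returns (dL/da, dL/db).
    s = 1.0 / (1.0 + math.exp(-b))
    return g * s, g * a * s * (1.0 - s)
```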

Discrete Grouped GEMM + SwiGLU

Per-expert-pointer variant for MoE workloads where each expert has its own weight tensor at a different memory address. This avoids the overhead of gathering weights into contiguous memory.
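A minimal sketch of the discrete layout (again, not the library's API): per-expert weight matrices modeled as a list of independently allocated objects standing in for device pointers, with one common SwiGLU convention assumed for the epilogue.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))  # equals x * sigmoid(x)

def discrete_grouped_gemm_swiglu(xs, weight_ptrs):
    # "Discrete" layout: weight_ptrs is a list of independently allocated
    # per-expert weight matrices (standing in for per-expert device
    # pointers), so no gather into contiguous memory is required.
    # Epilogue (one common SwiGLU convention, assumed here):
    # out[:, j] = silu(h[:, j]) * h[:, d + j].
    outs = []
    for x, w in zip(xs, weight_ptrs):
        h = [[sum(x[i][p] * w[p][j] for p in range(len(w)))
              for j in range(len(w[0]))] for i in range(len(x))]
        d = len(h[0]) // 2
        outs.append([[silu(row[j]) * row[d + j] for j in range(d)]
                     for row in h])
    return outs
```

Note that each expert can have a different number of tokens (rows of `x`); the grouped kernel handles the ragged batch in one launch.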

Discrete Grouped GEMM + dSwiGLU

Backward variant using the dSwiGLU/dGeGLU epilogue. Same per-expert-pointer pattern as the forward pass.

Grouped GEMM + dSwiGLU

Fuses the dSwiGLU activation gradient into the backward GEMM for FC1 layers.
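The dSwiGLU epilogue used by the backward variants above boils down to the elementwise derivative of SwiGLU. A sketch (assuming the SwiGLU(a, b) = silu(a) * b convention; not the library's API):

```python
import math

def dswiglu(a, b, g):
    # Backward of SwiGLU(a, b) = silu(a) * b, where silu(a) = a * sigmoid(a).
    # g is the upstream gradient; returns (dL/da, dL/db). In the fused
    # kernel this runs as the epilogue of the backward GEMM rather than
    # as a separate elementwise pass.
    s = 1.0 / (1.0 + math.exp(-a))
    dsilu = s * (1.0 + a * (1.0 - s))  # derivative of silu at a
    return g * b * dsilu, g * a * s
```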

Grouped GEMM + Quant

Output quantization for MoE FC2/dFC1 workloads: the GEMM output is quantized as part of the epilogue, so no separate quantization kernel is needed.
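Conceptually, the fused epilogue does something like the following (a sketch, not the library's API; a single per-tensor int8 scale is an assumption here, and real deployments may use per-expert or per-channel scales):

```python
def quantize_rows(h, scale):
    # Sketch of a fused output-quantization epilogue: scale, round, and
    # clamp the GEMM output to int8 range in the same pass, instead of
    # writing the full-precision result and launching a separate
    # quantize kernel. Per-tensor scale is an assumption of this sketch.
    def q(v):
        return max(-128, min(127, int(round(v / scale))))
    return [[q(v) for v in row] for row in h]
```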

Full changelog: v1.21.0 on GitHub