v1.22.1 is recommended for cuDNN 9.20.0 and later. This release brings a new PyTorch-native MoE operator and a Blackwell SDPA forward kernel.
PyTorch Custom Operator for MoE
A new PyTorch custom operator wraps cuDNN’s Mixture of Experts Grouped GEMM functionality, giving you efficient expert selection and matrix multiplication through a native torch.ops interface — no manual graph building required. Just like the SDPA custom op from v1.22.0, this operator is autograd-compatible and works with torch.compile.
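The release notes do not spell out the operator's registered name, so as a minimal sketch here is a plain numpy reference of what a MoE Grouped GEMM fuses: routing each token to one expert, then running one GEMM per expert group. The function and variable names are illustrative, not the cuDNN frontend API.

```python
import numpy as np

def moe_grouped_gemm_reference(tokens, expert_ids, expert_weights):
    """Reference for what a fused MoE grouped GEMM computes: each token
    is routed to one expert and multiplied by that expert's weights.
    tokens: (n, d_in); expert_ids: (n,); expert_weights: (E, d_in, d_out)."""
    n, d_in = tokens.shape
    d_out = expert_weights.shape[2]
    out = np.zeros((n, d_out), dtype=tokens.dtype)
    for e in range(expert_weights.shape[0]):
        mask = expert_ids == e
        if mask.any():
            # One GEMM per expert over its group of tokens; the fused
            # kernel batches these groups into a single launch.
            out[mask] = tokens[mask] @ expert_weights[e]
    return out

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4)).astype(np.float32)
expert_ids = rng.integers(0, 3, size=8)
weights = rng.standard_normal((3, 4, 5)).astype(np.float32)
y = moe_grouped_gemm_reference(tokens, expert_ids, weights)
print(y.shape)  # (8, 5)
```

The custom operator performs this selection and multiplication in one fused call behind the torch.ops interface, rather than as a Python loop over experts.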
More native custom PyTorch operators are planned for upcoming releases.
Blackwell SDPA Forward Kernel
This release adds a native SDPA forward kernel for Blackwell supporting head dimension 256, built with CuTe DSL. It complements the Blackwell backward kernel shipped in v1.22.0, giving you full forward+backward coverage at d=256 on Blackwell.
Requires nvidia-cutlass-dsl[cu13]==4.4.1. Available through the torch operator or standalone API.
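For reference, the math the forward kernel computes is standard scaled dot-product attention, shown here in numpy at head dimension 256 (a sketch of the computation, not the kernel's API):

```python
import numpy as np

def sdpa_forward_reference(q, k, v):
    """Reference SDPA forward pass: softmax(Q K^T / sqrt(d)) V.
    The Blackwell kernel computes the same math fused, here with
    head dimension d = 256."""
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    # Numerically stable softmax over the key dimension.
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 256)).astype(np.float32)  # (heads, seq, d=256)
k = rng.standard_normal((2, 4, 256)).astype(np.float32)
v = rng.standard_normal((2, 4, 256)).astype(np.float32)
out = sdpa_forward_reference(q, k, v)
print(out.shape)  # (2, 4, 256)
```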
Grouped GEMM Weight-Gradient
New kernel exposures for weight-gradient computations in MoE scenarios:
- GroupedGemmWgradSm100 — SM100 (Blackwell) weight-gradient kernel
- grouped_gemm_wgrad_wrapper_sm100 — wrapper API for easy integration
These kernels enable efficient backward passes through MoE expert layers on Blackwell hardware.
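As a sketch of the math these kernels implement (a numpy reference under the usual MoE convention, not the cuDNN frontend API): the weight gradient for each expert is the transposed GEMM over that expert's token group, dW_e = X_e^T @ dY_e.

```python
import numpy as np

def grouped_gemm_wgrad_reference(x, dy, expert_ids, num_experts):
    """Reference weight-gradient for MoE expert layers: for each expert
    e, dW_e = X_e^T @ dY_e, where X_e / dY_e are the rows of x / dy
    routed to expert e. x: (n, d_in); dy: (n, d_out)."""
    d_in, d_out = x.shape[1], dy.shape[1]
    dw = np.zeros((num_experts, d_in, d_out), dtype=x.dtype)
    for e in range(num_experts):
        mask = expert_ids == e
        if mask.any():
            # One transposed GEMM per expert group; the SM100 kernel
            # batches these groups into a single fused launch.
            dw[e] = x[mask].T @ dy[mask]
    return dw

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4)).astype(np.float32)
dy = rng.standard_normal((8, 5)).astype(np.float32)
expert_ids = rng.integers(0, 3, size=8)
dw = grouped_gemm_wgrad_reference(x, dy, expert_ids, num_experts=3)
print(dw.shape)  # (3, 4, 5)
```

Since each token is routed to exactly one expert here, the per-expert gradients sum to the dense gradient x.T @ dy, which is a useful sanity check when validating a fused kernel against a reference.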
Full changelog: v1.22.1 on GitHub