v1.22.1 is recommended for cuDNN 9.20.0 and later. This release brings a new PyTorch-native MoE operator and a Blackwell SDPA forward kernel.

PyTorch Custom Operator for MoE

A new PyTorch custom operator wraps cuDNN’s Mixture of Experts (MoE) Grouped GEMM functionality, giving you efficient expert selection and matrix multiplication through a native torch.ops interface with no manual graph building required. Like the SDPA custom op introduced in v1.22.0, this operator is autograd-compatible and works with torch.compile.
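The exact torch.ops name of the new operator isn't spelled out above, so here is a plain-PyTorch sketch of the computation a MoE grouped GEMM performs: one independent matmul per expert over that expert's routed tokens. All shapes and group sizes below are illustrative, not from the release.

```python
import torch

# CPU reference for what a MoE grouped GEMM computes: one independent
# matmul per expert over that expert's routed tokens. The cuDNN custom
# operator fuses these per-expert GEMMs into a single call.
torch.manual_seed(0)
num_experts, d_in, d_out = 4, 8, 16
tokens_per_expert = [3, 5, 2, 6]                        # variable group sizes
xs = [torch.randn(n, d_in) for n in tokens_per_expert]  # routed token groups
ws = torch.randn(num_experts, d_in, d_out)              # one weight per expert

ys = [x @ ws[e] for e, x in enumerate(xs)]              # grouped GEMM
```

Each output group keeps its expert's token count, so the results can be scattered back to the original token order by the router.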

More native custom PyTorch operators are planned for upcoming releases.

Blackwell SDPA Forward Kernel

This release adds a new native SDPA forward kernel for Blackwell that supports head dimension 256, built with CuTe DSL. It complements the Blackwell backward kernel shipped in v1.22.0, giving you full forward+backward coverage at d=256 on Blackwell.

The kernel requires nvidia-cutlass-dsl[cu13]==4.4.1 and is available through the torch operator or the standalone API.
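To illustrate the shape the new kernel targets, here is a plain-PyTorch SDPA call at head dimension 256. This runs the math reference on CPU; on supporting hardware, PyTorch's SDPA dispatch can route the same call to an accelerated backend such as cuDNN. The tensor sizes are illustrative.

```python
import torch
import torch.nn.functional as F

# SDPA at head dimension 256, the case the new Blackwell forward
# kernel covers. Layout is (batch, heads, seq_len, head_dim).
batch, heads, seq, head_dim = 2, 4, 32, 256
q = torch.randn(batch, heads, seq, head_dim)
k = torch.randn(batch, heads, seq, head_dim)
v = torch.randn(batch, heads, seq, head_dim)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```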

Grouped GEMM Weight-Gradient

New kernel exposures for weight-gradient computations in MoE scenarios:

  • GroupedGemmWgradSm100 — SM100 (Blackwell) weight-gradient kernel
  • grouped_gemm_wgrad_wrapper_sm100 — wrapper API for easy integration

These kernels enable efficient backward passes through MoE expert layers on Blackwell hardware.
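As a minimal sketch of the wgrad step these kernels accelerate (not the cuDNN API itself): for each expert, the weight gradient is the per-group matmul dW_e = X_e^T @ dY_e, where X_e holds that expert's forward activations and dY_e the upstream gradients. Shapes and group sizes are assumptions for illustration.

```python
import torch

# Reference for the weight-gradient (wgrad) step of a grouped GEMM:
# each expert's weight gradient is dW_e = X_e^T @ dY_e, computed
# independently per token group. The SM100 kernel fuses these
# per-group matmuls into one launch.
torch.manual_seed(0)
d_in, d_out = 8, 16
tokens_per_expert = [3, 5, 2]
xs = [torch.randn(n, d_in) for n in tokens_per_expert]    # forward inputs
dys = [torch.randn(n, d_out) for n in tokens_per_expert]  # upstream grads

dws = [x.t() @ dy for x, dy in zip(xs, dys)]              # grouped wgrad
```

Note that every dW_e has the same (d_in, d_out) shape regardless of how many tokens were routed to the expert, which is what makes the per-group GEMMs easy to batch.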

Full changelog: v1.22.1 on GitHub