Fresh off the press — v1.22.0 brings tighter PyTorch integration and some serious kernel-level improvements for Blackwell.

General Improvements

This release introduces PyTorch custom operator wrapping for cuDNN’s Scaled Dot-Product Attention (SDPA). The new scaled_dot_product_attention entry point matches PyTorch’s native signature, making it a near drop-in replacement when you want cuDNN-accelerated attention in your PyTorch models.
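As a refresher on what the entry point computes (this is just the reference semantics in NumPy, not the cuDNN kernel), SDPA is a scaled softmax over query-key scores applied to the values. The signature below mirrors the shape of PyTorch's native call so you can see why a cuDNN-backed version can be a near drop-in:

```python
# Reference semantics of scaled dot-product attention (SDPA) in NumPy.
# Mirrors the shape of torch.nn.functional.scaled_dot_product_attention;
# a cuDNN-backed entry point with the same signature can replace it at
# the call site without reshaping inputs.
import numpy as np

def scaled_dot_product_attention(query, key, value, is_causal=False, scale=None):
    # query/key/value: (..., seq_len, head_dim)
    d = query.shape[-1]
    scale = scale if scale is not None else 1.0 / np.sqrt(d)
    scores = query @ key.swapaxes(-1, -2) * scale          # (..., s_q, s_k)
    if is_causal:
        s_q, s_k = scores.shape[-2:]
        causal_mask = np.triu(np.ones((s_q, s_k), dtype=bool), k=1)
        scores = np.where(causal_mask, -np.inf, scores)    # mask future keys
    # numerically stable softmax over the key dimension
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ value                                  # (..., s_q, head_dim)
```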

We’ve also introduced a preindexed execute method that reduces CPU overhead — helpful when you’re running the same graph repeatedly and don’t want to pay the dispatch tax every time.
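To see why preindexing helps, here is a minimal sketch of the pattern in plain Python. The `GraphRunner` and `preindex` names are hypothetical, not the cudnn-frontend API; the point is that resolving string keys to positional slots once moves per-call lookup work out of the hot loop:

```python
# Illustrative "preindex once, execute many" pattern. GraphRunner and
# preindex are hypothetical names, not the actual cudnn-frontend API.
class GraphRunner:
    def __init__(self, tensor_names):
        self._index = {name: i for i, name in enumerate(tensor_names)}

    def execute(self, feeds):
        # Convenient path: dict lookups on every call (repeated dispatch cost).
        slots = [None] * len(self._index)
        for name, buf in feeds.items():
            slots[self._index[name]] = buf
        return self._launch(slots)

    def preindex(self, names):
        # Resolve name -> slot once, up front.
        order = [self._index[n] for n in names]

        def execute_preindexed(bufs):
            # Hot loop: positional placement only, no string hashing.
            slots = [None] * len(self._index)
            for i, buf in zip(order, bufs):
                slots[i] = buf
            return self._launch(slots)

        return execute_preindexed

    def _launch(self, slots):
        # Stand-in for the actual kernel launch.
        return sum(s for s in slots if s is not None)
```

The same idea applies regardless of the backend: pay the name-resolution cost once when the graph is built, then execute with positional buffers.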

The reproducer tool now captures SDPA failures with fp8 data types, making it easier to file actionable bug reports.

Heads up: we’ll be rolling out more native custom torch ops in upcoming releases.

Open-Source Kernels

Blackwell SDPA Backward

The Blackwell sdpa bprop kernel now supports head dimension 256 via cuteDSL. This requires nvidia-cutlass-dsl[cu13]==4.4.1.

Grouped GEMM Updates

  • Grouped GEMM + quantize kernels now support dynamic shape/layout via an environment toggle — no more recompiling when your batch shapes change.
  • Grouped GEMM + GLU/SwiGLU now supports optional bias fusion in both dense and discrete modes, with partial-N support.
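For reference, here is the math the GLU/SwiGLU epilogue computes, written out in NumPy with the bias terms optional. The fused kernel produces this in a single pass; the weight names below are illustrative, not the kernel's parameter names:

```python
# Reference math for a SwiGLU epilogue with optional bias:
#   swiglu(x) = silu(x @ W_gate + b_gate) * (x @ W_up + b_up)
# Weight/bias names are illustrative; the fused kernel computes the
# same result in one pass over the GEMM output.
import numpy as np

def silu(x):
    # silu(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu(x, w_gate, w_up, b_gate=None, b_up=None):
    gate = x @ w_gate
    up = x @ w_up
    if b_gate is not None:
        gate = gate + b_gate   # optional bias on the gate branch
    if b_up is not None:
        up = up + b_up         # optional bias on the linear branch
    return silu(gate) * up
```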

Fixes

  • fp8 with packed variable-length sequences (THD format) is no longer supported on SM90 (Hopper). If you’re on Hopper and using THD, you’ll need to switch formats.
  • Fixed an sdpa fp8 failure that showed up with CUDA toolkit 12.9.
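If the THD note above affects you, one way off the packed layout is to unpack into a padded batch-major layout. A minimal sketch, assuming the usual THD convention of packed tokens plus cumulative sequence lengths (the helper name is ours, not a library function):

```python
# Sketch: unpack a packed variable-length (THD-style) tensor into a padded
# batch-major layout. Conventions assumed here:
#   tokens:     (total_tokens, n_heads, head_dim), sequences concatenated
#   cu_seqlens: (batch + 1,) cumulative sequence lengths, e.g. [0, 2, 5]
# The helper name thd_to_padded is illustrative, not a library API.
import numpy as np

def thd_to_padded(tokens, cu_seqlens):
    batch = len(cu_seqlens) - 1
    max_len = int(np.max(np.diff(cu_seqlens)))
    out = np.zeros((batch, max_len) + tokens.shape[1:], dtype=tokens.dtype)
    for b in range(batch):
        start, end = cu_seqlens[b], cu_seqlens[b + 1]
        out[b, : end - start] = tokens[start:end]  # copy, rest stays zero-padded
    return out
```

The padded result wastes memory on short sequences, which is exactly what THD avoided, so weigh that against staying on fp8 for this path.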

Full changelog: v1.22.0 on GitHub