cuDNN Frontend v1.19.1

This is a patch release that fixes two regressions introduced in v1.19.0. If you skipped v1.19.0, this is the one to grab.

What’s Fixed

Pinned the pybind11 version to prevent build failures with older pybind11 releases.
Restored CUDA 12 toolkit support that was accidentally dropped in v1.19.0. If you’re still on CUDA 12 (and many of you are), this is an important fix.

Core Features (from v1.19.0)

Open-Source Blackwell/Hopper SDPA Fprop Kernels

This is a big deal. We now ship open-source SDPA forward pass kernels for both Blackwell and Hopper GPUs. These support causal masking and output statistics (LSE, SE, Max) for use in the backward pass.
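To make the statistics concrete, here is a minimal NumPy sketch of a single-head SDPA forward pass with causal masking. It is a reference for the math only, not the shipped kernel or the frontend API. It assumes that Max is the per-row maximum score, SE is the per-row sum of exponentials after subtracting that maximum, and LSE is the per-row log-sum-exp; the function name and the top-left-aligned causal mask are illustrative choices, not taken from the release.

```python
import numpy as np

def sdpa_fprop_reference(q, k, v, causal=True):
    """Reference SDPA forward for one head; q, k, v have shape (seq, dim).

    Hypothetical helper for illustration; not part of cuDNN Frontend.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.T) * scale                       # (s_q, s_kv)
    if causal:
        s_q, s_kv = scores.shape
        # Mask out keys that lie strictly after the query position.
        future = np.triu(np.ones((s_q, s_kv), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    row_max = scores.max(axis=-1, keepdims=True)     # assumed "Max"
    exp_scores = np.exp(scores - row_max)
    sum_exp = exp_scores.sum(axis=-1, keepdims=True) # assumed "SE"
    lse = row_max + np.log(sum_exp)                  # assumed "LSE"
    out = (exp_scores / sum_exp) @ v
    return out, lse, sum_exp, row_max

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out, lse, sum_exp, row_max = sdpa_fprop_reference(q, k, v)
```

The backward pass can rebuild the attention probabilities as `exp(scores - lse)` without redoing the softmax reduction, which is why saving these statistics during fprop is useful.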

Grouped GEMM + dSwiGLU Fusion

Block-scaled contiguous grouped GEMM with a dSwiGLU backward epilogue, targeting SM100+ (Blackwell). Designed specifically for MoE workloads where you need to fuse the backward activation into the GEMM.
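For readers unfamiliar with the activation being fused, the sketch below shows the elementwise dSwiGLU math in NumPy. It assumes the common definition SwiGLU(gate, up) = SiLU(gate) * up. The function names and the gate/up naming are illustrative assumptions, and the sketch does not model block scaling, the grouped GEMM, or how the kernel lays out its operands.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swiglu(gate, up):
    """Forward: SiLU(gate) * up, applied elementwise."""
    return gate * sigmoid(gate) * up

def dswiglu(gate, up, dout):
    """Backward of swiglu: gradients w.r.t. gate and up given upstream dout."""
    sig = sigmoid(gate)
    silu = gate * sig
    # d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))
    dsilu = sig * (1.0 + gate * (1.0 - sig))
    d_gate = dout * up * dsilu
    d_up = dout * silu
    return d_gate, d_up

rng = np.random.default_rng(0)
gate, up, dout = (rng.standard_normal((3, 5)) for _ in range(3))
d_gate, d_up = dswiglu(gate, up, dout)
```

In an unfused pipeline this step would run as a separate elementwise kernel after the GEMM; applying it in the epilogue avoids a round trip through global memory for the GEMM output.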

General Improvements

Replaced multiple SM version device queries with a single query during graph validation. You can skip even that by passing sm_version directly on the graph object.
Fixed a CUDA graph logging crash that occurred in certain scenarios.
Reduced CPU overhead of the open-source API by switching to tvm-ffi internally.
New cudnn-repro tool for generating standalone reproducers from frontend logs — makes it much easier to file bug reports.

SDPA Enhancements

Improved support checks for cleaner error surfaces when a configuration isn’t supported.
Python bindings for the score-mod backward function.
Independent SDPA stats generation (LSE, SE, Max) during fprop — requires cuDNN backend v9.20.0+.

Normalization

New benchmark results posted for GB200, GB300, and H200 — check the repo for the numbers.

Full changelog: v1.19.1 on GitHub