This is a patch release that fixes two regressions from v1.19.0 — if you skipped 1.19.0, this is the one to grab.
What’s Fixed
- Pinned the pybind11 version to prevent build failures with older pybind11 releases.
- Restored CUDA 12 toolkit support that was accidentally dropped in v1.19.0. If you’re still on CUDA 12 (and many of you are), this is an important fix.
Core Features (from v1.19.0)
Open-Source Blackwell/Hopper SDPA Fprop Kernels
This is a big deal. We now ship open-source SDPA forward pass kernels for both Blackwell and Hopper GPUs. These support causal masking and output statistics (LSE, SE, Max) for use in the backward pass.
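For readers unfamiliar with the statistics, here is a minimal pure-Python sketch of what they could look like for a causally masked score matrix. It is a reference computation for illustration, not the shipped kernel or the cuDNN API. The function names are hypothetical, and the definitions used (Max as the row maximum, SE as the sum of exponentials after subtracting Max, LSE as Max + log(SE)) are assumptions; the library's exact conventions may differ.

```python
import math


def causal_softmax_stats(scores):
    """Per-row Max, SE and LSE for a square score matrix under a causal mask.

    scores[i][j] is the (already scaled) dot product of query i with key j.
    Under causal masking, query i only sees keys j <= i.
    """
    row_max, sum_exp, lse = [], [], []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                        # causal mask: drop keys j > i
        m = max(visible)                              # Max, used for numerical stability
        se = sum(math.exp(s - m) for s in visible)    # SE (assumed relative to Max)
        row_max.append(m)
        sum_exp.append(se)
        lse.append(m + math.log(se))                  # LSE = Max + log(SE)
    return row_max, sum_exp, lse


def softmax_from_lse(scores, lse):
    """Rebuild causal softmax probabilities from LSE alone."""
    return [
        [math.exp(s - lse[i]) if j <= i else 0.0 for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]
```

The second function shows why saving these statistics helps the backward pass: the attention probabilities can be recomputed from the scores and LSE without storing the full probability matrix.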
Grouped GEMM + dSwiGLU Fusion
Block-scaled contiguous grouped GEMM with a dSwiGLU backward epilogue, targeting SM100+ (Blackwell). Designed specifically for MoE workloads where you need to fuse the backward activation into the GEMM.
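To make the fusion concrete, here is a rough pure-Python reference of the math involved, assuming SwiGLU is defined as silu(gate) * up. It omits block scaling entirely, and the function names, argument layout, and weight shapes are hypothetical choices for illustration, not the kernel's interface.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def swiglu(gate, up):
    """Forward activation: silu(gate) * up."""
    return gate * sigmoid(gate) * up


def dswiglu(gate, up, grad_out):
    """Gradients of swiglu with respect to gate and up."""
    s = sigmoid(gate)
    silu = gate * s
    dsilu = s * (1.0 + gate * (1.0 - s))   # derivative of silu(gate)
    return grad_out * up * dsilu, grad_out * silu


def grouped_gemm_dswiglu(grad_y, weights, group_sizes, gate, up):
    """Grouped GEMM followed by a dSwiGLU epilogue (unscaled reference).

    grad_y:      tokens x N, rows sorted so each expert's tokens are contiguous
    weights[e]:  N x H matrix for expert e (assumed layout)
    group_sizes: number of tokens routed to each expert
    gate, up:    tokens x H pre-activations saved from the forward pass
    """
    d_gate, d_up = [], []
    start = 0
    for w, size in zip(weights, group_sizes):
        for t in range(start, start + size):
            # GEMM row: gradient with respect to the SwiGLU output
            grad_h = [
                sum(grad_y[t][k] * w[k][j] for k in range(len(w)))
                for j in range(len(w[0]))
            ]
            # Epilogue: apply the activation backward while the row is "in registers"
            row = [dswiglu(g, u, gh) for g, u, gh in zip(gate[t], up[t], grad_h)]
            d_gate.append([r[0] for r in row])
            d_up.append([r[1] for r in row])
        start += size
    return d_gate, d_up
```

Doing the activation backward inside the GEMM epilogue avoids writing the intermediate gradient to memory and reading it back in a separate kernel, which is the point of the fusion for MoE layers.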
General Improvements
- Replaced multiple SM version device queries with a single query during graph validation. You can skip even that by passing sm_version directly on the graph object.
- Fixed a CUDA graph logging crash that occurred in certain scenarios.
- Reduced CPU overhead of the open-source API by switching to tvm-ffi internally.
- New cudnn-repro tool for generating standalone reproducers from frontend logs, which makes it much easier to file bug reports.
SDPA Enhancements
- Improved support checks for cleaner error surfaces when a configuration isn’t supported.
- Python bindings for the score-mod backward function.
- Independent SDPA stats generation (LSE, SE, Max) during fprop — requires cuDNN backend v9.20.0+.
Normalization
New benchmark results posted for GB200, GB300, and H200 — check the repo for the numbers.
Full changelog: v1.19.1 on GitHub