A focused release with a nice new fused kernel and broader GPU support.
Open-Source Kernels
Fused RMSNorm + SiLU
The highlight of this release: a single-kernel fusion of RMSNorm + SiLU optimized for B200. This targets the WAN VAE decoder pattern specifically, but it’s useful anywhere you have RMSNorm followed by SiLU activation.
It is supported on SM80 through SM103 GPUs, so Ampere, Hopper, and Blackwell are all covered.
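As a rough illustration of what "SM80 through SM103" means in practice, here is a minimal sketch that maps a CUDA compute capability (major, minor) to its SM number and checks it against that range. The function names are hypothetical and not part of any library, and treating the range as contiguous and inclusive is an assumption; the library's own support checks are authoritative.

```python
# Hypothetical helpers for illustration only; not part of any library API.
FUSED_KERNEL_MIN_SM = 80   # lower bound quoted in these notes
FUSED_KERNEL_MAX_SM = 103  # upper bound quoted in these notes


def sm_number(major: int, minor: int) -> int:
    """Convert a compute capability such as (10, 3) to an SM number such as 103."""
    return major * 10 + minor


def in_fused_kernel_range(major: int, minor: int) -> bool:
    """Return True if the device falls inside the quoted SM80..SM103 range.

    Assumption: the range is contiguous and inclusive on both ends.
    """
    return FUSED_KERNEL_MIN_SM <= sm_number(major, minor) <= FUSED_KERNEL_MAX_SM


if __name__ == "__main__":
    for capability in [(7, 5), (8, 0), (9, 0), (10, 0), (10, 3)]:
        print(capability, in_fused_kernel_range(*capability))
```

If PyTorch is available, the (major, minor) pair for the current device can be obtained from `torch.cuda.get_device_capability()` and passed straight into the check.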
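Why does this matter? Fusing normalization and activation into a single kernel eliminates the intermediate round-trip through global memory: the normalized tensor is never written out and then read back by a second kernel. For memory-bandwidth-bound operations like normalization, removing that traffic is a significant win.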
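To make the fusion concrete, here is a minimal pure-Python reference for the math involved. It is a sketch of the operation's semantics, not the shipped kernel: the function names, the `eps` default, and the per-row list layout are all assumptions made for illustration. The traffic estimate assumes each kernel reads its input once and writes its output once, and ignores the weight vector and caching.

```python
import math


def rmsnorm(x, weight, eps=1e-6):
    """RMSNorm over one row: x / sqrt(mean(x^2) + eps) * weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return [v / (1.0 + math.exp(-v)) for v in x]


def rmsnorm_silu_unfused(x, weight, eps=1e-6):
    """Two steps: the normalized row is materialized, then re-read by SiLU."""
    normalized = rmsnorm(x, weight, eps)  # intermediate result
    return silu(normalized)


def rmsnorm_silu_fused(x, weight, eps=1e-6):
    """One step: normalize and activate each element without an intermediate row."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    out = []
    for v, w in zip(x, weight):
        y = v * inv_rms * w                   # normalized value stays local
        out.append(y / (1.0 + math.exp(-y)))  # SiLU applied immediately
    return out


def traffic_elements(n, fused):
    """Rough count of elements moved to or from global memory for n inputs.

    Assumption: unfused = read x, write y, read y, write z (4n);
    fused = read x, write z (2n).
    """
    return 2 * n if fused else 4 * n


if __name__ == "__main__":
    row = [0.5, -1.0, 2.0, 3.0]
    gamma = [1.0, 1.0, 0.5, 2.0]
    print(rmsnorm_silu_unfused(row, gamma))
    print(rmsnorm_silu_fused(row, gamma))
    print(traffic_elements(len(row), fused=False), traffic_elements(len(row), fused=True))
```

Both paths compute the same values; under the stated assumptions the fused path moves half as many elements through global memory, which is where the benefit for a bandwidth-bound operation comes from.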
Broader GPU Support
The following kernels are now runnable on GB300:
- GEMM + Amax
- GEMM + SwiGLU
- Grouped GEMM + SwiGLU
- Grouped GEMM + dSwiGLU
- NSA (Native Sparse Attention) kernels
If you have access to GB300 hardware, these should just work.
Improvements
The reproducer tool for SDPA failure reporting has been enhanced further — continuing the trend from previous releases of making it easier to diagnose and report issues.
Full changelog: v1.20.0 on GitHub