A focused release with a nice new fused kernel and broader GPU support.

Open-Source Kernels

Fused RMSNorm + SiLU

The highlight of this release: a single-kernel fusion of RMSNorm + SiLU optimized for B200. This targets the WAN VAE decoder pattern specifically, but it’s useful anywhere you have RMSNorm followed by SiLU activation.
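For readers who want the math spelled out, here is a minimal pure-Python sketch of what the fused operation computes for a single row. It is an unfused reference, not the kernel itself. The function name `rmsnorm_silu_reference`, the per-element `weight` scale, and the `eps` default are illustrative assumptions; the release notes do not specify the kernel's exact parameters.

```python
import math


def rmsnorm_silu_reference(row, weight, eps=1e-6):
    """Unfused reference for one row: RMSNorm, then SiLU elementwise.

    `weight` and `eps` are assumed parameters for illustration only.
    """
    # RMSNorm: scale the row by the inverse of its root mean square,
    # then apply a per-element learned scale.
    mean_sq = sum(x * x for x in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [x * inv_rms * w for x, w in zip(row, weight)]

    # SiLU: y * sigmoid(y), written here as y / (1 + exp(-y)).
    return [y / (1.0 + math.exp(-y)) for y in normed]
```

A fused kernel produces the same values but keeps `normed` in registers or shared memory instead of materializing it as a separate tensor.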

The kernel is supported on SM80 through SM103 GPUs, which covers Ampere, Hopper, and Blackwell.
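If you want to guard a call site by compute capability, a range check along these lines is one option. This is a sketch only: `in_supported_sm_range` is a hypothetical helper, not part of the library. Falling inside the range does not guarantee that every intermediate architecture is supported, so treat it as a first filter.

```python
def in_supported_sm_range(major, minor, low=(8, 0), high=(10, 3)):
    """Return True if compute capability (major, minor) lies in [low, high].

    The default bounds mirror the SM80..SM103 range stated above.
    Hypothetical helper for illustration; not a library API.
    """
    # Tuples compare lexicographically, so (9, 0) sits between (8, 0) and (10, 3).
    return low <= (major, minor) <= high
```

With PyTorch, for example, the `(major, minor)` pair can come from `torch.cuda.get_device_capability()`.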

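Why does this matter? Fusing normalization and activation into a single kernel eliminates the intermediate memory round-trip: the normalized tensor is never written out to global memory and read back by a separate activation kernel. Normalization is memory-bandwidth-bound, so removing that traffic is where the speedup comes from.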
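To put a rough number on the round-trip, the sketch below counts global-memory traffic under an idealized model. It assumes each kernel reads its input once and writes its output once, and it ignores the weight vector and any cache effects. This is back-of-the-envelope arithmetic, not a measured result, and `elementwise_traffic_bytes` is a hypothetical helper.

```python
def elementwise_traffic_bytes(num_elements, bytes_per_element, fused):
    """Idealized global-memory traffic for RMSNorm followed by SiLU.

    Unfused: RMSNorm reads x and writes y, then SiLU reads y and writes z
    (4 tensor passes). Fused: read x, write z (2 tensor passes).
    """
    passes = 2 if fused else 4
    return passes * num_elements * bytes_per_element


# Example: a 16M-element activation stored in 2-byte precision.
n = 16 * 1024 * 1024
unfused = elementwise_traffic_bytes(n, 2, fused=False)
fused = elementwise_traffic_bytes(n, 2, fused=True)
print(f"unfused: {unfused / 2**20:.0f} MiB, fused: {fused / 2**20:.0f} MiB")
```

Under these assumptions the fused path moves half the bytes. Real gains depend on the hardware and on how the unfused kernels were implemented.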

Broader GPU Support

The following kernels are now runnable on GB300:

  • GEMM + Amax
  • GEMM + SwiGLU
  • Grouped GEMM + SwiGLU
  • Grouped GEMM + dSwiGLU
  • NSA (Native Sparse Attention) kernels

If you have access to GB300 hardware, these should just work.

Improvements

The reproducer tool for SDPA failure reporting has been enhanced further, continuing the work from previous releases to make issues easier to diagnose and report.

Full changelog: v1.20.0 on GitHub