v1.23.0 is the recommended frontend for cuDNN 9.21.0 and later. Plenty of new surface area in this one; let's run through the highlights.

Heads up: pip wheels for Python 3.14t are now published, so free-threaded folks can stop building from source.

New APIs 🚀

Causal Conv1d

A new depthwise causal 1-D convolution with optional fused SiLU activation:

y = activation(conv1d_causal(x, w) + b)

Forward and backward both work with torch.autograd and torch.compile. Requires cuDNN 9.22.0. Windows support is still pending.
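
For reference, here is a minimal PyTorch sketch of the math the fused op computes. The (batch, channels, seq_len) layout for x, (channels, kernel_size) for w, and (channels,) for b are assumptions made for illustration; this mirrors the semantics, not the frontend call itself.

    import torch
    import torch.nn.functional as F

    def causal_conv1d_silu_reference(x, w, b):
        # Depthwise causal 1-D conv + bias + SiLU, i.e. y = silu(conv1d_causal(x, w) + b).
        batch, channels, seq_len = x.shape
        kernel_size = w.shape[-1]
        # Left-pad so position t only sees inputs at positions <= t (causality).
        x_padded = F.pad(x, (kernel_size - 1, 0))
        # groups == channels makes the convolution depthwise: one filter per channel.
        y = F.conv1d(x_padded, w.view(channels, 1, kernel_size), groups=channels)
        y = y + b.view(1, channels, 1)
        return F.silu(y)  # the optional fused activation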

Graph API additions (require cuDNN 9.22.0)

A handful of long-requested ops landed on the graph API:

  • Graph::transpose: new Transpose_attributes(permutation, optional compute dtype, name).
  • Strided slice: Slice_attributes::set_strides lets you pass per-axis steps; output shape and strides are inferred accordingly. The Python pygraph.slice now honors each dim's slice.step (semantics sketched after this list).
  • In-place concatenate: Concatenate_attributes::set_in_place_index (optional). When unset, concatenate runs out-of-place per backend rules.
  • Reshape mode: new ReshapeMode_t(VIEW_ONLY, LOGICAL) plus Reshape_attributes::set_reshape_mode, so reshapes can opt into view-style vs. lexicographic logical reshape.
  • Compile-time scalar constants: cudnn.scalar_type(RUNTIME_PARAM, COMPILE_TIME_CONST) and new Graph::tensor(scalar, ScalarType) overloads. Scalars can now be either execution-time variant-pack inputs or baked into the plan as constants. Tensor_attributes can be marked accordingly.
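
To illustrate the slice semantics, a small NumPy sketch (illustration only; the actual graph entry points are the ones named above): each dimension's step determines the inferred output extent.

    import numpy as np

    x = np.arange(4 * 8 * 6).reshape(4, 8, 6)

    # Python slice objects carry (start, stop, step); step is the per-axis
    # stride that pygraph.slice now honors.
    slices = (slice(0, 4, 1), slice(1, 8, 2), slice(0, 6, 3))
    y = x[slices]

    # Inferred extent per axis: ceil((stop - start) / step).
    expected = tuple(-(-(s.stop - s.start) // s.step) for s in slices)
    assert y.shape == expected == (4, 4, 2)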

Open-source kernels 🚀

Several new fused kernels are available in the open-source kernel library:

  • GEMM + sReLU: squared-ReLU fused with GEMM (the sReLU/dsReLU math is sketched after this list).
  • GEMM + dsReLU: derivative-of-squared-ReLU fused with GEMM (for the backward pass).
  • Grouped GEMM + GLU + Hadamard: dense grouped GEMM GLU forward fused with a Hadamard transform and per-expert AMAX reduction.
  • Grouped GEMM + sReLU: contiguous grouped squared-ReLU GEMM for MoE.
  • Grouped GEMM + dsReLU: contiguous and discrete grouped derivative-of-squared-ReLU GEMM for MoE.
  • RMSNorm + RHT + amax: fused CUTE DSL kernel for Blackwell (SM100+) that does RMS normalization, a block-diagonal Hadamard transform with fixed block size 16, and a per-CTA amax reduction.
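
For reference, the squared-ReLU pair those kernels fuse with the GEMM computes the following element-wise math (illustration only; the kernels perform it fused with the matmul).

    import torch

    def srelu(x):
        # Squared ReLU: max(x, 0)^2
        return torch.relu(x).square()

    def dsrelu(x, grad_out):
        # d/dx of squared ReLU is 2 * max(x, 0), so the backward pass
        # scales the incoming gradient by it.
        return grad_out * 2.0 * torch.relu(x)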

Block-scale quantize fix

The block-scale matmul quantize path now correctly handles the 128x4 reordered scale layout (TensorReordering_t::F8_128x4). When that reordering is set on the scale tensor, the frontend automatically pads the inferred scale dimensions to align with the 128x4 block structure: non-batch, non-axis dims pad to multiples of 128, and the quantize-axis dim pads to multiples of 4.
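
As a concrete sketch of that padding rule (the helper names and the assumption that batch dims stay as inferred are mine, not frontend identifiers):

    def pad_up(value, multiple):
        return ((value + multiple - 1) // multiple) * multiple

    def padded_scale_dims(dims, quantize_axis, batch_axes=(0,)):
        padded = []
        for i, d in enumerate(dims):
            if i in batch_axes:
                padded.append(d)               # batch dims assumed to stay as inferred
            elif i == quantize_axis:
                padded.append(pad_up(d, 4))    # quantize-axis dim: multiple of 4
            else:
                padded.append(pad_up(d, 128))  # other non-batch dims: multiple of 128
        return padded

    # e.g. an inferred scale shape of (1, 120, 33) with quantize axis 2
    # pads to (1, 128, 36) under the F8_128x4 reordering.
    assert padded_scale_dims([1, 120, 33], quantize_axis=2) == [1, 128, 36]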

General Improvements ✨

  • Grouped GEMM defaults to dynamic MNKL compilation across all of GLU, dGLU, SwiGLU, dSwiGLU, SReLU, dSReLU, and quant wrappers. Set CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0 to fall back to the previous M-only dynamic behavior.
  • Caller-provided output buffers are now supported in the Grouped GEMM wgrad wrappers (wgrad_tensor for dense, wgrad_ptrs for discrete).
  • Removed an unused internal c_tensor from the Grouped GEMM quant path.

Bug fixes 🐛

  • Fixed a Grouped GEMM GLU bias compilation issue with 64B-aligned inputs under dynamic MNKL.
  • Fixed a Blackwell dropout issue when frontend v1.21 was used with cuDNN backend 9.21 / 9.22.

Benchmarking 📊

Updated SDPA benchmark numbers, and added Kimi-K2.6, LTX-2, Qwen 2.5, and Wan2.2 to the benchmark results page.

Acknowledgements

Thanks to @haowen-han for fixing a bug in the block-scale matmul sample.

Full changelog: v1.23.0 on GitHub