v1.23.0 is the recommended frontend for cuDNN 9.21.0 and later. Plenty of new surface area in this one; let's run through the highlights.

Heads up: pip wheels for Python 3.14t are now published, so free-threaded folks can stop building from source.

New APIs 🚀

Causal Conv1d

A new depthwise causal 1-D convolution with optional fused SiLU activation:

y = activation(conv1d_causal(x, w) + b)

Forward and backward both work with torch.autograd and torch.compile. Requires cuDNN 9.22.0. Windows support is still pending.
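
For reference, here is a minimal PyTorch sketch of the math the fused op computes. The (batch, channels, seq_len) layout for x, (channels, kernel_size) for w, and (channels,) for b are assumptions made for illustration; this mirrors the semantics, not the frontend call itself.

    import torch
    import torch.nn.functional as F

    def causal_conv1d_silu_reference(x, w, b):
        # Depthwise causal 1-D conv + bias + SiLU, i.e. y = silu(conv1d_causal(x, w) + b).
        batch, channels, seq_len = x.shape
        kernel_size = w.shape[-1]
        # Left-pad so position t only sees inputs at positions <= t (causality).
        x_padded = F.pad(x, (kernel_size - 1, 0))
        # groups == channels makes the convolution depthwise: one filter per channel.
        y = F.conv1d(x_padded, w.view(channels, 1, kernel_size), groups=channels)
        y = y + b.view(1, channels, 1)
        return F.silu(y)  # the optional fused activation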

Graph API additions (require cuDNN 9.22.0)

A handful of long-requested ops landed on the graph API:

  • Graph::transpose: new Transpose_attributes(permutation, optional compute dtype, name).
  • Strided slice: Slice_attributes::set_strides lets you pass per-axis steps; output shape and strides are inferred accordingly. The Python pygraph.slice now honors each dim's slice.step (semantics sketched after this list).
  • In-place concatenate: Concatenate_attributes::set_in_place_index (optional). When unset, concatenate runs out-of-place per backend rules.
  • Reshape mode: new ReshapeMode_t(VIEW_ONLY, LOGICAL) plus Reshape_attributes::set_reshape_mode, so reshapes can opt into view-style vs. lexicographic logical reshape.
  • Compile-time scalar constants: cudnn.scalar_type(RUNTIME_PARAM, COMPILE_TIME_CONST) and new Graph::tensor(scalar, ScalarType) overloads. Scalars can now be either execution-time variant-pack inputs or baked into the plan as constants. Tensor_attributes can be marked accordingly.
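
To illustrate the slice semantics, a small NumPy sketch (illustration only; the actual graph entry points are the ones named above): each dimension's step determines the inferred output extent.

    import numpy as np

    x = np.arange(4 * 8 * 6).reshape(4, 8, 6)

    # Python slice objects carry (start, stop, step); step is the per-axis
    # stride that pygraph.slice now honors.
    slices = (slice(0, 4, 1), slice(1, 8, 2), slice(0, 6, 3))
    y = x[slices]

    # Inferred extent per axis: ceil((stop - start) / step).
    expected = tuple(-(-(s.stop - s.start) // s.step) for s in slices)
    assert y.shape == expected == (4, 4, 2)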

Open-source kernels 🚀

Several new fused kernels are available in the open-source kernel library:

  • GEMM + sReLU: squared-ReLU fused with GEMM (the sReLU/dsReLU math is sketched after this list).
  • GEMM + dsReLU: derivative-of-squared-ReLU fused with GEMM (for the backward pass).
  • Grouped GEMM + GLU + Hadamard: dense grouped GEMM GLU forward fused with a Hadamard transform and per-expert AMAX reduction.
  • Grouped GEMM + sReLU: contiguous grouped squared-ReLU GEMM for MoE.
  • Grouped GEMM + dsReLU: contiguous and discrete grouped derivative-of-squared-ReLU GEMM for MoE.
  • RMSNorm + RHT + amax: fused CUTE DSL kernel for Blackwell (SM100+) that does RMS normalization, a block-diagonal Hadamard transform with fixed block size 16, and a per-CTA amax reduction.
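
For reference, the squared-ReLU pair those kernels fuse with the GEMM computes the following element-wise math (illustration only; the kernels perform it fused with the matmul).

    import torch

    def srelu(x):
        # Squared ReLU: max(x, 0)^2
        return torch.relu(x).square()

    def dsrelu(x, grad_out):
        # d/dx of squared ReLU is 2 * max(x, 0), so the backward pass
        # scales the incoming gradient by it.
        return grad_out * 2.0 * torch.relu(x)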

Block-scale quantize fix

The block-scale matmul quantize path now correctly handles the 128x4 reordered scale layout (TensorReordering_t::F8_128x4). When that reordering is set on the scale tensor, the frontend automatically pads the inferred scale dimensions to align with the 128x4 block structure: non-batch, non-axis dims pad to multiples of 128, and the quantize-axis dim pads to multiples of 4.
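
As a concrete sketch of that padding rule (the helper names and the assumption that batch dims stay as inferred are mine, not frontend identifiers):

    def pad_up(value, multiple):
        return ((value + multiple - 1) // multiple) * multiple

    def padded_scale_dims(dims, quantize_axis, batch_axes=(0,)):
        padded = []
        for i, d in enumerate(dims):
            if i in batch_axes:
                padded.append(d)               # batch dims assumed to stay as inferred
            elif i == quantize_axis:
                padded.append(pad_up(d, 4))    # quantize-axis dim: multiple of 4
            else:
                padded.append(pad_up(d, 128))  # other non-batch dims: multiple of 128
        return padded

    # e.g. an inferred scale shape of (1, 120, 33) with quantize axis 2
    # pads to (1, 128, 36) under the F8_128x4 reordering.
    assert padded_scale_dims([1, 120, 33], quantize_axis=2) == [1, 128, 36]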

General Improvements ✨

  • Grouped GEMM defaults to dynamic MNKL compilation across all of GLU, dGLU, SwiGLU, dSwiGLU, SReLU, dSReLU, and quant wrappers. Set CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0 to fall back to the previous M-only dynamic behavior.
  • Caller-provided output buffers are now supported in the Grouped GEMM wgrad wrappers (wgrad_tensor for dense, wgrad_ptrs for discrete).
  • Removed an unused internal c_tensor from the Grouped GEMM quant path.

Bug fixes 🐛

  • Fixed a Grouped GEMM GLU bias compilation issue with 64B-aligned inputs under dynamic MNKL.
  • Fixed a Blackwell dropout issue when frontend v1.21 was used with cuDNN backend 9.21 / 9.22.

Benchmarking 📊

Updated SDPA benchmark numbers, and added Kimi-K2.6, LTX-2, Qwen 2.5, and Wan2.2 to the benchmark results page.

Acknowledgements

Thanks to @haowen-han for fixing a bug in the block-scale matmul sample.

Full changelog: v1.23.0 on GitHub