v1.23.0 is the recommended frontend for cuDNN 9.21.0 and later. Plenty of new surface area in this one – let's run through the highlights.
Heads up: pip wheels for Python 3.14t are now published, so free-threaded folks can stop building from source.
New APIs
Causal Conv1d
A new depthwise causal 1-D convolution with optional fused SiLU activation:
y = activation(conv1d_causal(x, w) + b)
Forward and backward both work with torch.autograd and torch.compile. Requires cuDNN 9.22.0. Windows support is still pending.
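For reference, here is a minimal PyTorch sketch of the math this op fuses (depthwise causal 1-D convolution plus bias and SiLU). It illustrates the formula above only; it is not the cuDNN frontend API, and the (N, C, T) layout and shapes are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def causal_conv1d_silu_ref(x, w, b, activation=True):
    """Reference for y = activation(conv1d_causal(x, w) + b).

    Assumed shapes (illustrative, not the cuDNN frontend signature):
      x: (N, C, T)   input sequence
      w: (C, 1, K)   one K-tap filter per channel (depthwise)
      b: (C,)        per-channel bias
    """
    K = w.shape[-1]
    # Causal: pad only on the left so y[:, :, t] depends on x[:, :, :t+1].
    x = F.pad(x, (K - 1, 0))
    # groups=C makes the convolution depthwise (one filter per channel).
    y = F.conv1d(x, w, bias=b, groups=x.shape[1])
    return F.silu(y) if activation else y

# Quick shape check of the reference math.
x = torch.randn(2, 8, 32)
w = torch.randn(8, 1, 4)
b = torch.randn(8)
print(causal_conv1d_silu_ref(x, w, b).shape)  # torch.Size([2, 8, 32])
```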
Graph API additions (require cuDNN 9.22.0)
A handful of long-requested ops landed on the graph API:
- `Graph::transpose` – new `Transpose_attributes` (permutation, optional compute dtype, name).
- Strided slice – `Slice_attributes::set_stride` lets you pass per-axis steps; output shape and strides are inferred accordingly. The Python `pygraph.slice` now honors each dim's `slice.step` (see the sketch after this list).
- In-place concatenate – `Concatenate_attributes::set_in_place_index` (optional). When unset, concatenate runs out-of-place per backend rules.
- Reshape mode – new `ReshapeMode_t` (`VIEW_ONLY`, `LOGICAL`) plus `Reshape_attributes::set_reshape_mode`, so reshapes can opt into view-style vs. lexicographic logical reshape.
- Compile-time scalar constants – `cudnn.scalar_type` (`RUNTIME_PARAM`, `COMPILE_TIME_CONST`) and new `Graph::tensor(scalar, ScalarType)` overloads. Scalars can now be either execution-time variant-pack inputs or baked into the plan as constants; `Tensor_attributes` can be marked accordingly.
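As a shape-semantics illustration only (plain NumPy, not the frontend API), the sketch below shows what per-axis start/stop/step slicing implies for the inferred output shape, assuming standard Python slice semantics; the helper name is made up for the example.

```python
import math
import numpy as np

# Inferred output shape for a per-axis strided slice, assuming standard
# semantics: start inclusive, stop exclusive, positive step.
def sliced_shape(shape, starts, stops, steps):
    return tuple(math.ceil((stop - start) / step)
                 for start, stop, step in zip(starts, stops, steps))

x = np.arange(4 * 16 * 64).reshape(4, 16, 64)
starts, stops, steps = (0, 2, 8), (4, 16, 64), (1, 2, 4)

y = x[tuple(slice(b, e, s) for b, e, s in zip(starts, stops, steps))]
assert y.shape == sliced_shape(x.shape, starts, stops, steps)  # (4, 7, 14)
```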
Open-source kernels
Several new fused kernels are available in the open-source kernel library:
- GEMM + sReLU – squared-ReLU fused with GEMM (reference math for sReLU/dsReLU in the sketch after this list).
- GEMM + dsReLU – derivative-of-squared-ReLU fused with GEMM (for the backward pass).
- Grouped GEMM + GLU + Hadamard – dense grouped GEMM GLU forward fused with a Hadamard transform and per-expert AMAX reduction.
- Grouped GEMM + sReLU – contiguous grouped squared-ReLU GEMM for MoE.
- Grouped GEMM + dsReLU – contiguous and discrete grouped dsReLU GEMM for MoE.
- RMSNorm + RHT + amax – fused CUTE DSL kernel for Blackwell (SM100+) that does RMS normalization, a block-diagonal Hadamard transform with fixed block size 16, and a per-CTA amax reduction.
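Since sReLU and its derivative show up in several of these kernels, here is a tiny PyTorch reference of the activation math. This uses the standard squared-ReLU definition and is an assumption about the kernels' exact formulation, not taken from their source.

```python
import torch

def srelu(x):
    """Squared ReLU: relu(x) ** 2."""
    return torch.relu(x) ** 2

def dsrelu(x):
    """Derivative of squared ReLU w.r.t. x: 2 * relu(x)."""
    return 2.0 * torch.relu(x)

# Sanity check against autograd.
x = torch.randn(1024, requires_grad=True)
srelu(x).sum().backward()
assert torch.allclose(x.grad, dsrelu(x.detach()))
```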
Block-scale quantize fix
The block-scale matmul quantize path now correctly handles the 128x4 reordered scale layout (`TensorReordering_t::F8_128x4`). When that reordering is set on the scale tensor, the frontend automatically pads the inferred scale dimensions to align with the 128x4 block structure: non-batch, non-axis dims pad to multiples of 128, and the quantize-axis dim pads to multiples of 4.
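To make the rounding concrete, here is a small sketch of that padding rule with a hypothetical helper; which dims count as batch dims is an assumption for the example, not taken from the frontend's inference logic.

```python
def round_up(n, multiple):
    return ((n + multiple - 1) // multiple) * multiple

# Illustration of the F8_128x4 scale-dim padding rule described above:
# non-batch, non-quantize-axis dims pad to a multiple of 128, the
# quantize-axis dim pads to a multiple of 4, batch dims are left untouched.
def pad_scale_dims(scale_dims, quantize_axis, batch_axes=(0,)):
    padded = []
    for axis, dim in enumerate(scale_dims):
        if axis in batch_axes:
            padded.append(dim)
        elif axis == quantize_axis:
            padded.append(round_up(dim, 4))
        else:
            padded.append(round_up(dim, 128))
    return padded

print(pad_scale_dims([2, 200, 7], quantize_axis=2))  # [2, 256, 8]
```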
General Improvements
- Grouped GEMM defaults to dynamic MNKL compilation across all of the GLU, dGLU, SwiGLU, dSwiGLU, SReLU, dSReLU, and quant wrappers. Set `CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL=0` to fall back to the previous M-only dynamic behavior (see the snippet after this list).
- Caller-provided output buffers are now supported in the Grouped GEMM wgrad wrappers (`wgrad_tensor` for dense, `wgrad_ptrs` for discrete).
- Removed an unused internal `c_tensor` from the Grouped GEMM quant path.
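A minimal way to opt back into the old behavior from Python; setting the variable this early is an assumption about when the frontend reads it, and exporting it in the shell before launch works regardless.

```python
import os

# Fall back to the previous M-only dynamic MNKL behavior for the
# grouped GEMM wrappers (set before any grouped GEMM graph is built).
os.environ["CUDNN_FE_GROUPED_GEMM_DYNAMIC_MNKL"] = "0"
```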
Bug fixes
- Fixed a Grouped GEMM GLU bias compilation issue with 64B-aligned inputs under dynamic MNKL.
- Fixed a Blackwell dropout issue when frontend v1.21 was used with cuDNN backend 9.21 / 9.22.
Benchmarking
Updated SDPA benchmark numbers, and added Kimi-K2.6, LTX-2, Qwen 2.5, and Wan2.2 to the benchmark results page.
Acknowledgements
Thanks to @haowen-han for fixing a bug in the block-scale matmul sample.
Full changelog: v1.23.0 on GitHub