GEMM Speedups Across Precisions

Transformer Engine supports multiple low-precision formats for the linear-layer GEMMs that dominate transformer training time: BF16, FP8 tensor-wise scaling (CurrentScaling, DelayedScaling), FP8 Block Scaling, MXFP8, and NVFP4. Each step down in precision can accelerate the 12 GEMMs per transformer layer (4 Fprop + 4 Dgrad + 4 Wgrad), but the actual speedup depends on your model’s matrix dimensions.
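To make the GEMM count concrete, the sketch below enumerates the 12 GEMMs of one layer and their (M, K, N) shapes. It is an illustration under stated assumptions, not code from Transformer Engine or the benchmark tool: it assumes a layer with four linear projections (a fused QKV projection, the attention output projection, and a non-gated two-layer MLP), and the `Gemm` type, the `layer_gemms` helper, the layer names, and the token count of 8192 are all hypothetical.

```python
from __future__ import annotations

from typing import NamedTuple


class Gemm(NamedTuple):
    layer: str  # hypothetical layer name, for illustration only
    op: str     # "fprop", "dgrad", or "wgrad"
    m: int
    k: int
    n: int

    @property
    def flops(self) -> int:
        # One multiply and one add per (m, k, n) triple.
        return 2 * self.m * self.k * self.n


def layer_gemms(tokens: int, hidden: int, intermediate: int) -> list[Gemm]:
    """List the GEMMs of one layer under the assumed four-linear layout."""
    # (name, in_features, out_features): an assumed layout, not read from any config.
    linears = [
        ("qkv", hidden, 3 * hidden),
        ("proj", hidden, hidden),
        ("fc1", hidden, intermediate),
        ("fc2", intermediate, hidden),
    ]
    gemms = []
    for name, fin, fout in linears:
        gemms.append(Gemm(name, "fprop", tokens, fin, fout))  # y = x @ W^T
        gemms.append(Gemm(name, "dgrad", tokens, fout, fin))  # dx = dy @ W
        gemms.append(Gemm(name, "wgrad", fout, tokens, fin))  # dW = dy^T @ x
    return gemms


# tokens = micro_batch_size * sequence_length; 8192 is an arbitrary example value.
gemms = layer_gemms(tokens=8192, hidden=4096, intermediate=16384)
for g in gemms:
    print(f"{g.layer:>4} {g.op:<5} M={g.m:<6} K={g.k:<6} N={g.n:<6} {g.flops / 1e12:.2f} TFLOP")
```

Under this assumed layout, one layer holds 4 * hidden^2 + 2 * hidden * intermediate weights, about 201M for hidden=4096 and intermediate=16384. Across 24 layers that is roughly 4.8B, which is consistent with the 5B configuration used in the examples below.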

A benchmark tool is provided at benchmarks/gemm/benchmark_gemm.py to measure GEMM performance for your specific model config. See the full tutorial for usage details.

The benchmark reports two numbers for each precision:

Autocast – the end-to-end speedup seen in real training. It includes both the GEMM and the per-step quantization work: converting the input tensors to the low-precision format and computing their scaling factors.
Pre-quantized – the raw GEMM kernel throughput with inputs already in the target format, excluding per-step quantization. Because the inputs are already quantized, recipes that differ only in how scaling factors are computed (e.g. DelayedScaling vs CurrentScaling) collapse to the same kernel here.

Example: 5B Model on B300 (Blackwell)

Figure: Autocast model config benchmark on NVIDIA B300 – per-layer GEMM time breakdown by precision and operation (Fprop+Dgrad and Wgrad).

For a 5B-parameter model (hidden=4096, intermediate=16384, 24 layers), MXFP8 delivers ~1.44x and NVFP4 delivers ~2.03x over BF16 in autocast mode. FP8 DelayedScaling reaches 1.61x, outperforming both FP8 CurrentScaling (1.41x) and MXFP8 on Blackwell.

In pre-quantized mode the gap widens: NVFP4 reaches 3.55x over BF16, about 1.75 times its 2.03x autocast speedup. The difference is the per-step quantization cost, which the pre-quantized number excludes.
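The two speedups can be combined into a rough estimate of how much of the autocast time goes to quantization. The sketch below is simple arithmetic on the reported numbers, not output from the benchmark tool. It assumes that autocast time equals pre-quantized kernel time plus quantization time, with both speedups measured against the same BF16 baseline. The `quantization_share` helper is a hypothetical name.

```python
def quantization_share(autocast_speedup: float, prequant_speedup: float) -> float:
    """Estimate the fraction of autocast time spent on quantization.

    With BF16 time T, autocast time is T / autocast_speedup and raw kernel
    time is T / prequant_speedup. Assuming the difference is quantization,
    its share of autocast time is 1 - autocast_speedup / prequant_speedup.
    """
    if not 0 < autocast_speedup <= prequant_speedup:
        raise ValueError("expected 0 < autocast_speedup <= prequant_speedup")
    return 1.0 - autocast_speedup / prequant_speedup


# NVFP4 on B300, using the speedups quoted above.
nvfp4_share = quantization_share(autocast_speedup=2.03, prequant_speedup=3.55)
print(f"NVFP4: {nvfp4_share:.1%} of autocast GEMM time is quantization")
# prints "NVFP4: 42.8% of autocast GEMM time is quantization"
```

Under that assumption, roughly two fifths of the NVFP4 autocast time in this example is spent preparing inputs rather than in the GEMM kernel itself.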

Example: 5B Model on H200 (Hopper)

Figure: Autocast model config benchmark on NVIDIA H200 NVL – per-layer GEMM time breakdown by precision and operation (Fprop+Dgrad and Wgrad).

For the same 5B-parameter model on H200 (Hopper), the available precisions are BF16, FP8 CurrentScaling, FP8 DelayedScaling, and FP8 Block Scaling. FP8 DelayedScaling delivers ~1.69x over BF16, followed by FP8 CurrentScaling at ~1.57x and FP8 Block Scaling at ~1.41x. FP8 Block Scaling runs natively on Hopper and is the only block-scaled FP8 recipe available on this device. In pre-quantized mode, raw FP8 reaches 1.92x over BF16.

Speedup Is Shape-Dependent

The speedup from lower precision depends on the matrix dimensions set by your model config. Large GEMMs – from big hidden and intermediate sizes and high token counts (micro_batch_size * sequence_length) – amortize the fixed quantization overhead over more compute and see meaningful speedups. Small GEMMs (e.g. the attention output projection, with K=N=hidden_size and no expansion) may see little benefit or even a slowdown when the overhead outweighs what the faster kernel saves.
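One rough way to see the amortization argument is to compare a GEMM's FLOPs with the number of operand elements that have to be quantized. The sketch below is a back-of-the-envelope model only and does not reflect how the benchmark measures anything. It assumes quantization work scales with the element count of the two operands (M x K and K x N) and ignores kernel launch costs and hardware saturation. The helper name and the token counts are hypothetical.

```python
def flops_per_quantized_element(m: int, k: int, n: int) -> float:
    """GEMM FLOPs divided by the element count of its two operands."""
    flops = 2 * m * k * n
    quantized_elements = m * k + k * n  # M x K operand plus K x N operand
    return flops / quantized_elements


hidden, intermediate = 4096, 16384
for tokens in (512, 8192):  # example values of micro_batch_size * sequence_length
    # Attention output projection: K = N = hidden_size, no expansion.
    proj = flops_per_quantized_element(tokens, hidden, hidden)
    # MLP expansion: N = intermediate size.
    fc1 = flops_per_quantized_element(tokens, hidden, intermediate)
    print(f"tokens={tokens:<5} proj={proj:8.0f}  fc1={fc1:8.0f}  FLOPs per quantized element")
```

In this model the ratio simplifies to 2 * M * N / (M + N), so it grows with both the token count and the output width. A higher ratio means more GEMM compute available to offset each quantized element, which is the intuition behind large GEMMs benefiting more than small ones.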

This is why you should benchmark with your actual config: the theoretical tensor-core speedup (e.g. 2x for FP4 vs FP8) is an upper bound that assumes the GEMM is large enough to saturate the hardware. It also makes the tool useful for architecture co-design – run candidate configs through it before committing to a training run.

See the full tutorial for detailed analysis on both Blackwell and Hopper, including per-operation (Fprop, Dgrad, Wgrad) breakdowns and manual shape mode for non-standard architectures.