GEMM Profiling Tutorial

This tutorial shows how to go from a transformer model config to concrete GEMM shapes, benchmark them across precisions (BF16, FP8 CurrentScaling, FP8 DelayedScaling, FP8 Block, MXFP8, NVFP4), and compute expected speedups. If you are using NVIDIA Transformer Engine – which handles the quantization and kernel dispatch for these precision modes – this is how you derive the matrix multiplications your model runs and measure where your time goes.

The benchmark tool is at benchmarks/gemm/benchmark_gemm.py.

Quick Start: Model Config Mode

The benchmark tool takes model hyperparameters directly and handles everything – deriving GEMM shapes, benchmarking across precisions, and computing the full speedup analysis – in a single command:

python benchmarks/gemm/benchmark_gemm.py \
  --hidden_size 4096 \
  --intermediate_size 16384 \
  --num_attention_heads 32 \
  --num_hidden_layers 24 \
  --micro_batch_size 31 \
  --sequence_length 512 \
  -o ./gemm_speedup.png

On Hopper (H100/H200), skip MXFP8 and NVFP4 which require Blackwell:

python benchmarks/gemm/benchmark_gemm.py \
  --hidden_size 4096 \
  --intermediate_size 16384 \
  --num_attention_heads 32 \
  --num_hidden_layers 24 \
  --micro_batch_size 31 \
  --sequence_length 512 \
  --no-fp8 --no-fp4 \
  -o ./gemm_speedup.png

By default the tool runs in autocast mode, which is what Transformer Engine does during training: inputs are dynamically quantized to the target precision before each GEMM, so the measured time includes both the quantization cost and the GEMM kernel itself. This gives the realistic end-to-end picture.

The tool computes M = 31 x 512 = 15,872 tokens, derives all 12 GEMM shapes (4 Fprop + 4 Dgrad + 4 Wgrad), benchmarks each across enabled precisions, and prints the full results. Fprop, Dgrad, and Wgrad shapes are all benchmarked separately to capture the impact of different matrix aspect ratios on kernel selection.

GEMM Benchmark (Model Config Mode) on NVIDIA B300 SXM6 AC
Timing method: CUDA events
Warmup iterations: 10, Timed iterations: 100
Mode: Autocast (includes quantization overhead)

==========================================================================================
Model Config: hidden=4096, intermediate=16384, heads=32, layers=24
Tokens per step: M = 31 x 512 = 15,872
==========================================================================================

Fprop Shapes:
------------------------------------------------------------------------------------------
Op                     Shape                       BF16 ms FP8Current ms FP8Delayed ms   MXFP8 ms   NVFP4 ms
------------------------------------------------------------------------------------------
QKV Proj               15872x4096x12288              1.071      0.605      0.503      0.579      0.392
Attn Out               15872x4096x4096               0.307      0.317      0.231      0.269      0.256
MLP Up                 15872x4096x16384              1.393      0.924      0.850      0.924      0.635
MLP Down               15872x16384x4096              1.426      1.033      0.901      1.076      0.649
------------------------------------------------------------------------------------------
Fprop sum (ms):                                     4.196      2.879      2.486      2.847      1.932

==========================================================================================
Per-Layer GEMM Time:
                                  BF16 ms FP8Current ms FP8Delayed ms   MXFP8 ms   NVFP4 ms
Fprop:                              4.196      2.879      2.486      2.847      1.932
Dgrad:                              4.290      3.063      2.621      3.045      2.189
Fprop + Dgrad:                      8.486      5.941      5.107      5.892      4.122
Wgrad:                              4.272      3.205      2.695      3.092      2.331
Per-layer total:                   12.758      9.147      7.802      8.984      6.453

Full Model (24 layers):
Total GEMM time (ms):             306.192    219.522    187.246    215.608    154.869

Estimated GEMM Speedups:
  MXFP8 vs BF16:  1.42x
  NVFP4 vs MXFP8: 1.39x
  NVFP4 vs BF16:  1.98x
==========================================================================================

Autocast model config benchmark showing per-layer GEMM time breakdown across precisions. — Autocast model config benchmark on NVIDIA B300 – per-layer GEMM time breakdown by precision and operation (Fprop+Dgrad and Wgrad).

Autocast vs Pre-quantized

To isolate raw GEMM kernel performance, add --pre-quantize. This pre-quantizes all inputs once before the timed loop, so the measured time reflects only the GEMM kernel execution – no dynamic quantization, no block scaling computation, no format conversion during the timed region.

Note

FP8 DelayedScaling always runs in autocast mode, even with --pre-quantize, because it relies on an amax history that requires dynamic quantization. Its times are therefore not directly comparable to other precisions in pre-quantized mode.

python benchmarks/gemm/benchmark_gemm.py \
  --hidden_size 4096 \
  --intermediate_size 16384 \
  --num_attention_heads 32 \
  --num_hidden_layers 24 \
  --micro_batch_size 31 \
  --sequence_length 512 \
  --pre-quantize \
  -o ./gemm_speedup_prequant.png

==========================================================================================
Per-Layer GEMM Time:
                                  BF16 ms FP8Current ms FP8Delayed ms   MXFP8 ms   NVFP4 ms
Fprop:                              4.250      2.158      2.555      2.365      1.254
Dgrad:                              4.434      2.329      2.745      2.397      1.305
Fprop + Dgrad:                      8.684      4.487      5.300      4.762      2.559
Wgrad:                              4.418      2.325      2.822      2.400      1.205
Per-layer total:                   13.102      6.812      8.123      7.161      3.764

Full Model (24 layers):
Total GEMM time (ms):             314.445    163.493    194.944    171.869     90.337

Estimated GEMM Speedups:
  MXFP8 vs BF16:  1.83x
  NVFP4 vs MXFP8: 1.90x
  NVFP4 vs BF16:  3.48x
==========================================================================================

Pre-quantized model config benchmark showing raw GEMM kernel throughput. — Pre-quantized model config benchmark – raw GEMM kernel throughput without quantization overhead.

Comparing the two tells you exactly how much quantization overhead costs: NVFP4 vs BF16 goes from 1.98x (autocast) to 3.48x (kernel-only). The gap between these two numbers is the overhead from dynamic quantization, Hadamard transforms, and block scaling that occurs in each training step.

Note

FP8 Block Scaling targets Hopper (SM90+), where it runs natively. On Blackwell (SM100+), FP8 Block is emulated via MXFP8 for backward compatibility – prefer MXFP8 or NVFP4 on Blackwell. See the Worked Example: 5B Model on H200 section below for Hopper-native FP8 Block Scaling benchmarks.

When to use which: Use autocast results for predicting real training speedups – that is what Transformer Engine actually does during training. Use pre-quantized results to understand whether quantization overhead is the bottleneck, or to compare raw tensor core throughput across precisions independent of the quantization implementation.

Worked Example: 5B Model on B300

Using a 5B-parameter model config (hidden=4096, intermediate=16384, MBS=31, seq_len=512, 24 layers), the full model config benchmark was run on a B300.

Looking at the per-shape NVFP4 vs MXFP8 speedups from the Fprop results:

QKV proj:   0.579 / 0.392  =  1.48x
Attn out:   0.269 / 0.256  =  1.05x  (barely faster -- overhead nearly matches GEMM gain)
MLP up:     0.924 / 0.635  =  1.46x
MLP down:   1.076 / 0.649  =  1.66x

A few things stand out:

The attn out GEMM (15872x4096x4096) gets minimal benefit from FP4. At 0.256 ms (NVFP4) vs 0.307 ms (BF16), the speedup is only 1.20x. This is the smallest weight matrix (4096x4096), and it is barely large enough for lower precision to overcome the overhead.

The big GEMMs show real but sub-theoretical gains. The FP4 tensor cores deliver 1.46–1.66x over MXFP8 on the large GEMMs – well short of the theoretical 2–3x from the hardware spec. Once you include the dead-weight attn out, the blended Fprop speedup drops to 1.47x. After adding Wgrad times, non-GEMM overhead (attention, layernorm, communication), and NVFP4-specific quantization costs (Hadamard transforms, stochastic rounding, 2D scaling), the end-to-end gap between NVFP4 and MXFP8 in training is consistent with these kernel-level numbers.

FP8 DelayedScaling is surprisingly competitive. At 7.80 ms/layer in autocast mode, it outperforms both FP8 CurrentScaling (9.15 ms) and MXFP8 (8.98 ms) on Blackwell. However, in pre-quantized mode FP8 CurrentScaling pulls ahead (6.81 ms vs 8.12 ms), suggesting DelayedScaling’s amax-history approach has lower quantization overhead but similar raw kernel throughput.

The pre-quantized results reveal the true kernel potential. Running with --pre-quantize removes quantization overhead entirely, and NVFP4 vs BF16 jumps from 1.98x (autocast) to 3.48x (kernel-only). This shows the FP4 tensor cores are delivering real speedup – it is the quantization overhead in autocast mode that narrows the gap.

The Fprop vs Dgrad comparison reveals that the x2 approximation is imprecise for quantized formats. While BF16 Dgrad is within 2% of Fprop, quantized formats show 5–13% slower Dgrad sums. The QKV Proj Dgrad is especially asymmetric – 33–51% slower than Fprop for FP8/FP4 – because swapping K (4096) and N (12288) dramatically changes the matrix aspect ratio and kernel selection.

Worked Example: 5B Model on H200

Using the same 5B-parameter model config on an NVIDIA H200 NVL (Hopper, SM90), the available precision recipes are BF16, FP8 CurrentScaling, FP8 DelayedScaling, and FP8 Block Scaling. MXFP8 and NVFP4 require Blackwell (SM100+/SM120+) and are skipped with --no-fp8 --no-fp4.

GEMM Benchmark (Model Config Mode) on NVIDIA H200 NVL
Timing method: CUDA events
Mode: Autocast (includes quantization overhead)

==========================================================================================
Per-Layer GEMM Time:
                                  BF16 ms FP8Current ms FP8Delayed ms FP8Block ms
Fprop:                             10.653      6.503      6.188      7.425
Dgrad:                             10.813      6.795      6.306      7.636
Fprop + Dgrad:                     21.466     13.298     12.494     15.061
Wgrad:                             10.548      6.987      6.484      7.821
Per-layer total:                   32.014     20.285     18.978     22.882

Full Model (24 layers):
Total GEMM time (ms):             768.335    486.851    455.473    549.171

Estimated GEMM Speedups:
  FP8Delayed vs BF16:  1.69x
  FP8Current vs BF16:  1.58x
  FP8Block vs BF16:    1.40x
==========================================================================================

Autocast model config benchmark on H200 showing per-layer GEMM time breakdown. — Autocast model config benchmark on NVIDIA H200 NVL – per-layer GEMM time breakdown by precision.

FP8 DelayedScaling is the fastest FP8 recipe on Hopper. At 18.98 ms/layer (1.69x over BF16), it outperforms both FP8 CurrentScaling (20.29 ms, 1.58x) and FP8 Block Scaling (22.88 ms, 1.40x). This is the same ordering seen on Blackwell, where DelayedScaling also outperforms CurrentScaling in autocast mode.

FP8 Block Scaling delivers the smallest speedup. At 1.40x over BF16, block scaling is slower than both tensor-wise FP8 approaches in autocast mode. The block scaling overhead – computing per-block scale factors for both rowwise and columnwise data – is not fully offset by the FP8 tensor core gains at these shapes.

In pre-quantized mode (raw kernel throughput), FP8 Block Scaling is excluded because the pre-quantized path produces 2D-by-2D block-scaled inputs, which Hopper’s cuBLAS does not support. Only FP8 CurrentScaling and FP8 DelayedScaling are benchmarked:

==========================================================================================
Per-Layer GEMM Time:
                                  BF16 ms FP8Current ms FP8Delayed ms
Fprop:                             10.632      5.577      6.207
Dgrad:                             10.747      5.661      6.375
Fprop + Dgrad:                     21.379     11.238     12.582
Wgrad:                             10.530      5.547      6.517
Per-layer total:                   31.968     16.785     19.099

Full Model (24 layers):
Total GEMM time (ms):             767.242    402.838    458.375

Estimated GEMM Speedups:
  FP8Current vs BF16:  1.90x
  FP8Delayed vs BF16:  1.67x
==========================================================================================

Pre-quantized model config benchmark on H200. — Pre-quantized model config benchmark on H200 – raw GEMM kernel throughput.

In pre-quantized mode, FP8 CurrentScaling pulls ahead. Without quantization overhead, FP8 CurrentScaling reaches 1.90x over BF16, while FP8 DelayedScaling shows 1.67x. FP8 DelayedScaling still runs in autocast mode even with --pre-quantize (it relies on an amax history), so its pre-quantized times are close to its autocast times (458 ms vs 455 ms). The gap between CurrentScaling’s autocast (487 ms) and pre-quantized (403 ms) results reveals that ~17% of its autocast time is quantization overhead.

Interpreting the Results

Once you have the GEMM-only speedup, compare it against your observed end-to-end training speedup:

GEMM speedup ~ training speedup – GEMMs are the bottleneck, everything is working as expected.
GEMM speedup >> training speedup – overhead outside of GEMMs is eating the gains. For NVFP4 in particular, this overhead includes Random Hadamard transforms on Wgrad inputs, stochastic rounding on gradients, 2D block scaling for weights, and the extra memory pass for per-tensor amax computation.
GEMM speedup ~ 1.0 even in the microbenchmark – the FP4 kernels are not actually faster at these shapes, or they are silently falling back to FP8.

The last case is especially worth checking. Set NVTE_LOG_LEVEL=1 or inspect with Nsight Systems to confirm that Transformer Engine is actually dispatching FP4 kernels. TE can silently fall back to FP8 or BF16 for layers or ops that do not support FP4 yet.

What GEMMs Do Not Cover

The linear projection GEMMs are the only ops where Transformer Engine’s precision setting (BF16 vs FP8 Block vs MXFP8 vs NVFP4) affects compute performance. The other major consumers in a transformer layer are precision-agnostic – they run the same regardless of which TE mode you use:

Attention (QK^T and softmax*V): Runs in BF16/FP16 via FlashAttention regardless of linear layer precision.
LayerNorm / RMSNorm: Typically in FP32, negligible cost.
Activation functions: Element-wise, memory-bound, unaffected by weight precision.
AllReduce (DDP/FSDP): Communication cost, independent of compute precision.

In addition, NVFP4 introduces precision-specific overhead that falls outside the GEMM kernels but is unique to FP4 mode. These ops do not exist in BF16 or MXFP8 and represent additional cost that NVFP4 must overcome to deliver a net speedup:

Random Hadamard transforms: 16x16 batched matmuls applied to both Wgrad inputs to improve quantization quality.
Stochastic rounding: Applied to gradients before FP4 quantization.
2D block scaling: Weight scaling with finer granularity than MXFP8’s 1D scaling.
Per-tensor amax passes: Extra memory pass to compute scaling factors.

This distinction matters: the precision-agnostic ops dilute GEMM speedups equally across all modes, but the NVFP4-specific ops actively widen the gap between NVFP4’s raw kernel speedup and its end-to-end speedup. This is why the autocast vs pre-quantized comparison is informative – the pre-quantized numbers show what the tensor cores can do, while the autocast numbers include both categories of overhead.

Manual Shape Mode

If you need to benchmark shapes that do not map to a standard transformer config – diffusion models, mixture-of-experts, or non-standard architectures – or want to profile individual GEMMs in isolation, you can pass explicit MxKxN triplets with the --shapes flag:

# Fprop shapes for the 5B config
python benchmarks/gemm/benchmark_gemm.py -o roofline_fprop.png \
  --shapes 15872x4096x12288,15872x4096x4096,15872x4096x16384,15872x16384x4096

# Dgrad shapes (K and N swapped from Fprop)
python benchmarks/gemm/benchmark_gemm.py -o roofline_dgrad.png \
  --shapes 15872x12288x4096,15872x4096x4096,15872x16384x4096,15872x4096x16384

# Wgrad shapes
python benchmarks/gemm/benchmark_gemm.py -o roofline_wgrad.png \
  --shapes 4096x15872x12288,4096x15872x4096,4096x15872x16384,16384x15872x4096

This mode prints per-shape TFLOPS and ms but does not compute per-layer or full-model totals – you would sum the ms values and compute speedups manually. The --shapes flag is mutually exclusive with model config arguments.

Appendix: How the Shapes Are Derived

Note

This section is reference material – the tool handles all of this automatically. Read on if you want to understand the mechanics behind the shape derivation and speedup calculation.

The first thing to establish is M – the token dimension. Every linear layer in a transformer operates on a 2D matrix of shape [tokens, features], where tokens = micro_batch_size * sequence_length. For the example config:

M = 31 x 512 = 15,872

This is the batch dimension for every single GEMM in a forward or backward pass through one layer. It stays constant across all ops.

The Linear Layer Convention

Every linear layer computes Y = X @ W, which is a matrix multiply C = A x B where:

A is the activation: [M, K]
B is the weight: [K, N]
C is the output: [M, N]

The mapping is:

Symbol	Meaning
M	Number of tokens (`micro_batch_size * sequence_length`)
K	Input feature dimension (contracted/summed over)
N	Output feature dimension

Your model config gives you K and N. Your batch config gives you M. That is all you need.

Note

Throughout this guide and in the tool’s output, GEMM shapes are written as MxKxN – tokens x input features x output features. The --shapes flag uses the same ordering.

Forward Pass GEMMs

A standard transformer layer has four major linear projections.

1. QKV Projection

Projects the input into queries, keys, and values as a single fused linear layer:

Input features (K) = hidden_size = 4096
Output features (N) = 3 x hidden_size = 12,288

Y = X @ W_qkv
[15872, 4096] x [4096, 12288] -> [15872, 12288]

M = 15,872    K = 4,096    N = 12,288

2. Attention Output Projection

After attention, project back to the hidden dimension:

Input features (K) = hidden_size = 4096
Output features (N) = hidden_size = 4096

Y = X @ W_out
[15872, 4096] x [4096, 4096] -> [15872, 4096]

M = 15,872    K = 4,096    N = 4,096

3. MLP Up Projection (Gate + Up)

The MLP first projects up to the intermediate dimension. In gated architectures (SwiGLU, etc.), this is typically fused into a single projection:

Input features (K) = hidden_size = 4096
Output features (N) = intermediate_size = 16,384

Y = X @ W_up
[15872, 4096] x [4096, 16384] -> [15872, 16384]

M = 15,872    K = 4,096    N = 16,384

4. MLP Down Projection

Projects back from intermediate dimension to hidden dimension:

Input features (K) = intermediate_size = 16,384
Output features (N) = hidden_size = 4096

Y = X @ W_down
[15872, 16384] x [16384, 4096] -> [15872, 4096]

M = 15,872    K = 16,384    N = 4,096

Forward Summary

Op	Pass	M	K	N	Shape (MxKxN)	FLOPs (2MK*N)
QKV proj	Forward	15,872	4,096	12,288	15872x4096x12288	~1.60T
Attn out proj	Forward	15,872	4,096	4,096	15872x4096x4096	~0.53T
MLP up	Forward	15,872	4,096	16,384	15872x4096x16384	~2.13T
MLP down	Forward	15,872	16,384	4,096	15872x16384x4096	~2.13T
Total/layer						~6.39T

Backward Pass GEMMs

The backward pass through each linear layer produces two GEMMs: one for the gradient with respect to the input (dX), and one for the gradient with respect to the weights (dW).

Given forward Y = X @ W where X is [M, K] and W is [K, N]:

dX = dY @ W^T – The gradient flows back through the transposed weight matrix. The contraction axis is now N (the output features from the forward pass):

M = tokens    K = out_features (N from forward)    N = in_features (K from forward)

dW = X^T @ dY – The weight gradient contracts over the token dimension:

M = in_features    K = tokens    N = out_features

Full Backward Table

Op	Pass	M	K	N	Shape (MxKxN)
QKV proj	Backward (dX)	15,872	12,288	4,096	15872x12288x4096
QKV proj	Backward (dW)	4,096	15,872	12,288	4096x15872x12288
Attn out	Backward (dX)	15,872	4,096	4,096	15872x4096x4096
Attn out	Backward (dW)	4,096	15,872	4,096	4096x15872x4096
MLP up	Backward (dX)	15,872	16,384	4,096	15872x16384x4096
MLP up	Backward (dW)	4,096	15,872	16,384	4096x15872x16384
MLP down	Backward (dX)	15,872	4,096	16,384	15872x4096x16384
MLP down	Backward (dW)	16,384	15,872	4,096	16384x15872x4096

Total FLOP Budget

Each backward GEMM has the same FLOPs as its corresponding forward GEMM (the dimensions are just rearranged), and there are two per op, so:

Per layer: ~6.39T (fwd) + ~12.78T (bwd) = ~19.17 TFLOPS
Full model (24 layers): ~460 TFLOPS per step