Introduction
Transformer Engine accelerates deep learning on NVIDIA GPUs in several ways, with low precision training being one of the most important. This chapter introduces mixed precision training and FP8 support.
Training in BF16/FP16
Deep learning traditionally uses 32-bit floating-point (FP32) numbers. NVIDIA GPUs support lower precision formats—FP16 since Pascal, BF16 since Ampere—which offer higher throughput and lower memory usage. Let’s compare these formats.
Figure 1: Comparison of FP32, BF16, and FP16 floating-point formats showing bit allocation for sign, exponent, and mantissa.
The key differences between these formats are:
FP32 (32 bits total): 1 sign bit + 8 exponent bits + 23 mantissa bits – standard single-precision format
BF16 (16 bits total): 1 sign bit + 8 exponent bits + 7 mantissa bits – maintains FP32’s exponent range but has reduced precision
FP16 (16 bits total): 1 sign bit + 5 exponent bits + 10 mantissa bits – reduced range but higher precision than BF16
BF16’s advantage is that it shares the same exponent range as FP32, making it easier to convert between the two formats without overflow/underflow issues. FP16 offers finer precision for small values, but its limited dynamic range means loss scaling is needed to avoid overflow/underflow; see this paper on loss scaling for more details.
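The practical consequence of the narrower FP16 exponent is easy to see with a couple of casts (a plain PyTorch illustration, independent of Transformer Engine):

import torch

# FP16 overflows above ~65504, while BF16 keeps roughly FP32's exponent range.
print(torch.tensor(70000.0, dtype=torch.float16))   # inf
print(torch.tensor(70000.0, dtype=torch.bfloat16))  # 70144. (representable, but coarse)

# Very small values underflow to zero in FP16 but survive in BF16.
print(torch.tensor(1e-8, dtype=torch.float16))      # 0.
print(torch.tensor(1e-8, dtype=torch.bfloat16))     # ~1e-8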
Mixed precision
Not all operations should be run in reduced precision to preserve accuracy. Modern deep learning frameworks use mixed precision training, where different operations use different precisions based on their numerical properties:
Matrix multiplications are compute-heavy and remain numerically stable at lower precision, making them ideal candidates for acceleration.
Operations like layer normalization and softmax can work with low precision inputs and outputs, but may use high precision internally or for their weights.
Operations like loss computation and exponentiation need high precision throughout.
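As a small framework-level illustration of this per-operation policy (plain PyTorch, independent of Transformer Engine), under the default CUDA autocast rules matrix multiplications are executed in the reduced precision while operations such as softmax are kept in FP32:

import torch

a = torch.randn(1024, 1024, device="cuda")  # FP32 inputs
b = torch.randn(1024, 1024, device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    c = a @ b                                   # matmul is autocast to BF16
    p = torch.nn.functional.softmax(c, dim=-1)  # softmax is kept in FP32 by autocast

print(c.dtype)  # torch.bfloat16
print(p.dtype)  # torch.float32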
Master weights
Another consideration in mixed precision training is how to store the model weights. Lower precision formats like FP16 and BF16 have limited representational granularity, which becomes problematic during gradient updates. When a small gradient is added to a much larger weight stored in low precision, the result may round back to the original value because the update falls below the format’s precision threshold. Moreover, some elements of the gradient itself may be too small to represent in low precision, especially after accumulation across multiple GPUs in data-parallel training.
The solution is to maintain master weights in FP32. During training, weights are cast to lower precision for forward and backward passes, but the gradient updates are applied to the full-precision master copy. This ensures that even small gradients accumulate correctly over time.
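A tiny numeric example (plain PyTorch, for illustration only) shows the rounding problem and why the FP32 master copy fixes it:

import torch

update = 1e-4  # a small but meaningful gradient step

weight_bf16 = torch.tensor(1.0, dtype=torch.bfloat16)
weight_fp32 = torch.tensor(1.0, dtype=torch.float32)  # FP32 master weight

# In BF16 the update is below the spacing of representable values around 1.0,
# so the weight does not move at all.
print(weight_bf16 - update)  # tensor(1., dtype=torch.bfloat16)

# The FP32 master copy accumulates the update correctly.
print(weight_fp32 - update)  # tensor(0.9999)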
There are two common software approaches to storing master weights:
In the optimizer: The model holds low-precision weights, while the optimizer maintains FP32 copies alongside momentum and other state. During each step, the optimizer updates its FP32 copy and casts the result back to the model’s low-precision weights.
This approach makes it easier to shard master weights together with other optimizer state, for example in the ZeRO optimizer.
Since the cast happens only during the optimizer step, this approach is also faster when the optimizer runs less frequently than the forward and backward passes, e.g. with gradient accumulation or pipeline-parallel training.
In the model: The model stores weights directly in FP32, and they are cast to lower precision on-the-fly during forward and backward passes. This approach works seamlessly with any standard optimizer, requiring no special support.
Figure 2: Three approaches to weight storage—low precision only (no master weights), master weights stored in the model, and master weights stored in the optimizer.
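To make the "in the optimizer" approach concrete, here is a minimal sketch of an optimizer that keeps FP32 master copies of low-precision model weights. The MasterWeightSGD class is a hypothetical toy, not Transformer Engine's or any production optimizer's implementation:

import torch

class MasterWeightSGD(torch.optim.Optimizer):
    """Toy SGD that keeps FP32 master copies of low-precision parameters."""

    def __init__(self, params, lr=1e-3):
        super().__init__(params, defaults={"lr": lr})

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if "master" not in state:
                    state["master"] = p.detach().clone().float()  # FP32 master copy
                master = state["master"]
                # Apply the update to the FP32 master weight.
                master.add_(p.grad.float(), alpha=-group["lr"])
                # Cast the result back into the model's low-precision parameter.
                p.copy_(master.to(p.dtype))

model = torch.nn.Linear(16, 16, dtype=torch.bfloat16)
opt = MasterWeightSGD(model.parameters(), lr=1e-3)
loss = model(torch.randn(4, 16, dtype=torch.bfloat16)).sum()
loss.backward()
opt.step()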
The PyTorch API of Transformer Engine provides several mechanisms to control precision:
Weight precision: Use the params_dtype argument in any TE layer constructor.
Computation precision: Use the torch.autocast context manager. When enabled, inputs are cast to the autocast dtype before computation.
Input dtype: When torch.autocast is not used, the input tensor’s dtype determines the computation precision. In this case, inputs and parameters must have matching dtypes.
import torch
import transformer_engine.pytorch as te
from contextlib import nullcontext


def run_forward_backward(params_dtype, autocast_precision, grad_scaler_enabled):
    if grad_scaler_enabled:
        grad_scaler = torch.amp.GradScaler("cuda")

    layer = te.TransformerLayer(
        hidden_size=1024,
        ffn_hidden_size=4096,
        num_attention_heads=16,
        params_dtype=params_dtype,
    )
    x = torch.randn(32, 128, 1024, dtype=params_dtype, device="cuda")

    autocast_ctx = (
        torch.autocast(device_type="cuda", dtype=autocast_precision)
        if autocast_precision is not None
        else nullcontext()
    )
    with autocast_ctx:
        output = layer(x)
        assert output.dtype == (
            autocast_precision if autocast_precision is not None else params_dtype
        )
        loss = output.sum()

    if grad_scaler_enabled:
        grad_scaler.scale(loss).backward()
    else:
        loss.backward()


run_forward_backward(torch.float32, torch.float32, False)  # high precision training
run_forward_backward(
    torch.float32, torch.bfloat16, False
)  # bfloat16 training with master weights in FP32
run_forward_backward(
    torch.float32, torch.float16, True
)  # fp16 training with master weights in FP32, needs loss scaling
run_forward_backward(
    torch.bfloat16, torch.bfloat16, False
)  # bfloat16 training with weights in BF16
The JAX API of Transformer Engine provides two mechanisms to control precision:
Weight precision: Use the dtype argument in any TE layer constructor.
Computation precision: Determined by the dtype of the input tensor.
For training with master weights in FP32 and computation in BF16, cast the input tensor to BF16 before passing it to the layer.
import jax
import jax.numpy as jnp
from transformer_engine.jax.flax import TransformerLayer


def run_forward_backward(params_dtype, compute_dtype):
    # Create TransformerLayer
    layer = TransformerLayer(
        hidden_size=1024,
        mlp_hidden_size=4096,
        num_attention_heads=16,
        dtype=params_dtype,
    )

    # Initialize parameters
    init_key, dropout_key = jax.random.split(jax.random.PRNGKey(0))
    x = jax.random.normal(init_key, (32, 128, 1024), dtype=compute_dtype)
    var_collect = layer.init({"params": init_key, "dropout": dropout_key}, x)

    # Forward and backward pass
    def loss_fn(var_collect):
        output = layer.apply(var_collect, x, rngs={"dropout": dropout_key})
        assert output.dtype == compute_dtype
        return output.sum()

    loss, grads = jax.value_and_grad(loss_fn)(var_collect)


run_forward_backward(jnp.float32, jnp.float32)  # high precision training
run_forward_backward(jnp.float32, jnp.bfloat16)  # bfloat16 training with master weights in FP32
run_forward_backward(jnp.bfloat16, jnp.bfloat16)  # bfloat16 training with weights in BF16
Lower precisions
Transformer Engine’s primary feature is support for even lower precisions than BF16/FP16, such as FP8, MXFP8, and NVFP4. These formats are more involved than BF16/FP16: they require scaling factors to properly represent the full range of values in a tensor. Sometimes there is one scaling factor per tensor, sometimes one per block of values. A precision format combined with the associated training logic is called a recipe.
In this section we present the logic common to all recipes; each recipe is described in more detail in its own section later. Let’s now see how to train in lower precisions in the supported frameworks.
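To build intuition for the role of the scaling factor, here is a simulated per-tensor quantization using PyTorch’s float8_e4m3fn dtype. This is an illustrative sketch, not Transformer Engine’s internal implementation; 448 is the largest magnitude representable in the FP8 E4M3 format:

import torch

FP8_E4M3_MAX = 448.0  # largest magnitude representable in FP8 E4M3

def quantize_per_tensor(x):
    # One scaling factor for the whole tensor: map the largest value onto the
    # edge of the representable FP8 range, then round to FP8.
    scale = FP8_E4M3_MAX / x.abs().max()
    x_fp8 = (x * scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize(x_fp8, scale):
    return x_fp8.to(torch.float32) / scale

x = torch.randn(1024) * 0.01                 # values far below the FP8 range
x_fp8, scale = quantize_per_tensor(x)
max_error = (dequantize(x_fp8, scale) - x).abs().max()
print(scale, max_error)                      # the scale lets small values use the full FP8 range

Per-block recipes such as MXFP8 apply the same idea, but with one scaling factor per small block of values instead of per tensor.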
The PyTorch API of Transformer Engine provides an autocast context manager to control precision.
It’s similar to the torch.autocast context manager, but tailored for low precision training.
The most important argument is the recipe argument, which accepts objects inheriting from
Recipe.
Forward computations need to be performed inside the autocast context manager,
while the .backward() call should be outside of it (it inherits the setting from the
corresponding forward pass).
Here is a basic example:
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format
recipe = DelayedScaling()
layer = te.Linear(1024, 1024)
inp = torch.randn(32, 1024, dtype=torch.float32, device="cuda")
with te.autocast(enabled=True, recipe=recipe):
    output = layer(inp)

# .backward() is called outside of autocast
loss = output.sum()
loss.backward()
You can use multiple recipes in the same model in the following ways:
Sequential contexts – apply different recipes to different parts of your model:
encoder_recipe = DelayedScaling(fp8_format=Format.E4M3)
decoder_recipe = DelayedScaling(fp8_format=Format.HYBRID)
encoder = te.Linear(1024, 1024)
decoder = te.Linear(1024, 1024)
with te.autocast(enabled=True, recipe=encoder_recipe):
    hidden = encoder(inp)

with te.autocast(enabled=True, recipe=decoder_recipe):
    output = decoder(hidden)
Nested contexts – the inner context overrides the outer one for its scope:
outer_recipe = DelayedScaling(fp8_format=Format.E4M3)
inner_recipe = DelayedScaling(fp8_format=Format.HYBRID)
layer1 = te.Linear(1024, 1024)
layer2 = te.Linear(1024, 1024)
layer3 = te.Linear(1024, 1024)
with te.autocast(enabled=True, recipe=outer_recipe):
    # layer1 uses outer_recipe
    x = layer1(inp)

    with te.autocast(enabled=True, recipe=inner_recipe):
        # layer2 uses inner_recipe (overrides outer)
        x = layer2(x)

    # layer3 uses outer_recipe again
    output = layer3(x)
The JAX API of Transformer Engine provides an autocast context manager similar to PyTorch.
The key difference is that in JAX, model initialization must happen inside the autocast context
to properly capture quantization metadata in the parameter tree.
Here is a basic example:
import jax
import jax.numpy as jnp
import transformer_engine.jax as te
from transformer_engine.jax.flax import TransformerLayer
from transformer_engine.common.recipe import DelayedScaling, Format
# Set up recipe
recipe = DelayedScaling()
# Model initialization must happen inside autocast
with te.autocast(enabled=True, recipe=recipe):
    layer = TransformerLayer(
        hidden_size=1024,
        mlp_hidden_size=4096,
        num_attention_heads=16,
    )
    init_key, dropout_key = jax.random.split(jax.random.PRNGKey(0))
    x = jax.random.normal(init_key, (32, 128, 1024), dtype=jnp.bfloat16)
    var_collect = layer.init({"params": init_key, "dropout": dropout_key}, x)

    # Forward and backward pass (both inside autocast for JAX)
    def loss_fn(var_collect):
        output = layer.apply(var_collect, x, rngs={"dropout": dropout_key})
        return output.sum()

    loss, grads = jax.value_and_grad(loss_fn)(var_collect)
You can use multiple recipes in the same model in the following ways:
Sequential contexts – apply different recipes to different parts of your model:
encoder_recipe = DelayedScaling(fp8_format=Format.E4M3)
decoder_recipe = DelayedScaling(fp8_format=Format.HYBRID)
with te.autocast(enabled=True, recipe=encoder_recipe):
    encoder = TransformerLayer(hidden_size=1024, mlp_hidden_size=4096, num_attention_heads=16)
    encoder_var_collect = encoder.init({"params": init_key, "dropout": dropout_key}, x)
    hidden = encoder.apply(encoder_var_collect, x, rngs={"dropout": dropout_key})

with te.autocast(enabled=True, recipe=decoder_recipe):
    decoder = TransformerLayer(hidden_size=1024, mlp_hidden_size=4096, num_attention_heads=16)
    decoder_var_collect = decoder.init({"params": init_key, "dropout": dropout_key}, hidden)
    output = decoder.apply(decoder_var_collect, hidden, rngs={"dropout": dropout_key})
Nested contexts – the inner context overrides the outer one for its scope:
outer_recipe = DelayedScaling(fp8_format=Format.E4M3)
inner_recipe = DelayedScaling(fp8_format=Format.HYBRID)
with te.autocast(enabled=True, recipe=outer_recipe):
    # layer1 uses outer_recipe
    layer1 = TransformerLayer(hidden_size=1024, mlp_hidden_size=4096, num_attention_heads=16)
    var_collect1 = layer1.init({"params": init_key, "dropout": dropout_key}, x)
    hidden = layer1.apply(var_collect1, x, rngs={"dropout": dropout_key})

    with te.autocast(enabled=True, recipe=inner_recipe):
        # layer2 uses inner_recipe (overrides outer)
        layer2 = TransformerLayer(hidden_size=1024, mlp_hidden_size=4096, num_attention_heads=16)
        var_collect2 = layer2.init({"params": init_key, "dropout": dropout_key}, hidden)
        hidden = layer2.apply(var_collect2, hidden, rngs={"dropout": dropout_key})

    # layer3 uses outer_recipe again
    layer3 = TransformerLayer(hidden_size=1024, mlp_hidden_size=4096, num_attention_heads=16)
    var_collect3 = layer3.init({"params": init_key, "dropout": dropout_key}, hidden)
    output = layer3.apply(var_collect3, hidden, rngs={"dropout": dropout_key})
Note
Python context managers like autocast may interact unexpectedly with JAX’s JIT compilation.
For finer-grained control, consider passing the recipe directly to TE modules instead.
See the TE JAX Integration notebook
for details.
Mixed precision with 8- or 4-bit precisions
From now on, we will refer to FP8/MXFP8/NVFP4 etc. as low precision and to FP32/BF16/FP16 as high precision. This terminology will be used throughout the rest of the documentation.
Not all operations run in low precision:
Linear operations: run in low precision.
Attention computations: run in high precision by default (some recipes allow low precision as an option).
Other operations (layer normalization, softmax, etc.): run in high precision.
Within high-precision operations, there are two categories:
Configurable precision: most operations run in parameter precision (FP32/BF16/FP16) or the precision specified by torch.autocast.
Fixed FP32 precision: some operations, or parts of operations (such as the division in layernorm), always run in FP32, regardless of other settings.
Figure 3: Default precision of operations in a TransformerLayer forward pass. Only linear operations are in lower precision. Dot product attention is shown as three separate operations (QK^T, Softmax, Scores * V), though in practice these may be fused into a single kernel.
Linear layer data flow
Let’s see how the data flow of a linear layer works by default on a single H100 GPU with FP8 precision:
H100 (Hopper) architecture natively supports FP8 Matrix Multiplication only in TN layout (Transpose-NoTranspose),
so GEMM with tensors A and B returns B * A^T.
Forward pass
Input is quantized to FP8 – both input and input^T quantized versions are created.
Weights are stored in high precision and quantized to low precision before the GEMM – both weight and weight^T quantized versions are created.
An FP8 GEMM with layout TN is run with the weight and input tensors.
Outputs – the input * weight^T tensor – are returned in high precision.
Backward pass
Output gradients are quantized to FP8 – both output_grad and output_grad^T quantized versions are created.
An FP8 GEMM with layout TN is performed with the weight^T and output_grad tensors to compute the input gradients.
An FP8 GEMM with layout TN is performed with the input^T and output_grad^T tensors to compute the weight gradients.
Input gradients – the output_grad * weight tensor – are returned in high precision.
Weight gradients – the output_grad^T * input tensor – are returned in high precision.
Figure 4: Forward pass of a Linear layer with low precision data flow.
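In plain PyTorch terms (ignoring quantization and the TN-layout details), the three GEMMs above compute the following. This reference math is shown only to make the shapes and transposes concrete; it is not how Transformer Engine implements the layer:

import torch

x = torch.randn(32, 1024, requires_grad=True)    # input
w = torch.randn(1024, 1024, requires_grad=True)  # weight (out_features x in_features)

# Forward GEMM: output = input * weight^T
y = x @ w.t()
grad_y = torch.randn_like(y)                     # output gradient from the next layer

# Backward GEMMs (reference math; TE runs these as FP8 GEMMs):
grad_x = grad_y @ w                              # input gradients  = output_grad * weight
grad_w = grad_y.t() @ x                          # weight gradients = output_grad^T * input

# Cross-check against autograd.
y.backward(grad_y)
torch.testing.assert_close(grad_x, x.grad)
torch.testing.assert_close(grad_w, w.grad)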