Common API

class transformer_engine.common.recipe.Format(*args, **kwds)

Supported FP8 and FP4 formats.

Values:

E2M1 – All FP4 tensors are in e2m1 format
E4M3 – All FP8 tensors are in e4m3 format
E5M2 – All FP8 tensors are in e5m2 format
HYBRID – FP8 tensors in the forward pass are in e4m3 format, FP8 tensors in the backward pass are in e5m2 format
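
As a rough illustration of how these formats differ, the sketch below lists the largest finite magnitude of each element type and shows how HYBRID resolves to a concrete FP8 type per pass. The `MAX_VALUE` table and the `resolve_fp8_dtype` helper are hypothetical names used only for illustration; they are not part of the Transformer Engine API.

```python
# Largest finite magnitude of each element type (hypothetical lookup table,
# not a Transformer Engine object).
MAX_VALUE = {"E2M1": 6.0, "E4M3": 448.0, "E5M2": 57344.0}


def resolve_fp8_dtype(fmt: str, backward: bool) -> str:
    """Return the concrete FP8 element type used for one pass."""
    if fmt == "HYBRID":
        # HYBRID: e4m3 in the forward pass, e5m2 in the backward pass.
        return "E5M2" if backward else "E4M3"
    if fmt in ("E4M3", "E5M2"):
        return fmt
    raise ValueError(f"{fmt} is not an FP8 format")


for is_backward in (False, True):
    dtype = resolve_fp8_dtype("HYBRID", is_backward)
    print("backward" if is_backward else "forward", dtype, MAX_VALUE[dtype])
```

E5M2 trades mantissa bits for exponent range, which is why HYBRID reserves it for the backward pass.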

class transformer_engine.common.recipe.DelayedScaling(margin=0, fp8_format=Format.HYBRID, amax_history_len=1024, amax_compute_algo='max', scaling_factor_compute_algo=None)

Use the delayed scaling factor strategy: the scaling factor from the previous iteration is used, and an amax history of amax_history_len steps is recorded.

Parameters:

margin (int, default = 0) – Margin for the scaling factor computation.
fp8_format ({Format.E4M3, Format.HYBRID}, default = Format.HYBRID) – Controls the FP8 data format used during forward and backward pass.
amax_history_len (int, default = 1024) – The length of the amax history window used for scaling factor computation.
amax_compute_algo ({'max', 'most_recent', Callable}, default = 'max') –
Algorithm used for choosing the amax value for the scaling factor computation. There are 2 predefined choices: max chooses the largest amax in the history window, while most_recent always chooses the most recently seen value. Alternatively, one may pass a function of the signature:
```
def amax_compute(amax_history: Tensor) -> Tensor
```
where Tensor is a framework tensor type.

scaling_factor_compute_algo (Callable, default = None) –

Algorithm used for computing the new scaling factor based on the value of amax. It should be a function of the signature:

```
def scaling_factor_compute(amax: Tensor,
                           old_scaling_factor: Tensor,
                           fp8_max: Tensor,
                           recipe: DelayedScaling) -> Tensor
```

where Tensor is a framework tensor type.

reduce_amax (bool, default = True) – By default, if torch.distributed is initialized, the amax value for FP8 tensors is reduced across the amax_reduction_group (specified in the autocast call). This keeps the amaxes and scaling factors synced across the given distributed group. If set to False, this reduction is skipped and every GPU maintains local amaxes and scaling factors. To ensure results are numerically identical across checkpointing boundaries in this case, all ranks must checkpoint in order to store the local tensors.
fp8_dpa (bool, default = False) – Whether to enable FP8 dot product attention (DPA). When the model is placed in an autocast(enabled=True) region and fp8_dpa is set to True, DPA casts the inputs from higher precision to FP8, performs attention in FP8, and casts tensors back to higher precision as outputs. FP8 DPA is currently supported only in the FusedAttention backend.
fp8_mha (bool, default = False) – Whether to enable FP8 multi-head attention (MHA). When True, it removes the casting operations mentioned above at the DPA boundaries. Currently, only standard MHA modules, i.e. LayerNormLinear/Linear + DPA + Linear, are supported for this feature. When fp8_mha = False and fp8_dpa = True, a typical MHA module works as LayerNormLinear (BF16 output) -> (cast to FP8) FP8 DPA (cast to BF16) -> Linear. When fp8_mha = True and fp8_dpa = True, it becomes LayerNormLinear (FP8 output) -> FP8 DPA -> Linear.
backward_override ({None, 'high_precision', 'dequantized'}, default = None) – Backward precision mode. Delayed scaling only supports None.
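
The two callable parameters above, amax_compute_algo and scaling_factor_compute_algo, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: Python floats and lists stand in for framework tensors, `RecipeStub` is a hypothetical stand-in for the recipe object (only `margin` is modeled), and the history is assumed to hold the most recent amax at index 0. The fallback policy for unusable amax values is an illustrative choice, not a description of the library's behavior.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class RecipeStub:
    """Hypothetical stand-in for the recipe object; only `margin` is modeled."""
    margin: int = 0


def amax_compute(amax_history: List[float]) -> float:
    """Custom reduction: largest amax among the 4 most recent entries.

    Assumes index 0 holds the most recent value (an assumption of this sketch).
    """
    return max(amax_history[:4])


def scaling_factor_compute(amax: float,
                           old_scaling_factor: float,
                           fp8_max: float,
                           recipe: RecipeStub) -> float:
    """Keep the previous scale when amax is unusable, else use FP8_MAX / amax / 2**margin."""
    if not math.isfinite(amax) or amax <= 0.0:
        return old_scaling_factor
    return (fp8_max / amax) / (2 ** recipe.margin)


history = [2.0, 7.0, 1.0, 3.0, 100.0]   # the old outlier 100.0 falls outside the window
amax = amax_compute(history)
new_scale = scaling_factor_compute(amax, 1.0, 448.0, RecipeStub(margin=0))
print(amax, new_scale)
```

In real use, callables of this shape would operate on framework tensors and be passed as the amax_compute_algo and scaling_factor_compute_algo arguments.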

Notes

By default (when scaling_factor_compute_algo is left as None) the scaling factor is computed from the final amax value using the formula:
```
FP8_MAX = maximum_representable_value(fp8_format)
new_scaling_factor = (FP8_MAX / amax) / (2 ^ margin)
```
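
A small worked example of the formula above, reading `^` as exponentiation. The `FP8_MAX` table and the `default_scaling_factor` helper are illustrative names, not library functions; the maximum values are the largest finite magnitudes of the two FP8 element types.

```python
# Largest finite magnitudes of the FP8 element types.
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


def default_scaling_factor(amax: float, fp8_format: str = "E4M3", margin: int = 0) -> float:
    """new_scaling_factor = (FP8_MAX / amax) / 2**margin"""
    return (FP8_MAX[fp8_format] / amax) / (2 ** margin)


# With amax = 3.5 the largest observed value maps exactly onto FP8_MAX.
print(default_scaling_factor(3.5))             # 448 / 3.5
# Each unit of margin halves the scale, leaving headroom for larger future values.
print(default_scaling_factor(3.5, margin=1))
```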
fp8_dpa and fp8_mha are Beta features, and their API and functionality are subject to change in future Transformer Engine releases.

class transformer_engine.common.recipe.MXFP8BlockScaling(fp8_format=Format.E4M3)

Use the MXFP8 scaling factor strategy.

In this strategy, tensors are scaled in blockwise fashion. Each group of 32 consecutive values is scaled together using their own scaling factor. The type of the scaling factor is E8M0 (8 bits of exponent, 0 bits of mantissa), equivalent to scaling by a power of 2.
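
The blockwise scaling described above can be sketched as follows. This is an illustration under assumptions, not the library's kernel: it picks, for each group of 32 values, the smallest power of two that brings the block amax within the E4M3 range, and it ignores the finite exponent range of E8M0. The exact rounding rule used by Transformer Engine may differ, and the helper names are hypothetical.

```python
import math
from typing import List

BLOCK_SIZE = 32     # group size stated in the text
E4M3_MAX = 448.0    # largest finite E4M3 magnitude


def ceil_log2(x: float) -> int:
    """Smallest integer e with 2**e >= x, for x > 0."""
    mantissa, exponent = math.frexp(x)   # x == mantissa * 2**exponent, 0.5 <= mantissa < 1
    return exponent - 1 if mantissa == 0.5 else exponent


def mxfp8_block_scales(values: List[float]) -> List[float]:
    """One power-of-two scale per group of 32 consecutive values."""
    scales = []
    for start in range(0, len(values), BLOCK_SIZE):
        amax = max(abs(v) for v in values[start:start + BLOCK_SIZE])
        exponent = 0 if amax == 0.0 else ceil_log2(amax / E4M3_MAX)
        scales.append(2.0 ** exponent)
    return scales


values = [1.0] * 32 + [1000.0] * 32
print(mxfp8_block_scales(values))   # small block gets a small scale, large block a large one
```

Dividing each block by its own scale keeps small-magnitude blocks from being crushed by outliers elsewhere in the tensor.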

Since the scaling happens in a particular direction (either rowwise or columnwise), in this recipe the quantized tensor and its transpose are not numerically equivalent. Due to this, when Transformer Engine needs both the MXFP8 tensor and its transpose (e.g. to calculate both forward and backward pass), during the quantization both versions are computed from the high precision input to avoid double quantization errors.
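
To see why the quantized tensor and its transpose differ, the sketch below computes per-block amax values along rows and along columns of the same matrix. A hypothetical block size of 2 is used for readability (the recipe uses 32), and `block_amaxes` is an illustrative helper, not a library function.

```python
from typing import List


def block_amaxes(matrix: List[List[float]], block_size: int,
                 columnwise: bool = False) -> List[List[float]]:
    """Per-block absolute maxima, with blocks running along rows or along columns."""
    if columnwise:
        matrix = [list(col) for col in zip(*matrix)]   # transpose, then block along rows
    return [
        [max(abs(v) for v in row[s:s + block_size]) for s in range(0, len(row), block_size)]
        for row in matrix
    ]


matrix = [[1.0, 1000.0],
          [2.0, 2000.0]]
rowwise = block_amaxes(matrix, block_size=2)
columnwise = block_amaxes(matrix, block_size=2, columnwise=True)
print(rowwise)      # the value 1.0 shares a block with 1000.0
print(columnwise)   # the value 1.0 shares a block with 2.0
```

Because the element 1.0 is scaled by a different factor in each direction, quantizing once and transposing the result would not match quantizing the transposed input.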

Parameters:

fp8_format ({Format.E4M3, Format.HYBRID}, default = Format.E4M3) – Controls the FP8 data format used during forward and backward pass.
backward_override ({None, 'high_precision', 'dequantized'}, default = None) – Backward precision mode. None does not modify backward behavior, high_precision keeps original high-precision operands for backward, and dequantized dequantizes saved operands to the active high-precision compute dtype (e.g. BF16/FP16/FP32) for backward.

class transformer_engine.common.recipe.NVFP4BlockScaling(fp4_format=Format.E2M1)

Use the NVFP4 scaling strategy.

This is a 2-level block scaling strategy. In level 1, each group of 16 consecutive values is scaled together using their own scaling factor. The type of the scaling factor is E4M3 (4 bits of exponent, 3 bits of mantissa). In level 2, a global per tensor FP32 scaling factor is used to scale the entire tensor.
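
The two scaling levels can be sketched as below. This is a simplified illustration under assumptions: the global factor is chosen so that the largest block scale equals the E4M3 maximum (448) and the largest element maps to the E2M1 maximum (6), and rounding of the block scales to E4M3 and of the elements to E2M1 is omitted. The function name is hypothetical.

```python
from typing import List, Tuple

FP4_MAX = 6.0       # largest E2M1 magnitude
E4M3_MAX = 448.0    # largest finite E4M3 magnitude
BLOCK_SIZE = 16     # group size stated in the text


def nvfp4_scales(values: List[float]) -> Tuple[float, List[float]]:
    """Return (global FP32 scale, per-block scales) such that
    value ~= fp4_code * block_scale * global_scale."""
    tensor_amax = max(abs(v) for v in values)
    # Level 2: one scale for the whole tensor.
    global_scale = tensor_amax / (FP4_MAX * E4M3_MAX) if tensor_amax > 0.0 else 1.0
    block_scales = []
    for start in range(0, len(values), BLOCK_SIZE):
        block_amax = max(abs(v) for v in values[start:start + BLOCK_SIZE])
        # Level 1: one scale per 16 values (would be stored in E4M3; rounding omitted here).
        block_scales.append(block_amax / (FP4_MAX * global_scale))
    return global_scale, block_scales


global_scale, block_scales = nvfp4_scales([0.5] * 16 + [3.0] * 16)
print(global_scale, block_scales)
```

The global factor keeps every block scale within the range that E4M3 can represent, while the block scales adapt to local magnitudes.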

Since the scaling happens in a particular direction (either rowwise or columnwise), in this recipe the quantized tensor and its transpose are not numerically equivalent. Due to this, when Transformer Engine needs both the tensor and its transpose (e.g. to calculate both forward and backward pass), during the quantization both versions are computed from the high precision input to avoid double quantization errors.

The default NVFP4 training recipe implements 3 techniques for quantizing to a narrow format (4-bit):

- For weight tensors, a variant of the NVFP4 quantization is used, where a single scaling factor is shared by a 2D block of 16x16 elements.
- When quantizing gradients, stochastic rounding is applied to avoid the bias introduced by quantization. With this, values are rounded probabilistically to one of their two nearest representable numbers, with probabilities inversely proportional to their distances.
- When quantizing inputs and gradients, random Hadamard transforms are applied (16x16 Hadamard matrix) to smooth outliers in the tensor distributions and make them easier to represent accurately in NVFP4.
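
The stochastic rounding technique described above can be sketched as follows, using the E2M1 grid of representable magnitudes. This is an illustration of the rounding rule only, assuming the value has already been scaled into the FP4 range; the function name is hypothetical and the library's kernels are not implemented this way in Python.

```python
import random

# Representable E2M1 magnitudes.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def stochastic_round(x: float, rng: random.Random) -> float:
    """Round x to one of its two neighbouring grid values.

    The upper neighbour is chosen with probability (x - lo) / (hi - lo), so the
    closer neighbour is more likely and the expected result equals x.
    """
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E2M1_GRID[-1])          # saturate at the largest magnitude
    for lo, hi in zip(E2M1_GRID, E2M1_GRID[1:]):
        if lo <= mag <= hi:
            p_up = (mag - lo) / (hi - lo)
            return sign * (hi if rng.random() < p_up else lo)
    raise AssertionError("unreachable: mag is within the grid range")


rng = random.Random(0)
samples = [stochastic_round(2.4, rng) for _ in range(20000)]
print(sum(samples) / len(samples))   # close to 2.4, unlike round-to-nearest, which always gives 2.0
```

Averaged over many gradient elements, the rounding errors cancel instead of accumulating in one direction.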

These techniques are described more comprehensively in the NVFP4 paper titled ‘Pretraining Large Language Models with NVFP4’ (https://arxiv.org/abs/2509.25149v1).
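
The outlier-smoothing effect of the random Hadamard transform mentioned above can be illustrated with a small sketch. It assumes a Sylvester-constructed 16x16 Hadamard matrix combined with a random sign vector and an orthonormal 1/sqrt(n) normalization; how Transformer Engine constructs and applies the transform internally may differ, and the function names are hypothetical.

```python
import random
from typing import List


def hadamard(n: int) -> List[List[int]]:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def random_hadamard_transform(vec: List[float], rng: random.Random) -> List[float]:
    """Apply y = H @ (signs * vec) / sqrt(n) with random +/-1 signs."""
    n = len(vec)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    h = hadamard(n)
    norm = n ** 0.5
    return [sum(h[i][j] * signs[j] * vec[j] for j in range(n)) / norm for i in range(n)]


rng = random.Random(0)
outlier = [0.0] * 15 + [16.0]          # a single large outlier
smoothed = random_hadamard_transform(outlier, rng)
print(max(abs(v) for v in outlier), max(abs(v) for v in smoothed))
```

The transform is orthogonal, so it preserves the vector's energy while spreading the outlier evenly across all 16 positions, which reduces the dynamic range a block scale has to cover.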

Parameters:

fp4_format ({Format.E2M1}, default = Format.E2M1) – FP4 data type.
disable_rht (bool, default = False) – If set to True, random Hadamard transforms are not applied to any tensor.
disable_stochastic_rounding (bool, default = False) – If set to True, stochastic rounding is disabled during quantization for all tensors.
disable_2d_quantization (bool, default = False) – If set to True, 1D block scaling with block size 16 is used for all tensors.
row_scaled_activation (bool, default = False) – If set to True, forward activation quantizers emit row-scaled NVFP4 tensors. In this mode, rowwise amax metadata is stored as a vector with one FP32 value per tensor row.
nvfp4_4over6 ({'none', 'weights', 'activations', 'all'}, default = 'none') – Enable 4over6 adaptive NVFP4 block scaling for selected tensor scopes. For each selected FP4 block, quantization compares map-to-4 and map-to-6 candidates and stores the candidate with lower configured error. Current 4over6 support targets RL and post-training scenarios; pre-training paths that combine 4over6 with RHT are not yet implemented.
nvfp4_4over6_e4m3_use_256 ({'none', 'weights', 'activations', 'all'}, default = 'all') – Select 4over6 tensors that use 256 as the global E4M3 scale bound. By default, all 4over6 tensors use 256. Use 'none' to keep the standard NVFP4 448 bound for 4over6 tensors.
nvfp4_4over6_err_mode ({'MAE', 'MSE'}, default = 'MAE') – Error metric used by NVFP4 4over6 candidate selection.
backward_override ({None, 'high_precision', 'dequantized'}, default = None) – Backward precision mode. None does not modify backward behavior, high_precision keeps original high-precision operands for backward, and dequantized dequantizes saved operands to the active high-precision compute dtype (e.g. BF16/FP16/FP32) for backward.
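
The candidate selection behind nvfp4_4over6 can be sketched as below. This reflects one reading of the parameter description, namely that the block amax is mapped either to the FP4 value 6 or to the FP4 value 4, and that the candidate with the lower MAE or MSE is kept. Round-to-nearest is used, storage of the block scale in E4M3 is omitted, and all function names are hypothetical.

```python
import math
from typing import List, Tuple

# Representable E2M1 magnitudes.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_dequantize(block: List[float], target: float) -> List[float]:
    """Scale the block so its amax maps to `target`, round to the grid, scale back."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / target
    out = []
    for v in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(math.copysign(nearest * scale, v))
    return out


def block_error(block: List[float], dequantized: List[float], mode: str) -> float:
    diffs = [abs(a - b) for a, b in zip(block, dequantized)]
    if mode == "MAE":
        return sum(diffs) / len(diffs)
    if mode == "MSE":
        return sum(d * d for d in diffs) / len(diffs)
    raise ValueError(f"unknown error mode: {mode}")


def four_over_six(block: List[float], mode: str = "MAE") -> Tuple[float, List[float]]:
    """Return the winning target (6.0 or 4.0) and its dequantized block."""
    candidates = {t: quantize_dequantize(block, t) for t in (6.0, 4.0)}
    errors = {t: block_error(block, d, mode) for t, d in candidates.items()}
    best = min(errors, key=errors.get)     # ties resolve to map-to-6
    return best, candidates[best]


print(four_over_six([6.0] + [5.0] * 15)[0])        # values near 5 fall in the 4..6 gap
print(four_over_six([6.0, 1.0, 2.0, 3.0] * 4)[0])  # values already on the map-to-6 grid
```

Mapping to 4 helps when many values would otherwise land in the wide gap between the grid points 4 and 6.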

class transformer_engine.common.recipe.Float8CurrentScaling(fp8_format=Format.HYBRID)

Use the per-tensor current scaling factor strategy.

Parameters:

fp8_format ({Format.E4M3, Format.HYBRID}, default = Format.HYBRID) – Controls the FP8 data format used during forward and backward pass.
backward_override ({None, 'high_precision', 'dequantized'}, default = None) – Backward precision mode. None does not modify backward behavior, high_precision keeps original high-precision operands for backward, and dequantized dequantizes saved operands to the active high-precision compute dtype (e.g. BF16/FP16/FP32) for backward.
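
As a minimal sketch of per-tensor current scaling, the example below derives the scale from the tensor being quantized rather than from an amax history. It assumes the scale is FP8_MAX divided by the tensor's amax and omits the rounding to FP8 element values; the function name is hypothetical.

```python
from typing import List, Tuple


def current_scaling_quantize(values: List[float],
                             fp8_max: float = 448.0) -> Tuple[List[float], float]:
    """Scale a tensor by a single factor computed from its own amax."""
    amax = max(abs(v) for v in values)
    scale = fp8_max / amax if amax > 0.0 else 1.0
    # Clamp guards against overflow from floating-point rounding of the scale.
    scaled = [max(-fp8_max, min(fp8_max, v * scale)) for v in values]
    return scaled, scale


scaled, scale = current_scaling_quantize([0.5, -2.0, 1.0])
print(scaled, scale)   # the largest magnitude lands exactly on fp8_max
```

Compared with delayed scaling, this costs an extra pass over the tensor to find amax, but the scale always matches the data actually being quantized.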

class transformer_engine.common.recipe.Float8BlockScaling(fp8_format=Format.E4M3)

Use block-wise scaling for FP8 tensors.

In this strategy, tensors are scaled in blockwise fashion. Values within each block share a common scaling factor. The block dimensionality can be configured. The scaling factors are stored as float32 values and, by default, are constrained to powers of 2.

Since the scaling happens in a particular direction (either rowwise or columnwise), the quantized tensor and its transpose are not numerically equivalent. Due to this, when Transformer Engine needs both the FP8 tensor and its transpose (e.g. to calculate both forward and backward pass), during the quantization both versions are computed from the high precision input to avoid double quantization errors.

NOTE: To relax the default constraint that scales be powers of 2, set the environment variable NVTE_FP8_BLOCK_SCALING_FP32_SCALES=1, which overrides the recipe defaults.

Parameters:

fp8_format ({Format.E4M3, Format.HYBRID}, default = Format.E4M3) – Controls the FP8 data format used during forward and backward pass.
backward_override ({None, 'high_precision', 'dequantized'}, default = None) – Backward precision mode. None does not modify backward behavior, high_precision keeps original high-precision operands for backward, and dequantized dequantizes saved operands to the active high-precision compute dtype (e.g. BF16/FP16/FP32) for backward.
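
The power-of-2 constraint on block scales described for this recipe can be sketched as follows. The sketch assumes the unconstrained scale is FP8_MAX divided by the block amax and that the constrained scale is rounded down to a power of two so that scaled values cannot overflow; the rounding direction and the helper name are assumptions of this example.

```python
import math


def block_scale(amax: float, fp8_max: float = 448.0, power_of_2: bool = True) -> float:
    """Scale for one block, optionally constrained to a power of two."""
    if amax == 0.0:
        return 1.0
    scale = fp8_max / amax
    if power_of_2:
        _, exponent = math.frexp(scale)    # scale == m * 2**exponent, 0.5 <= m < 1
        scale = 2.0 ** (exponent - 1)      # largest power of two <= scale
    return scale


print(block_scale(3.0))                     # constrained to a power of two
print(block_scale(3.0, power_of_2=False))   # unconstrained float32-style scale
```

A power-of-two scale only shifts the exponent of each value, so applying and removing it introduces no additional rounding error, at the cost of using slightly less of the FP8 range.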

class transformer_engine.common.recipe.CustomRecipe(qfactory, fp8_dpa=False, fp8_mha=False)

Custom recipe that allows users to provide quantizer factories.

Warning

EXPERIMENTAL: Custom recipe is experimental, still under active development, and the API is subject to change without notice. Use at your own risk.

Parameters:

qfactory (Callable) –
Factory callable that returns a quantizer instance or a QuantizerRequest subclass for a given QuantizerRole. The callable is invoked as:
```
qfactory(
    role: QuantizerRole,
) -> Union[Quantizer, QuantizerRequest]
```
QuantizerRole is a frozen dataclass with the following fields:
- module_type (str): module type (empty string when not set), e.g. "linear", "grouped_linear", "dpa".
- tensor_type (str): what tensor is being quantized (empty string when not set), e.g. "input", "weight", "grad_output".
- name (str): caller-provided module instance name (empty string when not set), e.g. "qkv", "proj", "fc1", "fc2".
For stateful quantizers (delayed scaling), return a DelayedScalingRequest dataclass instead of a quantizer. TE will allocate shared scale/amax_history buffers and create Float8Quantizer instances integrated with the existing delayed-scaling reduction infrastructure.

See transformer_engine.pytorch.quantization.QuantizerRole and transformer_engine.pytorch.quantization.DelayedScalingRequest for full documentation.

backward_override ({None, 'high_precision', 'dequantized'}, default = None) – Backward precision mode. None does not modify backward behavior, high_precision keeps original high-precision operands for backward, and dequantized dequantizes saved operands to the active high-precision compute dtype (e.g. BF16/FP16/FP32) for backward.
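
The shape of a qfactory callable can be sketched with stand-in types. `QuantizerRoleStub` mirrors the three documented QuantizerRole fields, and `QuantizerStub` is a placeholder for a real quantizer instance; both are hypothetical classes defined here so the example runs without Transformer Engine. The dispatch policy is illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantizerRoleStub:
    """Hypothetical stand-in mirroring the documented QuantizerRole fields."""
    module_type: str = ""
    tensor_type: str = ""
    name: str = ""


@dataclass(frozen=True)
class QuantizerStub:
    """Hypothetical placeholder for a real quantizer instance."""
    kind: str


def qfactory(role: QuantizerRoleStub) -> QuantizerStub:
    """Choose a quantizer per role (illustrative policy, not a recommendation)."""
    if role.tensor_type == "grad_output":
        return QuantizerStub("fp8_e5m2")       # wider range for gradients
    if role.module_type == "linear" and role.name in ("fc1", "fc2"):
        return QuantizerStub("mxfp8")          # block scaling for the MLP layers
    return QuantizerStub("fp8_e4m3")           # default for everything else


print(qfactory(QuantizerRoleStub("linear", "weight", "fc1")))
print(qfactory(QuantizerRoleStub("linear", "grad_output", "qkv")))
```

Because every field defaults to an empty string when not set, a factory should always end with a fallback branch rather than assume the role is fully populated.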