export

ONNX export utilities.

Classes

FP8QuantExporter: Exporter for FP8 quantization.
INT4QuantExporter: Exporter for INT4 quantization.
INT8QuantExporter: Exporter for INT8 quantization.
MXFP8QuantExporter: Exporter for MXFP8 quantization.
NVFP4QuantExporter: Exporter for NVFP4 quantization.
ONNXQuantExporter: Base class for ONNX quantizer exporters.

class FP8QuantExporter

Bases: ONNXQuantExporter

Exporter for FP8 quantization.

static compress_weights(onnx_model)

Compresses the weights in the ONNX model for FP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static compute_scales(onnx_model)

Computes the scales for the weights in the ONNX model for FP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static post_process(onnx_model)

Post-processes the ONNX model for FP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static pre_process(onnx_model)

Pre-processes the ONNX model for FP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

class INT4QuantExporter

Bases: ONNXQuantExporter

Exporter for INT4 quantization.

static compress_weights(onnx_model)

Compresses the weights in the ONNX model for INT4 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static compute_scales(onnx_model)

Computes the scales for the weights in the ONNX model for INT4 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static post_process(onnx_model)

Post-processes the ONNX model for INT4 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static pre_process(onnx_model)

Pre-processes the ONNX model for INT4 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

class INT8QuantExporter

Bases: ONNXQuantExporter

Exporter for INT8 quantization.

static compress_weights(onnx_model)

Compresses the weights in the ONNX model for INT8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static compute_scales(onnx_model)

Computes the scales for the weights in the ONNX model for INT8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static post_process(onnx_model)

Post-processes the ONNX model for INT8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static pre_process(onnx_model)

Pre-processes the ONNX model for INT8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

class MXFP8QuantExporter

Bases: ONNXQuantExporter

Exporter for MXFP8 quantization.

static compress_weights(onnx_model)

Compresses the weights in the ONNX model to FP8 format for MXFP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static compute_scales(onnx_model)

Computes the e8m0 scales for weights in the ONNX model for MXFP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto
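MXFP8 shares one power-of-two (e8m0) scale per block of elements. As a minimal numpy sketch of how such scales could be derived, assuming the OCP MX convention of 32-element blocks and FP8 E4M3 elements (emax = 8); `e8m0_block_scales` is an illustrative helper, not part of this module's API:

```python
import numpy as np

def e8m0_block_scales(weights, block_size=32):
    """Illustrative per-block power-of-two scales (e8m0-style).

    Assumes the OCP MX convention: shared scale = 2^(floor(log2(amax)) - emax),
    with emax = 8 for FP8 E4M3 elements. Hypothetical, not the library's code.
    """
    blocks = weights.reshape(-1, block_size)
    amax = np.abs(blocks).max(axis=1)
    # Guard against all-zero blocks before taking the log.
    exponent = np.floor(np.log2(np.maximum(amax, 2.0**-127))) - 8
    return 2.0 ** exponent
```

Because each scale is an exact power of two, it can be stored as a single 8-bit exponent (e8m0) with no mantissa.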

static post_process(onnx_model)

Post-processes the ONNX model for MXFP8 quantization.

Sets DQ output type to FP16 and updates GELU nodes to use tanh approximation.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static pre_process(onnx_model)

Pre-processes the ONNX model for MXFP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

class NVFP4QuantExporter

Bases: ONNXQuantExporter

Exporter for NVFP4 quantization.

Converts the FP32/FP16 weights of an ONNX model to FP4 weights and scaling factors. TRT_FP4QDQ nodes are removed from the graph and replaced with two DQ nodes that carry the converted FP4 weights and scaling factors.

static compress_weights(onnx_model)

Compresses the weights in the ONNX model for NVFP4 quantization.

Converts weights to FP4 format and scales to FP8 format.

Parameters: onnx_model (ModelProto)
Return type: ModelProto
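FP4 in the E2M1 format can represent only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}. A hedged numpy sketch of block-wise compression in that spirit, assuming 16-element blocks scaled so each block's amax maps to the E2M1 maximum of 6.0 (the actual exporter's block size and rounding may differ):

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1; signs double the set.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blocks(weights, block_size=16):
    """Round each block to the nearest E2M1 value after per-block scaling.

    Illustrative only: returns the dequantized values and per-block scales.
    """
    blocks = weights.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 6.0
    scale = np.where(scale == 0.0, 1.0, scale)  # avoid divide-by-zero
    scaled = blocks / scale
    nearest = np.abs(np.abs(scaled)[..., None] - E2M1_VALUES).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_VALUES[nearest]
    return (q * scale).reshape(weights.shape), scale
```

Per the description above, the real exporter additionally stores the block scales in FP8 rather than full precision.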

static compute_scales(onnx_model)

Computes the scales for the weights in the ONNX model for NVFP4 quantization.

Stores computed scales as node attributes for use in compress_weights.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static post_process(onnx_model)

Post-processes the ONNX model for NVFP4 quantization.

Replaces TRT_FP4QDQ nodes with two DequantizeLinear nodes and handles precision casting for inputs.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static pre_process(onnx_model)

Pre-processes the ONNX model for NVFP4 quantization.

This is a no-op for NVFP4 quantization as no pre-processing is needed.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

class ONNXQuantExporter

Bases: ABC

Base class for ONNX quantizer exporters.

abstract static compress_weights(onnx_model)

Compresses the weights in the ONNX model.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

abstract static compute_scales(onnx_model)

Computes the scales for the weights in the ONNX model.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

abstract static post_process(onnx_model)

Post-processes the ONNX model.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

abstract static pre_process(onnx_model)

Pre-processes the ONNX model. Converts all DQ -> * -> op patterns to DQ -> op.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

classmethod process_model(onnx_model)

Processes the ONNX model.

Parameters: onnx_model (ModelProto)
Return type: ModelProto
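process_model presumably chains the four abstract stages in the order listed above: pre_process, then compute_scales, then compress_weights, then post_process. A minimal sketch of that pipeline shape, with illustrative names rather than the actual modelopt classes:

```python
from abc import ABC, abstractmethod

class QuantExporterSketch(ABC):
    """Hypothetical mirror of the ONNXQuantExporter pipeline."""

    @staticmethod
    @abstractmethod
    def pre_process(model): ...

    @staticmethod
    @abstractmethod
    def compute_scales(model): ...

    @staticmethod
    @abstractmethod
    def compress_weights(model): ...

    @staticmethod
    @abstractmethod
    def post_process(model): ...

    @classmethod
    def process_model(cls, model):
        # Each stage takes and returns a ModelProto in the real API,
        # so the stages compose by simple chaining.
        model = cls.pre_process(model)
        model = cls.compute_scales(model)
        model = cls.compress_weights(model)
        return cls.post_process(model)
```

Because every stage shares the `ModelProto -> ModelProto` signature, each format-specific exporter only overrides the four static methods and inherits the orchestration unchanged.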