export

ONNX export utilities.

Classes

FP8QuantExporter: Exporter for FP8 quantization.
INT4QuantExporter: Exporter for INT4 quantization.
INT8QuantExporter: Exporter for INT8 quantization.
MXFP8QuantExporter: Exporter for MXFP8 quantization.
NVFP4QuantExporter: Exporter for NVFP4 quantization.
ONNXQuantExporter: Base class for ONNX quantizer exporters.

class FP8QuantExporter

Bases: ONNXQuantExporter

Exporter for FP8 quantization.

static compress_weights(onnx_model)

Compresses the weights in the ONNX model for FP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static compute_scales(onnx_model)

Computes the scales for the weights in the ONNX model for FP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static post_process(onnx_model)

Post-processes the ONNX model for FP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static pre_process(onnx_model)

Pre-processes the ONNX model for FP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

class INT4QuantExporter

Bases: ONNXQuantExporter

Exporter for INT4 quantization.

static compress_weights(onnx_model)

Compresses the weights in the ONNX model for INT4 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static compute_scales(onnx_model)

Computes the scales for the weights in the ONNX model for INT4 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static post_process(onnx_model)

Post-processes the ONNX model for INT4 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static pre_process(onnx_model)

Pre-processes the ONNX model for INT4 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

class INT8QuantExporter

Bases: ONNXQuantExporter

Exporter for INT8 quantization.

static compress_weights(onnx_model)

Compresses the weights in the ONNX model for INT8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static compute_scales(onnx_model)

Computes the scales for the weights in the ONNX model for INT8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static post_process(onnx_model)

Post-processes the ONNX model for INT8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static pre_process(onnx_model)

Pre-processes the ONNX model for INT8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

class MXFP8QuantExporter

Bases: ONNXQuantExporter

Exporter for MXFP8 quantization.

static compress_weights(onnx_model)

Compresses the weights in the ONNX model to FP8 format for MXFP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static compute_scales(onnx_model)

Computes the e8m0 scales for weights in the ONNX model for MXFP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto
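MXFP8 shares one power-of-two (e8m0) scale per block of elements. As a minimal numpy sketch of how such scales could be derived, assuming the OCP MX convention of 32-element blocks and FP8 E4M3 elements (emax = 8); `e8m0_block_scales` is an illustrative helper, not part of this module's API:

```python
import numpy as np

def e8m0_block_scales(weights, block_size=32):
    """Illustrative per-block power-of-two scales (e8m0-style).

    Assumes the OCP MX convention: shared scale = 2^(floor(log2(amax)) - emax),
    with emax = 8 for FP8 E4M3 elements. Hypothetical, not the library's code.
    """
    blocks = weights.reshape(-1, block_size)
    amax = np.abs(blocks).max(axis=1)
    # Guard against all-zero blocks before taking the log.
    exponent = np.floor(np.log2(np.maximum(amax, 2.0**-127))) - 8
    return 2.0 ** exponent
```

Because each scale is an exact power of two, it can be stored as a single 8-bit exponent (e8m0) with no mantissa.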

static post_process(onnx_model)

Post-processes the ONNX model for MXFP8 quantization.

Sets DQ output type to FP16 and updates GELU nodes to use tanh approximation.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static pre_process(onnx_model)

Pre-processes the ONNX model for MXFP8 quantization.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

class NVFP4QuantExporter

Bases: ONNXQuantExporter

Exporter for NVFP4 quantization.

Converts the FP32/FP16 weights of an ONNX model to FP4 weights and scaling factors. TRT_FP4QDQ nodes are removed from the graph and replaced with two DQ nodes that carry the converted FP4 weights and scaling factors.

static compress_weights(onnx_model)

Compresses the weights in the ONNX model for NVFP4 quantization.

Converts weights to FP4 format and scales to FP8 format.

Parameters: onnx_model (ModelProto)
Return type: ModelProto
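FP4 in the E2M1 format can represent only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}. A hedged numpy sketch of block-wise compression in that spirit, assuming 16-element blocks scaled so each block's amax maps to the E2M1 maximum of 6.0 (the actual exporter's block size and rounding may differ):

```python
import numpy as np

# Non-negative magnitudes representable in FP4 E2M1; signs double the set.
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_blocks(weights, block_size=16):
    """Round each block to the nearest E2M1 value after per-block scaling.

    Illustrative only: returns the dequantized values and per-block scales.
    """
    blocks = weights.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 6.0
    scale = np.where(scale == 0.0, 1.0, scale)  # avoid divide-by-zero
    scaled = blocks / scale
    nearest = np.abs(np.abs(scaled)[..., None] - E2M1_VALUES).argmin(axis=-1)
    q = np.sign(scaled) * E2M1_VALUES[nearest]
    return (q * scale).reshape(weights.shape), scale
```

Per the description above, the real exporter additionally stores the block scales in FP8 rather than full precision.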

static compute_scales(onnx_model)

Computes the scales for the weights in the ONNX model for NVFP4 quantization.

Stores computed scales as node attributes for use in compress_weights.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static post_process(onnx_model)

Post-processes the ONNX model for NVFP4 quantization.

Replaces TRT_FP4QDQ nodes with two DequantizeLinear nodes and handles precision casting for inputs.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

static pre_process(onnx_model)

Pre-processes the ONNX model for NVFP4 quantization.

This is a no-op for NVFP4 quantization as no pre-processing is needed.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

class ONNXQuantExporter

Bases: ABC

Base class for ONNX quantizer exporters.

abstract static compress_weights(onnx_model)

Compresses the weights in the ONNX model.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

abstract static compute_scales(onnx_model)

Computes the scales for the weights in the ONNX model.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

abstract static post_process(onnx_model)

Post-processes the ONNX model.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

abstract static pre_process(onnx_model)

Pre-processes the ONNX model. Converts all DQ -> * -> op patterns to DQ -> op.

Parameters: onnx_model (ModelProto)
Return type: ModelProto

classmethod process_model(onnx_model)

Processes the ONNX model.

Parameters: onnx_model (ModelProto)
Return type: ModelProto
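process_model presumably chains the four abstract stages in the order listed above: pre_process, then compute_scales, then compress_weights, then post_process. A minimal sketch of that pipeline shape, with illustrative names rather than the actual modelopt classes:

```python
from abc import ABC, abstractmethod

class QuantExporterSketch(ABC):
    """Hypothetical mirror of the ONNXQuantExporter pipeline."""

    @staticmethod
    @abstractmethod
    def pre_process(model): ...

    @staticmethod
    @abstractmethod
    def compute_scales(model): ...

    @staticmethod
    @abstractmethod
    def compress_weights(model): ...

    @staticmethod
    @abstractmethod
    def post_process(model): ...

    @classmethod
    def process_model(cls, model):
        # Each stage takes and returns a ModelProto in the real API,
        # so the stages compose by simple chaining.
        model = cls.pre_process(model)
        model = cls.compute_scales(model)
        model = cls.compress_weights(model)
        return cls.post_process(model)
```

Because every stage shares the `ModelProto -> ModelProto` signature, each format-specific exporter only overrides the four static methods and inherits the orchestration unchanged.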