export
ONNX export utilities.
Classes

- FP8QuantExporter: Exporter for FP8 quantization.
- INT4QuantExporter: Exporter for INT4 quantization.
- INT8QuantExporter: Exporter for INT8 quantization.
- MXFP8QuantExporter: Exporter for MXFP8 quantization.
- NVFP4QuantExporter: Exporter for NVFP4 quantization.
- ONNXQuantExporter: Base class for ONNX quantizer exporters.
- class FP8QuantExporter
Bases: ONNXQuantExporter

Exporter for FP8 quantization.
- static compress_weights(onnx_model)
Compresses the weights in the ONNX model for FP8 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static compute_scales(onnx_model)
Computes the scales for the weights in the ONNX model for FP8 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static post_process(onnx_model)
Post-processes the ONNX model for FP8 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static pre_process(onnx_model)
Pre-processes the ONNX model for FP8 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
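As a rough illustration of the two core stages above, the sketch below shows what per-tensor scale computation and weight compression might look like for FP8 E4M3 (448.0 is the largest finite E4M3 value). The function names and the simple symmetric per-tensor scheme are assumptions for illustration, not the actual FP8QuantExporter implementation:

```python
# Hedged sketch: per-tensor FP8 (E4M3) scaling. Illustrative only,
# NOT the actual FP8QuantExporter code.
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def compute_fp8_scale(weights):
    """Per-tensor scale mapping the weight range onto [-448, 448]."""
    amax = max(abs(w) for w in weights)
    return amax / E4M3_MAX if amax > 0 else 1.0

def compress_fp8(weights, scale):
    """Divide by the scale and clamp into the FP8-representable range."""
    return [max(-E4M3_MAX, min(E4M3_MAX, w / scale)) for w in weights]

w = [0.5, -1.2, 3.4]
s = compute_fp8_scale(w)   # 3.4 / 448.0
q = compress_fp8(w, s)     # scaled values, all within [-448, 448]
```

In the real exporter these stages operate on initializers inside a ModelProto rather than on Python lists, and the clamped values would then be cast to an FP8 tensor type.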
- class INT4QuantExporter
Bases: ONNXQuantExporter

Exporter for INT4 quantization.
- static compress_weights(onnx_model)
Compresses the weights in the ONNX model for INT4 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static compute_scales(onnx_model)
Computes the scales for the weights in the ONNX model for INT4 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static post_process(onnx_model)
Post-processes the ONNX model for INT4 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static pre_process(onnx_model)
Pre-processes the ONNX model for INT4 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
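INT4 weight quantization is typically block-wise and symmetric, with one scale per block of consecutive weights. The following is a minimal sketch of that idea; the block size, helper names, and amax/7 scaling rule are assumptions for illustration, not the actual INT4QuantExporter code:

```python
# Hedged sketch: block-wise symmetric INT4 quantization.
# Signed INT4 covers integers in [-8, 7]; a symmetric scheme
# commonly uses amax / 7 as the per-block scale.
BLOCK = 4  # illustrative block size; real exporters often use 32-128

def int4_block_scales(weights):
    """One scale per block, chosen so the block amax maps to 7."""
    scales = []
    for i in range(0, len(weights), BLOCK):
        block = weights[i:i + BLOCK]
        amax = max(abs(w) for w in block)
        scales.append(amax / 7.0 if amax > 0 else 1.0)
    return scales

def int4_compress(weights, scales):
    """Round each weight to its block's INT4 grid and clamp."""
    return [
        max(-8, min(7, round(w / scales[i // BLOCK])))
        for i, w in enumerate(weights)
    ]

w = [0.1, -0.7, 0.35, 0.2, 2.0, -1.0, 0.5, 4.0]
scales = int4_block_scales(w)  # two blocks -> two scales
q = int4_compress(w, scales)   # integers in [-8, 7]
```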
- class INT8QuantExporter
Bases: ONNXQuantExporter

Exporter for INT8 quantization.
- static compress_weights(onnx_model)
Compresses the weights in the ONNX model for INT8 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static compute_scales(onnx_model)
Computes the scales for the weights in the ONNX model for INT8 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static post_process(onnx_model)
Post-processes the ONNX model for INT8 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static pre_process(onnx_model)
Pre-processes the ONNX model for INT8 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
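For INT8, weight scales are commonly computed per output channel rather than per tensor. A minimal sketch of that scheme follows; the helper name and the amax/127 rule are illustrative assumptions, not the actual INT8QuantExporter code:

```python
# Hedged sketch: per-channel symmetric INT8 quantization.
# Each output channel (row) gets its own scale amax / 127.
def int8_per_channel(weight_rows):
    """Quantize each row to INT8 with its own symmetric scale."""
    quantized, scales = [], []
    for row in weight_rows:
        amax = max(abs(w) for w in row)
        s = amax / 127.0 if amax > 0 else 1.0
        scales.append(s)
        quantized.append([max(-128, min(127, round(w / s))) for w in row])
    return quantized, scales

w = [[0.5, -1.0], [2.0, 0.25]]
q, s = int8_per_channel(w)  # two rows -> two scales
```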
- class MXFP8QuantExporter
Bases: ONNXQuantExporter

Exporter for MXFP8 quantization.
- static compress_weights(onnx_model)
Compresses the weights in the ONNX model to FP8 format for MXFP8 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static compute_scales(onnx_model)
Computes the e8m0 scales for weights in the ONNX model for MXFP8 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static post_process(onnx_model)
Post-processes the ONNX model for MXFP8 quantization.
Sets DQ output type to FP16 and updates GELU nodes to use tanh approximation.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static pre_process(onnx_model)
Pre-processes the ONNX model for MXFP8 quantization.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
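The "e8m0" scales mentioned under compute_scales are pure powers of two: an 8-bit exponent with no mantissa, shared per block of elements. The sketch below shows one plausible way such a scale could be chosen; the exact rounding rule and block handling in MXFP8QuantExporter may differ:

```python
# Hedged sketch: choosing an e8m0 (power-of-two) block scale for
# MXFP8. Illustrative only, not the actual MXFP8QuantExporter code.
import math

E4M3_MAX_EXP = 8  # 2**8 = 256 is safely within the E4M3 range (max 448)

def e8m0_scale(block):
    """Power-of-two scale so the scaled block amax fits in FP8 E4M3."""
    amax = max(abs(w) for w in block)
    if amax == 0:
        return 1.0
    # ceil guarantees amax / scale <= 2**E4M3_MAX_EXP.
    exp = math.ceil(math.log2(amax)) - E4M3_MAX_EXP
    return 2.0 ** exp

block = [0.5, -300.0, 10.0]
s = e8m0_scale(block)  # a power of two; scaled amax stays <= 256
```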
- class NVFP4QuantExporter
Bases: ONNXQuantExporter

Exporter for NVFP4 quantization.

Converts FP32/FP16 weights of an ONNX model to FP4 weights and scaling factors. TRT_FP4QDQ nodes are removed from the graph and replaced with two DQ nodes carrying the converted FP4 weights and scaling factors.
- static compress_weights(onnx_model)
Compresses the weights in the ONNX model for NVFP4 quantization.
Converts weights to FP4 format and scales to FP8 format.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static compute_scales(onnx_model)
Computes the scales for the weights in the ONNX model for NVFP4 quantization.
Stores computed scales as node attributes for use in compress_weights.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static post_process(onnx_model)
Post-processes the ONNX model for NVFP4 quantization.
Replaces TRT_FP4QDQ nodes with two DequantizeLinear nodes and handles precision casting for inputs.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- static pre_process(onnx_model)
Pre-processes the ONNX model for NVFP4 quantization.
This is a no-op for NVFP4 quantization as no pre-processing is needed.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
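FP4 here means the E2M1 format, which can represent only eight magnitudes per sign. The sketch below rounds values to that grid to illustrate the "convert weights to FP4" step; the helper name is an assumption and the real NVFP4QuantExporter additionally packs two FP4 values per byte and emits FP8 block scales:

```python
# Hedged sketch: rounding to the FP4 (E2M1) value grid.
# Illustrative only, not the actual NVFP4QuantExporter code.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # E2M1 magnitudes

def to_fp4(x):
    """Round x to the nearest representable FP4 (E2M1) value."""
    sign = -1.0 if x < 0 else 1.0
    mag = min(FP4_VALUES, key=lambda v: abs(v - abs(x)))
    return sign * mag

vals = [0.3, -2.4, 5.9]
q = [to_fp4(v) for v in vals]  # [0.5, -2.0, 6.0]
```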
- class ONNXQuantExporter
Bases: ABC

Base class for ONNX quantizer exporters.
- abstract static compress_weights(onnx_model)
Compresses the weights in the ONNX model.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- abstract static compute_scales(onnx_model)
Computes the scales for the weights in the ONNX model.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- abstract static post_process(onnx_model)
Post-processes the ONNX model.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- abstract static pre_process(onnx_model)
Pre-processes the ONNX model. Converts all DQ -> * -> op patterns to DQ -> op.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
- classmethod process_model(onnx_model)
Processes the ONNX model end to end using the exporter's pre_process, compute_scales, compress_weights, and post_process stages.
- Parameters:
onnx_model (ModelProto)
- Return type:
ModelProto
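The four abstract static methods plus the concrete process_model classmethod form a template-method pattern: subclasses supply the stages, and the base class fixes their order. The sketch below is a self-contained reconstruction of that shape; the stage order is an assumption inferred from the method names, not taken from the actual implementation:

```python
# Hedged sketch of the template-method pattern suggested by
# ONNXQuantExporter.process_model. Not the modelopt source.
from abc import ABC, abstractmethod

class QuantExporter(ABC):
    @staticmethod
    @abstractmethod
    def pre_process(model): ...

    @staticmethod
    @abstractmethod
    def compute_scales(model): ...

    @staticmethod
    @abstractmethod
    def compress_weights(model): ...

    @staticmethod
    @abstractmethod
    def post_process(model): ...

    @classmethod
    def process_model(cls, model):
        # Assumed pipeline order, inferred from the method names.
        model = cls.pre_process(model)
        model = cls.compute_scales(model)
        model = cls.compress_weights(model)
        return cls.post_process(model)

class TraceExporter(QuantExporter):
    """Toy subclass that records the stage order instead of editing a model."""
    @staticmethod
    def pre_process(model):
        return model + ["pre"]

    @staticmethod
    def compute_scales(model):
        return model + ["scales"]

    @staticmethod
    def compress_weights(model):
        return model + ["compress"]

    @staticmethod
    def post_process(model):
        return model + ["post"]

trace = TraceExporter.process_model([])
```

In the real API each stage takes and returns an onnx ModelProto, so the same chained structure applies with graph transformations in place of list appends.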