fp8

Performs FP8 GEMM-only quantization of an ONNX model and returns the quantized model as an ONNX ModelProto.

Functions

`int8_to_fp8`	Converts an INT8-quantized model to an FP8-quantized model.
`quantize`	Applies FP8 GEMM-only quantization to an ONNX file.
`upgrade_opset_21`	Modifies the ONNX graph so that it meets the opset 21 requirements.

int8_to_fp8(onnx_path)

Converts an INT8-quantized model to an FP8-quantized model.

Note: This conversion works only for max-calibrated INT8 models.

Parameters: onnx_path (str) – Path to the INT8-quantized ONNX model.
Returns: FP8-quantized ONNX model.
Return type: ModelProto
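
A minimal usage sketch for the conversion above. It assumes the module is importable as `modelopt.onnx.quantization.fp8` (this page does not state the package path) and that the `onnx` package is installed. `default_fp8_path` and `convert_int8_model` are hypothetical helper names introduced only for illustration.

```python
from pathlib import Path
from typing import Optional


def default_fp8_path(int8_path: str) -> str:
    """Hypothetical naming helper: insert '.fp8' before the file suffix."""
    p = Path(int8_path)
    return str(p.with_name(p.stem + ".fp8" + p.suffix))


def convert_int8_model(int8_path: str, fp8_path: Optional[str] = None) -> str:
    """Convert a max-calibrated INT8 ONNX model to FP8 and save it to disk."""
    # Deferred imports: both packages are optional, and the module path
    # below is an assumption rather than something this page confirms.
    import onnx
    from modelopt.onnx.quantization.fp8 import int8_to_fp8

    fp8_model = int8_to_fp8(int8_path)  # documented to return a ModelProto
    out_path = fp8_path or default_fp8_path(int8_path)
    onnx.save(fp8_model, out_path)
    return out_path
```

Calling `convert_int8_model("resnet_int8.onnx")` would write the result next to the input file. The input must come from max calibration, as the note above requires.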

quantize(onnx_path, calibration_method='max', calibration_data_reader=None, calibration_cache_path=None, calibration_shapes=None, calibration_eps=['cpu', 'cuda:0', 'trt'], op_types_to_quantize=None, op_types_to_exclude=None, nodes_to_quantize=None, nodes_to_exclude=None, use_external_data_format=False, intermediate_generated_files=[], trt_extra_plugin_lib_paths=None, high_precision_dtype='fp16', mha_accumulation_dtype='fp16', passes=['concat_elimination'], log_level='INFO', calibrate_per_node=False, **kwargs)

Applies FP8 GEMM-only quantization to an ONNX file.

Currently, quantization is supported for ['Conv', 'Gemm', 'MatMul', 'Residual-Add'].

Parameters:

onnx_path (str)
calibration_method (str)
calibration_data_reader (CalibrationDataReader)
calibration_cache_path (str | None)
calibration_shapes (str | None)
calibration_eps (list[str])
op_types_to_quantize (list[str] | None)
op_types_to_exclude (list[str] | None)
nodes_to_quantize (list[str] | None)
nodes_to_exclude (list[str] | None)
use_external_data_format (bool)
intermediate_generated_files (list[str])
trt_extra_plugin_lib_paths (list[str] | None)
high_precision_dtype (str)
mha_accumulation_dtype (str)
passes (list[str])
log_level (str)
calibrate_per_node (bool)

Return type:

ModelProto
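
The parameter list above can be exercised with a small wrapper. This is a sketch, not the library's recommended workflow. It assumes the import path `modelopt.onnx.quantization.fp8` and an installed `onnx` package. `build_quantize_kwargs` and `quantize_to_fp8` are hypothetical names. The keyword names and default values are copied from the documented signature. Note that `quantize` itself accepts extra `**kwargs`, so rejecting unknown names here is a deliberately stricter choice meant to catch typos.

```python
from typing import Any, Dict, List, Optional

# Parameter names copied from the documented signature of quantize().
DOCUMENTED_PARAMS = {
    "onnx_path", "calibration_method", "calibration_data_reader",
    "calibration_cache_path", "calibration_shapes", "calibration_eps",
    "op_types_to_quantize", "op_types_to_exclude", "nodes_to_quantize",
    "nodes_to_exclude", "use_external_data_format",
    "intermediate_generated_files", "trt_extra_plugin_lib_paths",
    "high_precision_dtype", "mha_accumulation_dtype", "passes",
    "log_level", "calibrate_per_node",
}


def build_quantize_kwargs(
    onnx_path: str,
    nodes_to_exclude: Optional[List[str]] = None,
    **overrides: Any,
) -> Dict[str, Any]:
    """Assemble keyword arguments, rejecting names not in the signature."""
    kwargs: Dict[str, Any] = {
        "onnx_path": onnx_path,
        "calibration_method": "max",     # documented default
        "high_precision_dtype": "fp16",  # documented default
        "nodes_to_exclude": nodes_to_exclude,
    }
    kwargs.update(overrides)
    unknown = set(kwargs) - DOCUMENTED_PARAMS
    if unknown:
        raise ValueError(f"unknown quantize() parameters: {sorted(unknown)}")
    return kwargs


def quantize_to_fp8(onnx_path: str, output_path: str, **overrides: Any) -> str:
    """Quantize an ONNX file to FP8 and save the returned ModelProto."""
    # Deferred imports; the module path is an assumption.
    import onnx
    from modelopt.onnx.quantization.fp8 import quantize

    model = quantize(**build_quantize_kwargs(onnx_path, **overrides))
    onnx.save(model, output_path)
    return output_path
```

For example, `quantize_to_fp8("model.onnx", "model.fp8.onnx", nodes_to_exclude=["/head/MatMul"])` would skip one node, where the node name is purely illustrative.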

upgrade_opset_21(onnx_model)

Modifies the ONNX graph so that it meets the opset 21 requirements.

This is necessary for FP8+FP16 quantization since FP8 QuantizeLinear/DequantizeLinear ops do not support FP16 scaling factors until opset 21.

Parameters: onnx_model (ModelProto)
Return type: ModelProto
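
The condition described above (FP16 scaling factors require opset 21) can be written as a small check. This is a sketch of the stated rule only, not the library's internal logic. `default_domain_opset` and `needs_opset_21_upgrade` are hypothetical helpers, and `"fp32"` is used merely as an example of a non-FP16 value.

```python
from typing import Iterable, Tuple

# From the text above: FP8 Q/DQ ops accept FP16 scales only from opset 21.
FP16_SCALE_MIN_OPSET = 21


def default_domain_opset(opset_imports: Iterable[Tuple[str, int]]) -> int:
    """Return the version of the default ONNX domain ('' or 'ai.onnx').

    For an onnx ModelProto `m`, the pairs can be built with
    [(o.domain, o.version) for o in m.opset_import].
    """
    for domain, version in opset_imports:
        if domain in ("", "ai.onnx"):
            return version
    raise ValueError("no default-domain opset import found")


def needs_opset_21_upgrade(opset: int, high_precision_dtype: str) -> bool:
    """True when FP16 scales are requested but the opset predates 21."""
    is_fp16 = high_precision_dtype.lower() == "fp16"
    return is_fp16 and opset < FP16_SCALE_MIN_OPSET
```

A caller could run this check on a model's opset imports before deciding whether to pass the model through `upgrade_opset_21`.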