fp8

Performs FP8 GEMM-only quantization of an ONNX model and returns the ONNX ModelProto.

Functions

int8_to_fp8

Converts an INT8 quantized model to an FP8 quantized model.

quantize

Applies FP8 GEMM-only quantization to an ONNX file.

int8_to_fp8(onnx_path, verbose=False)

Converts an INT8 quantized model to an FP8 quantized model.

Note: This conversion works only for max-calibrated INT8 models.

Parameters:
  • onnx_path (str) – Path to the INT8 quantized ONNX model.

  • verbose (bool) – Whether to print verbose logs or not.

Returns:

FP8 quantized ONNX model.

Return type:

ModelProto
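
A minimal usage sketch, assuming this module is importable as modelopt.onnx.quantization.fp8 (the import path and model filenames are placeholders, not confirmed by this page):

    # Sketch: convert a max-calibrated INT8 model to FP8 and save the result.
    # The import path and "model.int8.onnx" are assumptions for illustration.
    import onnx

    from modelopt.onnx.quantization.fp8 import int8_to_fp8

    fp8_model = int8_to_fp8("model.int8.onnx", verbose=True)  # returns a ModelProto
    onnx.save(fp8_model, "model.fp8.onnx")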

quantize(onnx_path, calibration_method='max', calibration_data_reader=None, calibration_cache_path=None, calibration_shapes=None, calibration_eps=['cuda:0', 'cpu', 'trt'], op_types_to_quantize=None, op_types_to_exclude=None, nodes_to_quantize=None, nodes_to_exclude=None, use_external_data_format=True, intermediate_generated_files=[], verbose=False, trt_extra_plugin_lib_paths=None, high_precision_dtype='fp16', mha_accumulation_dtype='fp32')

Applies FP8 GEMM-only quantization to an ONNX file.

Currently, quantization of the 'Conv', 'Gemm', and 'MatMul' op types is supported.

Parameters:
  • onnx_path (str) – Path to the input ONNX model.

  • calibration_method (str) – Calibration method to use; 'max' by default.

  • calibration_data_reader (CalibrationDataReader) – Data reader that supplies model inputs for calibration.

  • calibration_cache_path (str) – Path to a pre-computed calibration cache of activation tensor ranges.

  • calibration_shapes (str) – Input shapes to use during calibration.

  • calibration_eps (List[str]) – Priority-ordered list of execution providers to use for calibration, e.g. ['cuda:0', 'cpu', 'trt'].

  • op_types_to_quantize (List[str]) – Op types to quantize; if None, all supported op types are quantized.

  • op_types_to_exclude (List[str]) – Op types to exclude from quantization.

  • nodes_to_quantize (List[str]) – Node names to quantize; if None, all supported nodes are quantized.

  • nodes_to_exclude (List[str]) – Node names to exclude from quantization.

  • use_external_data_format (bool) – If True, model weights are saved in a separate external data file.

  • intermediate_generated_files (List[str]) – List collecting the paths of intermediate ONNX files generated during quantization.

  • verbose (bool) – Whether to print verbose logs or not.

  • trt_extra_plugin_lib_paths (str) – Paths to extra TensorRT plugin libraries required by the model.

  • high_precision_dtype (str) – Data type for tensors kept in high precision; 'fp16' by default.

  • mha_accumulation_dtype (str) – Accumulation data type for multi-head attention (MHA); 'fp32' by default.

Return type:

ModelProto
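
A minimal usage sketch under the same import-path assumption; the model path, input name, shapes, and random calibration data below are placeholders for illustration, not values from this page:

    # Sketch: FP8 GEMM-only quantization driven by a simple calibration data reader.
    # The import path, model path, input name, and shapes are assumptions.
    import numpy as np
    import onnx
    from onnxruntime.quantization import CalibrationDataReader

    from modelopt.onnx.quantization.fp8 import quantize


    class RandomDataReader(CalibrationDataReader):
        """Feeds a fixed number of random batches for calibration."""

        def __init__(self, input_name="input", shape=(1, 3, 224, 224), num_batches=32):
            self._batches = (
                {input_name: np.random.rand(*shape).astype(np.float32)}
                for _ in range(num_batches)
            )

        def get_next(self):
            # Return the next input feed dict, or None when calibration data is exhausted.
            return next(self._batches, None)


    fp8_model = quantize(
        "model.onnx",
        calibration_method="max",
        calibration_data_reader=RandomDataReader(),
        op_types_to_quantize=["Conv", "Gemm", "MatMul"],
        high_precision_dtype="fp16",
        verbose=True,
    )
    onnx.save(fp8_model, "model.fp8.onnx", save_as_external_data=True)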