quantize

Convert an ONNX model without QDQ nodes, together with calibration data, into an ONNX model with QDQ nodes.

Typically, quantizing linear operations such as Conv and MatMul gives most of the performance boost. However, many other op types are also quantizable (i.e., low-precision kernels are available for them) and can deliver good performance with a small accuracy drop. The default op types that this ONNX PTQ tool quantizes in each quantization mode are:

  • INT8: ['Add', 'AveragePool', 'BatchNormalization', 'Clip', 'Conv', 'ConvTranspose', 'Gemm', 'GlobalAveragePool', 'MatMul', 'MaxPool', 'Mul']

  • INT4: ['Gemm', 'MatMul']

  • FP8: ['Conv', 'Gemm', 'MatMul']

The tool inserts QDQ nodes following compiler-friendly patterns and generates an explicit ONNX model.
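For orientation, here is a minimal invocation sketch. The import path, file names, shapes, and input data below are assumptions for illustration and may differ in your installation:

  import numpy as np
  from modelopt.onnx.quantization import quantize  # assumed import path

  # A single numpy array works for single-input models; the shape must match
  # the model's input (here a hypothetical 32x3x224x224 image batch).
  calib = np.random.rand(32, 3, 224, 224).astype(np.float32)

  quantize(
      "model.onnx",            # hypothetical input model path
      calibration_data=calib,
      quantize_mode="int8",    # default mode; inserts INT8 QDQ nodes
  )
  # With output_path=None, writes model.quant.onnx next to model.onnx.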

Functions

quantize

quantize(onnx_path, calibration_data=None, calibration_method=None, calibration_cache_path=None, op_types_to_quantize=None, op_types_to_exclude=None, nodes_to_quantize=None, nodes_to_exclude=None, use_external_data_format=False, keep_intermediate_files=False, output_path=None, verbose=False, quantize_mode='int8')

Quantize the given onnx model.

Parameters:
  • onnx_path (str) – Path to the input onnx model.

  • calibration_data (ndarray | Dict[str, ndarray]) – Calibration data, either a single numpy array or a list/dict of numpy arrays (see the usage sketch after this entry).

  • calibration_method (str) – Calibration method for int8 quantization; options: 'entropy' (default) or 'minmax'.

  • calibration_cache_path (str) – Path to the calibration cache, i.e., pre-calculated activation tensor ranges.

  • op_types_to_quantize (List[str]) – List of operator types to quantize. When this list is not None, only the types in this list are quantized. Example: ['Conv'] indicates that only ops of type 'Conv' should be quantized. If this list is None (default), all supported operators are quantized. This flag does not support regular expressions.

  • op_types_to_exclude (List[str]) – List of operator types to exclude from quantization. This flag does not support regular expressions.

  • nodes_to_quantize (List[str]) – List of node names to quantize. When this list is not None, only the nodes in this list are quantized. Example: ['Conv__224', 'Conv__252']. If this list is None (default), all supported nodes are quantized. This flag does not support regular expressions.

  • nodes_to_exclude (List[str]) – List of node names to exclude from quantization. When this list is not None, the nodes in it are excluded from quantization. This flag supports regular expressions.

  • use_external_data_format (bool) – If True, the weights of the quantized model are stored in a separate external data file (required for models larger than 2 GB).

  • keep_intermediate_files (bool) – If False, only the final converted ONNX files are saved. Otherwise, all intermediate files generated during model conversion/calibration are kept.

  • output_path (str) – Output filename for the converted ONNX model. If None, the quantized model is saved in the same directory as the original ONNX model with a .quant suffix.

  • verbose (bool) – If True, prints details of node partitioning, selection, etc. throughout the quantization process.

  • quantize_mode (str) – Quantization mode. One of ['int8', 'int4_rtn', 'int4_rtn_dq', 'int4_rtn_trt', 'int4_rtn_trt_dq', 'int4_awq_clip', 'int4_awq_clip_trt', 'fp8']; 'int8' by default. The INT4-based modes perform weight-only quantization of Gemm and MatMul ops; the FP8 mode quantizes only Conv, Gemm, and MatMul ops.

Returns:

None. The quantized ONNX model is written to output_path, or to the same directory as the input model with a filename like "<model_name>.quant.onnx".

Return type:

None
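As a further illustration, here is a hedged sketch of a multi-input workflow: calibration data passed as a dict keyed by the graph's input names, minmax calibration, and a regex-based node exclusion. The import path, model paths, input names, and shapes are hypothetical:

  import numpy as np
  from modelopt.onnx.quantization import quantize  # assumed import path

  # For multi-input models, pass a dict keyed by the graph's input names.
  # Input names and shapes here are hypothetical.
  calib = {
      "input_ids": np.zeros((8, 128), dtype=np.int64),
      "attention_mask": np.ones((8, 128), dtype=np.int64),
  }

  quantize(
      "transformer.onnx",                  # hypothetical input model path
      calibration_data=calib,
      calibration_method="minmax",         # instead of the default 'entropy'
      nodes_to_exclude=[r"/lm_head/.*"],   # node exclusion supports regex
      use_external_data_format=True,       # weights in a side file (>2 GB models)
      output_path="transformer.quant.onnx",
      quantize_mode="int8",
  )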