modelopt.onnx.quantization.quantize

quantize(onnx_path, quantize_mode='int8', calibration_data=None, calibration_method=None, calibration_cache_path=None, calibration_shapes=None, calibration_eps=['cuda:0', 'cpu', 'trt'], op_types_to_quantize=None, op_types_to_exclude=None, nodes_to_quantize=None, nodes_to_exclude=None, use_external_data_format=False, keep_intermediate_files=False, output_path=None, verbose=False, trt_plugins=None, trt_plugins_precision=None, high_precision_dtype=None, mha_accumulation_dtype='fp32', disable_mha_qdq=False, dq_only=True, block_size=None, use_zero_point=False, **kwargs)

Quantizes the provided ONNX model.
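
A minimal usage sketch (the file paths, input name, and shape below are placeholders, not values prescribed by this API):

    import numpy as np
    from modelopt.onnx.quantization import quantize

    # Calibration data as a dict mapping input names to numpy arrays;
    # "input" must match the model's actual input name.
    calib = {"input": np.random.rand(32, 3, 224, 224).astype(np.float32)}

    quantize(
        "model.onnx",
        quantize_mode="int8",
        calibration_data=calib,
        output_path="model.quant.onnx",
    )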

Parameters:
  • onnx_path (str) – Path to the input ONNX model.

  • quantize_mode (str) – Quantization mode. One of ‘int8’ (default), ‘int4’, or ‘fp8’.

  • calibration_data (ndarray | Dict[str, ndarray]) – Calibration data, either a numpy array or a list/dict of numpy arrays. For multi-input models, use a dict mapping input names to arrays.

  • calibration_method (str) – Calibration method. Options depend on quantize_mode – int8: ‘entropy’ (default) or ‘max’; fp8: ‘max’ (default); int4: ‘awq_clip’ (default), ‘awq_lite’, ‘awq_full’, or ‘rtn_dq’.

  • calibration_cache_path (str) – Path to pre-calculated activation tensor ranges, also known as the calibration cache.

  • calibration_shapes (str) – Input shapes used for the calibration process.

  • calibration_eps (List[str]) – Priority order for the execution providers (EP) to calibrate the model. Any subset of [‘cuda:x’, ‘cpu’, ‘trt’], where ‘x’ is the device id.

  • op_types_to_quantize (List[str]) – List of op types to quantize. If None (default), all supported operators are quantized. This flag does not support regular expressions.

  • op_types_to_exclude (List[str]) – List of op types to exclude from quantization. This flag does not support regular expressions.

  • nodes_to_quantize (List[str]) – List of node names to quantize. If None (default), all supported nodes are quantized. This flag supports regular expressions.

  • nodes_to_exclude (List[str]) – List of node names to exclude from quantization. This flag supports regular expressions (see the example at the end of this section).

  • use_external_data_format (bool) – If True, the weights of the quantized model are stored in a separate data file (ONNX external data format).

  • keep_intermediate_files (bool) – If True, keep all intermediate files generated during the ONNX model’s conversion/calibration.

  • output_path (str) – Output filename to save the quantized ONNX model. If None, save in the same directory as the original ONNX model with .quant suffix.

  • verbose (bool) – If True, print details of node partitioning, node selection, etc. throughout the quantization process.

  • trt_plugins (str) – Specifies custom TensorRT plugin library paths in .so format (compiled shared libraries). For multiple paths, separate them with a semicolon, e.g. “lib_1.so;lib_2.so”. If this is not None or the model has custom ops, TensorrtExecutionProvider becomes the first choice as the calibration execution provider, meaning that TensorRT is required.

  • trt_plugins_precision (List[str]) – A list indicating the precision for each custom op. Each item should have the format <op_type>:<precision>, where precision can be fp32 (default) or fp16. For example: [‘op_type_1:fp16’, ‘op_type_2:fp32’].

  • high_precision_dtype (str) – High-precision data type, one of [‘fp32’, ‘fp16’]. If high_precision_dtype == ‘fp16’, the model’s weights and activations will be converted to fp16.

  • mha_accumulation_dtype (str) – MHA accumulation dtype. One of [‘fp32’, ‘fp16’], ‘fp32’ by default. If quantize_mode == ‘fp8’ and mha_accumulation_dtype == ‘fp32’, Cast nodes will be added to the input and output tensors of MHA’s bmm1 and bmm2.

  • disable_mha_qdq (bool) – If True, don’t add Q/DQ layers to MatMuls in the MHA pattern.

  • dq_only (bool) – If True (default), only add DQ nodes to the model. If False, add Q/DQ nodes to the model.

  • block_size (int | None) – Block size parameter for int4 quantization.

  • use_zero_point (bool) – If True, use zero-point based (asymmetric) quantization.

  • kwargs (Any) – Additional keyword arguments for int4 quantization (see the int4 sketch at the end of this section), including:
    - awqlite_alpha_step (float): Alpha step for awq_lite, range [0, 1].
    - awqclip_alpha_step (float): Alpha step to find the best alpha for awq_clip.
    - awqclip_alpha_min (float): Minimum alpha for awq_clip, range [awqclip_alpha_min, 1].
    - awqclip_bsz_col (int): Batch size for processing the column dimension in awq_clip.

Returns:

None. Writes the quantized ONNX model to the supplied output_path, or to the same directory as the input model with a filename like “<model_name>.quant.onnx”.

Return type:

None
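
Example of regex-based node exclusion, reusing the calib dict from the sketch above (the node-name patterns are hypothetical):

    # Skip quantization for any node whose name matches these patterns.
    quantize(
        "model.onnx",
        quantize_mode="int8",
        calibration_data=calib,
        nodes_to_exclude=[r"/lm_head/.*", r".*LayerNorm.*"],
        output_path="model.quant.onnx",
    )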
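
A sketch of int4 quantization with the AWQ keyword arguments listed above; the block size and alpha values are illustrative, not recommended defaults:

    quantize(
        "model.onnx",
        quantize_mode="int4",
        calibration_method="awq_clip",
        calibration_data=calib,
        block_size=128,            # illustrative block size
        awqclip_alpha_step=0.05,   # illustrative value
        awqclip_bsz_col=1024,      # illustrative value
        output_path="model.quant.onnx",
    )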