modelopt.onnx.quantization.quantize

quantize(onnx_path, quantize_mode='int8', calibration_data=None, calibration_method=None, calibration_cache_path=None, calibration_shapes=None, calibration_eps=['cpu', 'cuda:0', 'trt'], override_shapes=None, op_types_to_quantize=None, op_types_to_exclude=None, nodes_to_quantize=None, nodes_to_exclude=None, use_external_data_format=False, keep_intermediate_files=False, output_path=None, log_level='INFO', log_file=None, trt_plugins=None, trt_plugins_precision=None, high_precision_dtype=None, mha_accumulation_dtype='fp16', disable_mha_qdq=False, dq_only=True, block_size=None, use_zero_point=False, passes=['concat_elimination'], simplify=False, calibrate_per_node=False, **kwargs)

Quantizes the provided ONNX model.

Parameters:
  • onnx_path (str) – Path to the input ONNX model.

  • quantize_mode (str) – Quantization mode. One of ‘int8’ (default), ‘int4’ and ‘fp8’.

  • calibration_data (ndarray | dict[str, ndarray]) – Calibration data, either a single numpy array (for single-input models) or a dict mapping input names to numpy arrays.

  • calibration_method (str | None) – Calibration method. For int8/fp8 modes, one of {‘entropy’ (default), ‘max’}; for int4 mode, one of {‘awq_clip’ (default), ‘awq_lite’, ‘awq_full’, ‘rtn_dq’}.

  • calibration_cache_path (str | None) – Path to pre-calculated activation tensor ranges, also known as calibration cache.

  • calibration_shapes (str | None) – Input shapes used for the calibration process.

  • calibration_eps (list[str]) –

    Priority order for the execution providers (EP) to calibrate the model. Any subset of [‘trt’, ‘cuda:x’, ‘dml:x’, ‘cpu’], where ‘x’ is the device id.

    Note: If a custom op is detected in the model, ‘trt’ will automatically be added to the EP list.

  • override_shapes (str | None) – Override model input shapes with static shapes.

  • op_types_to_quantize (list[str] | None) – List of op types to quantize. If None (default), all supported operators are quantized. This flag does not support regular expressions.

  • op_types_to_exclude (list[str] | None) – List of op types to exclude from quantization. This flag does not support regular expressions.

  • nodes_to_quantize (list[str] | None) – List of node names to quantize. If None (default), all supported nodes are quantized. This flag supports regular expressions.

  • nodes_to_exclude (list[str] | None) – List of node names to exclude from quantization. This flag supports regular expressions.

  • use_external_data_format (bool) – If True, the weights of the quantized model are stored in a separate data file (ONNX external data format).

  • keep_intermediate_files (bool) – If True, keep all intermediate files generated during the ONNX model’s conversion/calibration.

  • output_path (str | None) – Output filename to save the quantized ONNX model. If None, the model is saved in the same directory as the original ONNX model with a ‘.quant’ suffix.

  • log_level (str) – Log level. One of ‘DEBUG’, ‘INFO’, ‘WARNING’, ‘ERROR’.

  • log_file (str | None) – Path to the log file for the quantization process.

  • trt_plugins (list[str] | None) – A space-separated list of custom TensorRT plugin library paths in .so format (compiled shared libraries). If this is not None or the model has custom ops, TensorrtExecutionProvider becomes the first choice of calibration execution provider, meaning that TensorRT is a requirement.

  • trt_plugins_precision (list[str] | None) – A space-separated list indicating the precision for each custom op. Each item should have the format <op_type>:<precision>, where precision can be fp32 (default) or fp16. For example: op_type_1:fp16 op_type_2:fp32.

  • high_precision_dtype (str | None) – High-precision data type, one of [‘fp32’, ‘fp16’]. If high_precision_dtype == ‘fp16’, the model’s weights and activations will be converted to fp16.

  • mha_accumulation_dtype (str) – MHA accumulation dtype, one of [‘fp32’, ‘fp16’]; ‘fp16’ by default. If quantize_mode == ‘fp8’ and mha_accumulation_dtype == ‘fp32’, Cast nodes will be added to the input and output tensors of MHA’s bmm1 and bmm2 nodes.

  • disable_mha_qdq (bool) – If True, don’t add Q/DQ layers to MatMuls in the MHA pattern.

  • dq_only (bool) – If True (default), only add DQ nodes to the model. If False, add Q/DQ nodes to the model.

  • block_size (int | None) – Block size parameter for int4 quantization.

  • use_zero_point (bool) – If True, use zero-point-based quantization.

  • passes (list[str]) – List of optimization pass names. If set, the appropriate pre/post-processing passes will be invoked.

  • simplify (bool) – If True, simplify the given model before quantization.

  • calibrate_per_node (bool) – Calibrate the model node by node instead of calibrating the entire model at once. This allows calibration with lower system memory usage at the cost of longer calibration time.

  • kwargs (Any) – Additional keyword arguments for int4 quantization (see the sketch after this list), including:

    - awqlite_alpha_step (float): Alpha step for lite, range [0, 1].

    - awqclip_alpha_step (float): Alpha step to find the best alpha for clip.

    - awqclip_alpha_min (float): Minimum alpha for clip, range [awqclip_alpha_min, 1].

    - awqclip_bsz_col (int): Batch size for processing the column dimension in clip.
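
A minimal int4 usage sketch for the options above, assuming quantize is importable from modelopt.onnx.quantization. The model path, input name, calibration shape, block size, and alpha step are illustrative assumptions, not recommended values.

    import numpy as np
    from modelopt.onnx.quantization import quantize  # assumed import path

    # Calibration batch keyed by input name; "input" and the shape below
    # are placeholder assumptions for illustration.
    calib = {"input": np.random.rand(8, 3, 224, 224).astype(np.float32)}

    quantize(
        "model.onnx",                   # hypothetical model path
        quantize_mode="int4",
        calibration_method="awq_clip",  # default int4 method
        calibration_data=calib,
        block_size=128,                 # illustrative block size
        awqclip_alpha_step=0.05,        # illustrative value, passed via kwargs
    )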

Returns:

None. Writes the quantized ONNX model to the supplied output_path, or to the same directory as the original model with a filename like “<model_name>.quant.onnx”.

Return type:

None
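
Example:

A minimal int8 usage sketch, assuming quantize is importable from modelopt.onnx.quantization. The model path, input name “input”, and calibration shape are assumptions for illustration.

    import numpy as np
    from modelopt.onnx.quantization import quantize  # assumed import path

    # Calibration data keyed by the model's input name ("input" and the
    # shape below are placeholder assumptions).
    calibration_data = {
        "input": np.random.rand(32, 3, 224, 224).astype(np.float32)
    }

    # With output_path=None (default), the result is written next to the
    # input model as model.quant.onnx.
    quantize(
        "model.onnx",                       # hypothetical model path
        quantize_mode="int8",               # default mode
        calibration_data=calibration_data,
        calibration_eps=["cuda:0", "cpu"],  # EP priority for calibration
    )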