modelopt.onnx.quantization.quantize
- quantize(onnx_path, quantize_mode='int8', calibration_data=None, calibration_method=None, calibration_cache_path=None, calibration_shapes=None, calibration_eps=['cpu'], op_types_to_quantize=None, op_types_to_exclude=None, nodes_to_quantize=None, nodes_to_exclude=None, use_external_data_format=False, keep_intermediate_files=False, output_path=None, verbose=False, trt_plugins=None, trt_plugins_precision=None, high_precision_dtype=None, mha_accumulation_dtype='fp16', disable_mha_qdq=False, dq_only=True, block_size=None, use_zero_point=False, **kwargs)
Quantizes the provided ONNX model.
- Parameters:
onnx_path (str) – Path to the input ONNX model.
quantize_mode (str) – Quantization mode. One of ‘int8’ (default), ‘int4’ and ‘fp8’.
calibration_data (ndarray | dict[str, ndarray]) – Calibration data: a single numpy array, or a list/dict of numpy arrays (dicts are keyed by model input name).
calibration_method (str) – Calibration method. The available options depend on quantize_mode: for ‘int8’, one of ‘entropy’ (default) or ‘max’; for ‘fp8’, ‘max’ (default); for ‘int4’, one of ‘awq_clip’ (default), ‘awq_lite’, ‘awq_full’, or ‘rtn_dq’.
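The per-mode options above can be summarized as a mapping, with the default method listed first in each entry. This table is illustrative, not part of the modelopt API:

```python
# Calibration methods per quantize_mode (first entry is the default).
# Illustrative summary only; not a modelopt constant.
CALIBRATION_METHODS = {
    "int8": ["entropy", "max"],
    "fp8": ["max"],
    "int4": ["awq_clip", "awq_lite", "awq_full", "rtn_dq"],
}
```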
calibration_cache_path (str) – Path to pre-calculated activation tensor ranges, also known as calibration cache.
calibration_eps (list[str]) –
Priority order for the execution providers (EP) to calibrate the model. Any subset of [‘trt’, ‘cuda:x’, ‘dml:x’, ‘cpu’], where ‘x’ is the device id.
Note
The order of EPs should follow the fallback logic. For example, to allow the model to run with CUDA or CPU, the EP list should be [‘cuda:0’, ‘cpu’], as layers that can’t run in CUDA can fall back to CPU, but not the other way. If TensorRT should also be enabled, then the EP list should be [‘trt’, ‘cuda:0’, ‘cpu’].
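The fallback ordering described in the note can be sketched as a small helper. The helper name and signature are hypothetical, purely to illustrate the required ordering; only the resulting lists are meaningful:

```python
# Hypothetical helper illustrating the EP fallback ordering described above
# (trt -> cuda -> cpu); it is not part of the modelopt API.
def build_calibration_eps(use_trt=False, cuda_device=None):
    """Return an EP priority list where each EP can fall back to the next."""
    eps = []
    if use_trt:
        eps.append("trt")
    if cuda_device is not None:
        eps.append(f"cuda:{cuda_device}")
    eps.append("cpu")  # CPU is the final fallback and must come last
    return eps

build_calibration_eps(cuda_device=0)                # ['cuda:0', 'cpu']
build_calibration_eps(use_trt=True, cuda_device=0)  # ['trt', 'cuda:0', 'cpu']
```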
op_types_to_quantize (list[str]) – List of op types to quantize. If None (default), all supported operators are quantized. This flag does not support regular expression.
op_types_to_exclude (list[str]) – List of op types to exclude from quantization. This flag does not support regular expression.
nodes_to_quantize (list[str]) – List of node names to quantize. If None (default), all supported nodes are quantized. This flag supports regular expression.
nodes_to_exclude (list[str]) – List of node names to exclude from quantization. This flag supports regular expression.
use_external_data_format (bool) – If True, the quantized model’s weights are stored in a separate external data file alongside the ONNX file.
keep_intermediate_files (bool) – If True, keep all intermediate files generated during the ONNX model’s conversion/calibration.
output_path (str) – Output filename to save the quantized ONNX model. If None, save in the same directory as the original ONNX model with .quant suffix.
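The default naming behavior (input name plus a .quant suffix) can be sketched as follows; this helper is illustrative only, not how modelopt derives the path internally:

```python
from pathlib import Path

# Sketch of the default output naming described above:
# "<model_name>.quant.onnx" next to the input model.
def default_output_path(onnx_path):
    p = Path(onnx_path)
    return str(p.with_name(p.stem + ".quant.onnx"))

default_output_path("models/bert.onnx")  # ends with "bert.quant.onnx"
```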
verbose (bool) – If True, print details of node partition, selection etc. throughout the quantization process.
trt_plugins (str) – Specifies custom TensorRT plugin library paths in .so format (compiled shared libraries). For multiple paths, separate them with a semicolon, e.g. “lib_1.so;lib_2.so”. If this is not None or the model has custom ops, TensorrtExecutionProvider becomes the first choice as calibration execution provider, meaning that TensorRT is required.
trt_plugins_precision (list[str]) – A space-separated list indicating the precision for each custom op. Each item should have the format <op_type>:<precision>, where precision can be fp32 (default) or fp16. For example: op_type_1:fp16 op_type_2:fp32.
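The two plugin-related arguments use different encodings: a single semicolon-joined string for trt_plugins, and a list of <op_type>:<precision> items for trt_plugins_precision. The library and op names below are placeholders:

```python
# Illustrative formatting of the plugin arguments; file names and op
# types are placeholders, not real plugins.
plugin_paths = ["lib_1.so", "lib_2.so"]
trt_plugins = ";".join(plugin_paths)  # single semicolon-separated string

# One "<op_type>:<precision>" item per custom op.
trt_plugins_precision = ["MyCustomOp:fp16", "OtherCustomOp:fp32"]
```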
high_precision_dtype (str) – High precision data type, one of [‘fp32’, ‘fp16’]. If high_precision_dtype == ‘fp16’, the model’s weights and activations will be converted to fp16.
mha_accumulation_dtype (str) – MHA accumulation dtype. One of [‘fp32’, ‘fp16’]. ‘fp16’ by default. If quantize_mode == ‘fp8’ and mha_accumulation_dtype == ‘fp32’, Cast nodes will be added to MHA’s bmm1 and bmm2’s input and output tensors.
disable_mha_qdq (bool) – Don’t add Q/DQ layers to MatMuls in MHA pattern.
dq_only (bool) – If True (default), only add DQ nodes to the model. If False, add Q/DQ nodes to the model.
block_size (int | None) – Block size parameter for int4 quantization.
use_zero_point (bool) – Use zero-point based quantization, if True.
kwargs (Any) – Additional keyword arguments for int4 quantization:
awqlite_alpha_step (float) – Alpha step for the awq_lite search, in the range [0, 1].
awqclip_alpha_step (float) – Alpha step used to find the best clipping alpha in awq_clip.
awqclip_alpha_min (float) – Minimum clipping alpha for awq_clip; the search covers [awqclip_alpha_min, 1].
awqclip_bsz_col (int) – Batch size for processing the column dimension in awq_clip.
calibration_shapes (str) – Input shapes to use during the calibration process.
- Returns:
None. Writes the quantized ONNX model to the supplied output_path, or to the same directory as the input model with a filename like “<model_name>.quant.onnx”.
- Return type:
None
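Putting the pieces together, a typical int8 call might look like the sketch below. The model path and input name are placeholders, and the quantize call itself is shown commented out since it requires modelopt to be installed and a real model on disk:

```python
import numpy as np

# Calibration inputs keyed by the model's input tensor names;
# "input_ids" and the shape are placeholders for illustration.
calibration_data = {"input_ids": np.zeros((4, 128), dtype=np.int64)}

# Hedged usage sketch (uncomment with modelopt installed):
# from modelopt.onnx.quantization import quantize
# quantize(
#     "model.onnx",
#     quantize_mode="int8",
#     calibration_method="entropy",       # default for int8
#     calibration_data=calibration_data,
#     calibration_eps=["cuda:0", "cpu"],  # CUDA with CPU fallback
#     output_path="model.quant.onnx",
# )
```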