modelopt.onnx.quantization.quantize
- quantize(onnx_path, quantize_mode='int8', calibration_data=None, calibration_method=None, calibration_cache_path=None, calibration_shapes=None, calibration_eps=['cuda:0', 'cpu', 'trt'], op_types_to_quantize=None, op_types_to_exclude=None, nodes_to_quantize=None, nodes_to_exclude=None, use_external_data_format=False, keep_intermediate_files=False, output_path=None, verbose=False, trt_plugins=None, trt_plugins_precision=None, high_precision_dtype=None, mha_accumulation_dtype='fp32', disable_mha_qdq=False, dq_only=True, block_size=None, use_zero_point=False, **kwargs)
Quantizes the provided ONNX model.
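A minimal usage sketch for the default int8 path (the model path, input shape, and calibration batch below are illustrative assumptions, not values prescribed by this API):

```python
import numpy as np
from modelopt.onnx.quantization import quantize

# Illustrative calibration batch; in practice, use real samples whose
# shape matches the model's input (here assumed to be Nx3x224x224).
calib = np.random.rand(32, 3, 224, 224).astype(np.float32)

quantize(
    "model.onnx",                    # input ONNX model (illustrative path)
    quantize_mode="int8",            # default mode
    calibration_data=calib,          # numpy array (or dict of arrays)
    calibration_method="entropy",    # int8 default
    output_path="model.quant.onnx",  # where to write the quantized model
)
```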
- Parameters:
onnx_path (str) – Path to the input ONNX model.
quantize_mode (str) – Quantization mode. One of ‘int8’ (default), ‘int4’, or ‘fp8’.
calibration_data (ndarray | Dict[str, ndarray]) – Calibration data: either a single numpy array or a list/dict of numpy arrays. For multi-input models, use a dict mapping input names to arrays (see the example at the end of this section).
calibration_method (str) – Calibration method. The available choices depend on quantize_mode – int8: ‘entropy’ (default) or ‘max’; fp8: ‘max’ (default); int4: ‘awq_clip’ (default), ‘awq_lite’, ‘awq_full’, or ‘rtn_dq’.
calibration_cache_path (str) – Path to pre-calculated activation tensor ranges, also known as calibration cache.
calibration_eps (List[str]) – Priority order for the execution providers (EP) to calibrate the model. Any subset of [‘cuda:x’, ‘dml:x’, ‘cpu’, ‘trt’], where ‘x’ is the device id.
op_types_to_quantize (List[str]) – List of op types to quantize. If None (default), all supported operators are quantized. This flag does not support regular expressions.
op_types_to_exclude (List[str]) – List of op types to exclude from quantization. This flag does not support regular expressions.
nodes_to_quantize (List[str]) – List of node names to quantize. If None (default), all supported nodes are quantized. This flag supports regular expressions.
nodes_to_exclude (List[str]) – List of node names to exclude from quantization. This flag supports regular expressions (demonstrated in the sketch after this list).
use_external_data_format (bool) – If True, a separate data file will be used to store the weights of the quantized model.
keep_intermediate_files (bool) – If True, keep all intermediate files generated during the ONNX model’s conversion/calibration.
output_path (str) – Output filename to save the quantized ONNX model. If None, save in the same directory as the original ONNX model with .quant suffix.
verbose (bool) – If True, prints details of node partitioning, selection, etc. throughout the quantization process.
trt_plugins (str) – Specifies custom TensorRT plugin library paths in .so format (compiled shared libraries). For multiple paths, separate them with a semicolon, e.g. “lib_1.so;lib_2.so”. If this is not None or the model has custom ops, TensorrtExecutionProvider becomes the first choice as the calibration execution provider, meaning that TensorRT is required.
trt_plugins_precision (List[str]) – A space-separated list indicating the precision for each custom op. Each item should have the format <op_type>:<precision>, where precision can be fp32 (default) or fp16. For example: op_type_1:fp16 op_type_2:fp32.
high_precision_dtype (str) – High-precision data type, one of [‘fp32’, ‘fp16’]. If high_precision_dtype == ‘fp16’, the model’s weights and activations will be converted to fp16.
mha_accumulation_dtype (str) – MHA accumulation dtype. One of [‘fp32’, ‘fp16’], ‘fp32’ by default. If quantize_mode == ‘fp8’ and mha_accumulation_dtype == ‘fp32’, Cast nodes will be added to the input and output tensors of MHA’s bmm1 and bmm2.
disable_mha_qdq (bool) – If True, do not add Q/DQ layers to MatMuls in the MHA pattern.
dq_only (bool) – If True (default), only DQ nodes are added to the model; if False, both Q and DQ nodes are added.
block_size (int | None) – Block size parameter for int4 quantization.
use_zero_point (bool) – If True, use zero-point-based (asymmetric) quantization.
kwargs (Any) – Additional keyword arguments for int4 quantization (see the int4 sketch after this parameter list), including:
  - awqlite_alpha_step (float): Alpha step for awq_lite, in the range [0, 1].
  - awqclip_alpha_step (float): Alpha step used to search for the best alpha in awq_clip.
  - awqclip_alpha_min (float): Minimum alpha for awq_clip; the search covers the range [awqclip_alpha_min, 1].
  - awqclip_bsz_col (int): Batch size for processing the column dimension in awq_clip.
calibration_shapes (str) – Input shapes to use for calibration, given as comma-separated <input_name>:<shape> pairs, e.g. ‘input0:1x3x256x256,input1:1x3x128x128’.
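The kwargs above apply to the int4 modes only. A hedged sketch of an int4 AWQ call that also exercises the regex support of nodes_to_exclude; all paths and numeric values below are illustrative assumptions, not recommended settings:

```python
import numpy as np
from modelopt.onnx.quantization import quantize

# Illustrative calibration batch of token IDs; real dataset samples are preferable.
calib = np.random.randint(0, 32000, size=(4, 512)).astype(np.int64)

quantize(
    "decoder.onnx",                     # illustrative path
    quantize_mode="int4",
    calibration_method="awq_clip",      # int4 default
    calibration_data=calib,
    block_size=128,                     # illustrative per-block granularity
    nodes_to_exclude=[r"/lm_head/.*"],  # node names support regular expressions
    use_external_data_format=True,      # keep weights in a separate data file
    awqclip_alpha_step=0.05,            # illustrative AWQ tuning values
    awqclip_bsz_col=1024,
)
```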
- Returns:
None. The quantized ONNX model is written to the supplied output_path or, if output_path is None, to the same directory as the input model with a filename like “<model_name>.quant.onnx”.
- Return type:
None
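For multi-input models, calibration_data can be a dict keyed by input name. The sketch below also combines fp8 quantization with fp16 compute and fp32 MHA accumulation; the input names, shapes, and model path are illustrative assumptions:

```python
import numpy as np
from modelopt.onnx.quantization import quantize

# Illustrative inputs for a two-input transformer block.
calib = {
    "input_ids": np.random.randint(0, 32000, size=(8, 128)).astype(np.int64),
    "attention_mask": np.ones((8, 128), dtype=np.int64),
}

quantize(
    "llm_block.onnx",               # illustrative path
    quantize_mode="fp8",
    calibration_data=calib,         # dict: input name -> numpy array
    calibration_eps=["cuda:0", "cpu"],
    high_precision_dtype="fp16",    # convert weights/activations to fp16
    mha_accumulation_dtype="fp32",  # accumulate MHA matmuls in fp32
)
```

With the default output_path=None, this writes llm_block.quant.onnx next to the input model.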