Performs INT4 WoQ on an ONNX model, and returns the ONNX ModelProto.
- quantize(onnx_path, calibration_method='awq_lite', calibration_data_reader=None, calibration_eps=['cuda:0', 'dml:0', 'cpu'], use_external_data_format=True, use_zero_point=False, block_size=None, nodes_to_exclude=['/lm_head'], **kwargs)
Applies INT4 Weight-Only-Quantization (WoQ) to an ONNX model.
Currently, only quantization of MatMul nodes is supported.
- Parameters:
onnx_path (str | ModelProto) – Input ONNX model (base model)
calibration_method (str) –
Selects the quantization algorithm. The most important options are:
- awq_lite: Applies AWQ scaling (alpha search) followed by INT4 quantization.
- awq_clip: Applies weight clipping followed by INT4 quantization.
calibration_data_reader (CalibrationDataReader) – Supplies the model inputs used for calibration; it can be assigned a list of model inputs. If it is None, randomly generated model inputs are used for calibration in the AWQ implementation.
calibration_eps (list[str]) –
The ONNX Execution Providers (EPs) to use for base model calibration. This list is passed to the session-creation API of onnxruntime (ORT) to perform the calibration.
Make sure that the ORT package for the chosen calibration EPs is set up properly, along with its dependencies.
use_external_data_format (bool) – If True, save tensors to external file(s) for quantized model.
use_zero_point (bool) – If True, enables zero-point based quantization.
block_size (int | None) – Block size for INT4 quantization. If None, the default value of 128 is used.
nodes_to_exclude (list[str] | None) –
List of node names (or substrings of node names) identifying the nodes to exclude from quantization.
By default, the /lm_head node is NOT quantized.
kwargs (Any) –
Additional keyword arguments for INT4 quantization, including:
- awqlite_alpha_step (float): Step size used to search for the best alpha in awq-lite. Range: [0, 1].
Default: 0.1.
- awqclip_alpha_step (float): Step size used to search for the best alpha in awq-clip.
Default: 0.05.
- awqclip_alpha_min (float): Minimum threshold for weight-clipping in awq-clip.
Default: 0.5.
- awqclip_bsz_col (int): Batch size for processing the column dimension in awq-clip.
Default: 1024.
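To make block_size and use_zero_point more concrete, the following is a minimal pure-Python sketch of generic block-wise 4-bit quantization. It is not the library's implementation: the helper names (quantize_block, quantize_blockwise, dequantize_blockwise) are hypothetical, and the choice of signed range [-8, 7] without a zero point and unsigned range [0, 15] with one is an assumption based on common INT4 schemes.

```python
def quantize_block(values, use_zero_point=False):
    """Quantize one block of floats to 4-bit integers.

    Returns (quantized_values, scale, zero_point). Hypothetical helper,
    not part of the documented API.
    """
    if use_zero_point:
        # Asymmetric: map [lo, hi] (widened to include 0) onto [0, 15].
        lo = min(min(values), 0.0)
        hi = max(max(values), 0.0)
        scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for an all-zero block
        zero_point = round(-lo / scale)
        q = [max(0, min(15, round(v / scale) + zero_point)) for v in values]
    else:
        # Symmetric: map [-max|v|, max|v|] onto the signed range [-8, 7].
        amax = max(abs(v) for v in values)
        scale = amax / 7 or 1.0
        zero_point = 0
        q = [max(-8, min(7, round(v / scale))) for v in values]
    return q, scale, zero_point


def quantize_blockwise(weights, block_size=128, use_zero_point=False):
    """Split a flat weight list into blocks and quantize each one separately."""
    return [
        quantize_block(weights[start:start + block_size], use_zero_point)
        for start in range(0, len(weights), block_size)
    ]


def dequantize_blockwise(blocks):
    """Reconstruct approximate float weights from the quantized blocks."""
    restored = []
    for q, scale, zero_point in blocks:
        restored.extend((x - zero_point) * scale for x in q)
    return restored
```

Each block carries its own scale (and zero point, if enabled), so a smaller block size tracks local weight ranges more closely at the cost of storing more scales.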
- Returns:
A quantized ONNX model in ONNX ModelProto format.
- Return type:
ModelProto
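A call using the parameters documented above might look like the sketch below. The import path modelopt.onnx.quantization.int4 is an assumption, since this section does not state the module name. The helpers build_quantize_kwargs and quantize_and_save are hypothetical names introduced only for illustration. Saving relies on onnx.save_model with save_as_external_data=True, which is one reasonable way to write a large ModelProto to disk.

```python
def build_quantize_kwargs(block_size=128, exclude=("/lm_head",)):
    """Collect the documented keyword arguments in one place (hypothetical helper)."""
    return {
        "calibration_method": "awq_lite",
        "calibration_data_reader": None,   # None: random inputs are used for calibration
        "calibration_eps": ["cpu"],        # EPs passed on to onnxruntime
        "use_external_data_format": True,
        "use_zero_point": False,
        "block_size": block_size,
        "nodes_to_exclude": list(exclude),
        "awqlite_alpha_step": 0.1,         # forwarded through **kwargs
    }


def quantize_and_save(onnx_path, output_path):
    """Quantize a model and write the result to disk (hypothetical helper)."""
    # Imports are local so that the kwargs helper works without these packages.
    import onnx
    # Assumption: the module path is not given in this section.
    from modelopt.onnx.quantization.int4 import quantize

    quantized_model = quantize(onnx_path, **build_quantize_kwargs())
    # The return value is a ModelProto, so it must be saved explicitly.
    onnx.save_model(quantized_model, output_path, save_as_external_data=True)
    return output_path
```

Keeping the arguments in a dictionary makes it easy to switch calibration_method to awq_clip and swap in the awqclip_* options without touching the call site.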