int4

Performs INT4 weight-only quantization (WoQ) on an ONNX model and returns the ONNX ModelProto.

Classes

AWQClipHelper

AWQ calibration helper class.

AWQLiteHelper

AWQ Lite calibration helper class.

Functions

dq_tensor

Dequantizes w with scale factors s.

find_scales

Find scale factors for w via s = max(w.block(block_size)) / 7.

get_act_scale

Get scale tensors for inputs.

get_act_to_weight_map_and_act_to_wa_pack_map

Returns subgraph-related maps keyed by activation name.

get_parent_child_nodes_map

Get mapping of parent nodes to their MatMul/Gemm nodes with quantizable weights.

get_scale

Get AWQ lite scales as described by 's' in the paper.

get_weight_scale

Get scale tensors for weights.

get_x_w_mean_for_subgraph

Returns the x-mean and w-mean for a subgraph.

quant_tensor

Quantize a tensor using the given block size and alpha.

quantize

Applies INT4 Weight-Only-Quantization (WoQ) to an ONNX model.

quantize_rtn

Quantizes onnx_model using the RTN (Round-to-Nearest) algorithm.

rtn

Quantizes w with scale factors s via Round-to-Nearest.

run_awq_scale_search_per_node

Iterates over each quantizable node to perform scale search.

run_awq_scale_search_per_subgraph

Iterates over each quantizable subgraph/sibling group to perform scale search.

class AWQClipHelper

Bases: object

AWQ calibration helper class.

__init__(w, block_size, **kwargs)

Initializes AWQClipHelper with a module weight.

Parameters:

block_size (int) –

alpha_step = 0.05
min_alpha = 0.5
update_best_params()

Updates the loss dictionary.
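Given the class constants above (alpha_step = 0.05, min_alpha = 0.5), the clip search presumably sweeps candidate alphas over a fixed grid. A sketch of such a grid follows; the grid construction, including whether the endpoint 1.0 is part of the sweep, is an assumption:

    import numpy as np

    # Candidate clipping ratios implied by min_alpha = 0.5 and alpha_step = 0.05.
    alphas = np.arange(0.5, 1.0 + 1e-9, 0.05)  # [0.5, 0.55, ..., 1.0]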

class AWQLiteHelper

Bases: object

AWQ Lite calibration helper class.

__init__(x, w, block_size, **kwargs)

Initializes AWQLiteHelper with a module's input activations and weight.

Parameters:

block_size (int) –

alpha_step = 0.1
update_best_params()

Updates best-alpha and best-scale.

dq_tensor(w, s, block_size, zp=None)

Dequantizes w with scale factors s.

Parameters:
  • w (ndarray) –

  • s (ndarray) –

  • block_size (int) –

  • zp (ndarray) –

Return type:

ndarray
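A minimal numpy sketch of block-wise dequantization, assuming the blocking convention described under quantize_rtn below (blocks along the first, Cin dimension); the helper name and shapes are illustrative, not the module's implementation:

    import numpy as np

    def dq_tensor_sketch(w_q, s, block_size, zp=None):
        """Expand per-block scales to w's shape and multiply back."""
        # w_q: (Cin, Cout) quantized weights; s: (Cin // block_size, Cout) scales
        s_full = np.repeat(s, block_size, axis=0)  # one scale row per weight row
        if zp is not None:
            w_q = w_q - np.repeat(zp, block_size, axis=0)  # undo zero-point shift
        return w_q.astype(np.float32) * s_full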

find_scales(w, block_size, alpha=1.0, use_zero_point=False)

Find scale factors for w via s = max(w.block(block_size)) / 7.

Parameters:
  • w (ndarray) –

  • block_size (int) –

  • alpha (float) –

  • use_zero_point (bool) –
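A sketch of the formula in the summary above, reading its max as max(|w|) over each block (the usual convention for symmetric quantization) and applying alpha as a multiplier; both readings are assumptions:

    import numpy as np

    def find_scales_sketch(w, block_size, alpha=1.0):
        """Per-block INT4 scales: s = max(|block|) / 7."""
        cin, cout = w.shape  # assumes Cin is divisible by block_size
        blocks = w.reshape(cin // block_size, block_size, cout)
        w_amax = np.abs(blocks).max(axis=1)  # (Cin // block_size, Cout)
        return w_amax * alpha / 7.0  # 7 = positive extreme of the INT4 range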

get_act_scale(x)

Get scale tensors for inputs.

get_act_to_weight_map_and_act_to_wa_pack_map(wa_pack)

Returns subgraph-related maps keyed by activation name.

This method returns two maps: (a) activation name to the input node's weight dimensions, and (b) activation name to the wa_pack indices that share that activation name.

Parameters:

wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –

get_parent_child_nodes_map(graph, wa_pack)

Get mapping of parent nodes to their MatMul/Gemm nodes with quantizable weights.

Parameters:
  • graph (GraphProto) –

  • wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –

get_scale(x_max, w_max, alpha, reduce_across_tp=False)

Get AWQ lite scales as described by ‘s’ in the paper.
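A hedged numpy sketch of the AWQ paper's formula, s = x_max**alpha / w_max**(1 - alpha); the eps clamp and final normalization follow the reference AWQ implementation and may differ from this module's:

    import numpy as np

    def get_scale_sketch(x_max, w_max, alpha, eps=1e-4):
        """Per-channel AWQ scale balancing activation vs. weight magnitudes."""
        s = np.power(x_max, alpha) / np.maximum(np.power(w_max, 1.0 - alpha), eps)
        # Normalize so the scales stay O(1), as the reference AWQ code does.
        return s / np.sqrt(s.max() * s.min())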

get_weight_scale(weight, block_size=None)

Get scale tensors for weights.

get_x_w_mean_for_subgraph(wa_pack, wa_pack_idx_list, augmented_onnx_path, x, block_size)

Returns the x-mean and w-mean for a subgraph.

Parameters:

wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –

quant_tensor(w, block_size, alpha=1.0, use_zero_point=False)

Quantize a tensor using the given block size and alpha, and return the quantized tensor.

Parameters:
  • w (ndarray) –

  • block_size (int) –

  • alpha (float) –

  • use_zero_point (bool) –
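Conceptually, quant_tensor composes scale search with round-to-nearest. A symmetric-only (use_zero_point=False) sketch, under the same assumptions as the find_scales and rtn sketches:

    import numpy as np

    def quant_tensor_sketch(w, block_size, alpha=1.0):
        """Find per-block scales, then round onto the INT4 grid [-8, 7]."""
        cin, cout = w.shape
        blocks = w.reshape(cin // block_size, block_size, cout)
        s = np.maximum(np.abs(blocks).max(axis=1) * alpha / 7.0, 1e-8)  # avoid /0
        w_q = np.clip(np.rint(w / np.repeat(s, block_size, axis=0)), -8, 7)
        return w_q.astype(np.int8), s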

quantize(onnx_path, calibration_method='awq_lite', calibration_data_reader=None, calibration_eps=['cuda:0', 'dml:0', 'cpu'], use_external_data_format=True, use_zero_point=False, block_size=None, nodes_to_exclude=['/lm_head'], **kwargs)

Applies INT4 Weight-Only-Quantization (WoQ) to an ONNX model.

Currently, only quantization of MatMul nodes is supported.

Parameters:
  • onnx_path (str | ModelProto) – Input ONNX model (base model)

  • calibration_method (str) –

    It determines the quantization algorithm. A few important algorithms are:

    • awq_lite: Applies AWQ scaling (Alpha search) followed by INT4 quantization.

    • awq_clip: Executes weight clipping and INT4 quantization.

  • calibration_data_reader (CalibrationDataReader) – It can be assigned a list of model inputs. If it is None, then a randomly generated model input will be used for calibration in the AWQ implementation.

  • calibration_eps (List[str]) –

    It denotes the ONNX Execution Providers (EPs) to use for base-model calibration. This list of EPs is passed to the session-creation API of onnxruntime (ORT) to perform base-model calibration.

    Note

    Make sure that the ORT package for the chosen calibration EPs is set up properly, along with its dependencies.

  • use_external_data_format (bool) – If True, save tensors to external file(s) for quantized model.

  • use_zero_point (bool) – If True, enables zero-point based quantization.

  • block_size (int | None) – Block size for INT4 quantization. Defaults to 128.

  • nodes_to_exclude (List[str] | None) –

    List of node names (or substrings of node names) denoting the nodes to exclude from quantization.

    Note

    By default, the lm-head node is NOT quantized.

  • kwargs (Any) –

    It denotes additional keyword arguments for INT4 quantization. These include:

    • awqlite_alpha_step (float): Step size for finding the best alpha in awq-lite. Range: [0, 1].

      Default: 0.1.

    • awqclip_alpha_step (float): Step size for finding the best alpha in awq-clip.

      Default: 0.05

    • awqclip_alpha_min (float): Minimum threshold for weight-clipping in awq-clip.

      Default: 0.5.

    • awqclip_bsz_col (int): Batch size for processing the column dimension in awq-clip.

      Default: 1024.

Return type:

ModelProto

Returns: A quantized ONNX model in ONNX ModelProto format.
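A usage sketch built from the signature above; the module path, file names, and the choice of kwargs are assumptions:

    import onnx
    from modelopt.onnx.quantization.int4 import quantize  # module path assumed

    # Calibrate with random inputs (calibration_data_reader=None) using AWQ-lite;
    # the default nodes_to_exclude already skips the lm-head.
    quantized_model = quantize(
        "model.onnx",                    # hypothetical input path
        calibration_method="awq_lite",
        calibration_eps=["cuda:0", "cpu"],
        block_size=128,
        awqlite_alpha_step=0.1,          # kwarg documented above
    )

    onnx.save(quantized_model, "model.int4.onnx", save_as_external_data=True)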

quantize_rtn(onnx_model, gemm_io_type, block_size, dq_only=False)

Quantizes onnx_model using the RTN (Round-to-Nearest) algorithm.

This algorithm computes scale factors as s = max(abs(block)) / 8 for each block. The quantized weights are computed via Q(w) = round_to_even(w / s), where round_to_even rounds ties to the nearest even integer (e.g., 1.5 and 2.5 both round to 2).

The first dimension (0) is always selected to block over, because blocking must run over the Cin dimension and, in ONNX, weights always appear on the right-hand side of the product (i.e. y = x @ W).

Parameters:
  • onnx_model (ModelProto) –

  • gemm_io_type (TensorProto.DataType) –

  • block_size (int) –

  • dq_only (bool) –

Return type:

ModelProto
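A usage sketch; treating gemm_io_type as an ONNX TensorProto data-type enum follows the parameter note above, and the module path and file name are assumptions:

    import onnx
    from onnx import TensorProto
    from modelopt.onnx.quantization.int4 import quantize_rtn  # module path assumed

    model = onnx.load("model.onnx")  # hypothetical path
    quantized = quantize_rtn(model, TensorProto.FLOAT16, block_size=128)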

rtn(w, s, block_size, zp=None)

Quantizes w with scale factors s via Round-to-Nearest.

Ties are broken by rounding to the nearest even number.

Parameters:
  • w (ndarray) –

  • s (ndarray) –

  • block_size (int) –

  • zp (ndarray) –

Return type:

ndarray
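numpy's rint implements exactly this ties-to-even rounding, so a faithful sketch is short; the block layout and the unsigned range used with a zero-point are assumptions consistent with the notes above:

    import numpy as np

    def rtn_sketch(w, s, block_size, zp=None):
        """Round-to-nearest (ties to even, as np.rint does) onto the INT4 grid."""
        s_full = np.repeat(s, block_size, axis=0)  # expand block scales to w's shape
        q = np.rint(w / s_full)
        if zp is None:
            return np.clip(q, -8, 7).astype(np.int8)  # symmetric INT4 range
        # With a zero-point, the grid is unsigned UINT4; the range is an assumption.
        zp_full = np.repeat(zp, block_size, axis=0)
        return np.clip(q + zp_full, 0, 15).astype(np.int8)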

run_awq_scale_search_per_node(wa_pack, augmented_onnx_path, block_size, use_zero_point, session, awq_lite, inputs, tqdm_msg_append_str, enable_weight_clipping, enable_fast_path_using_high_sysram, output_data, clip_alphas, **kwargs)

Iterates over each quantizable node to perform scale search.

Parameters:
  • wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –

  • kwargs (Any) –

run_awq_scale_search_per_subgraph(wa_pack, act_to_wa_pack_map, act_to_quant_nodes_weight_shape_map, augmented_onnx_path, block_size, use_zero_point, session, awq_lite, inputs, tqdm_msg_append_str, **kwargs)

Iterates over each quantizable subgraph/sibling group to perform scale search.

Parameters:
  • wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –

  • kwargs (Any) –