int4

Performs INT4 weight-only quantization (WoQ) on an ONNX model and returns the ONNX ModelProto.

Functions

quantize

Applies INT4 Weight-Only-Quantization (WoQ) to an ONNX model.

quantize(onnx_path, calibration_method='awq_lite', calibration_data_reader=None, calibration_eps=['cuda:0', 'dml:0', 'cpu'], use_external_data_format=True, use_zero_point=False, block_size=None, nodes_to_exclude=['/lm_head'], **kwargs)

Applies INT4 Weight-Only-Quantization (WoQ) to an ONNX model.

Currently, only quantization of MatMul nodes is supported. A usage sketch is shown in the example at the end of this section.

Parameters:
  • onnx_path (str | ModelProto) – Input ONNX model (base model)

  • calibration_method (str) –

    It determines the quantization algorithm. Notable algorithms include:

    • awq_lite: Applies AWQ scaling (Alpha search) followed by INT4 quantization.

    • awq_clip: Executes weight clipping and INT4 quantization.

  • calibration_data_reader (CalibrationDataReader) – It can be assigned a list of model inputs. If it is None, a randomly generated model input is used for calibration in the AWQ implementation. A minimal reader sketch is shown in the example at the end of this section.

  • calibration_eps (List[str]) –

    It denotes the ONNX Execution Providers (EPs) to use for base-model calibration. This list of EPs is passed to onnxruntime's (ORT) session-creation API to perform base-model calibration.

    Note

    Make sure that the ORT package for the chosen calibration EPs is set up properly, along with its dependencies.

  • use_external_data_format (bool) – If True, save the quantized model's tensors to external file(s).

  • use_zero_point (bool) – If True, enables zero-point based quantization.

  • block_size (int | None) – Block size for INT4 quantization. If None, a default block size of 128 is used.

  • nodes_to_exclude (List[str] | None) –

    List of node-names (or substrings of node-names) denoting the nodes to exclude from quantization.

    Note

    By default, the lm_head node is NOT quantized.

  • kwargs (Any) –

    It denotes additional keyword arguments for INT4 quantization. These include:

    • awqlite_alpha_step (float): Step size to find the best Alpha in awq-lite. Range: [0, 1].

      Default: 0.1.

    • awqclip_alpha_step (float): Step size to find the best Alpha in awq-clip.

      Default: 0.05.

    • awqclip_alpha_min (float): Minimum threshold for weight-clipping in awq-clip.

      Default: 0.5.

    • awqclip_bsz_col (int): Batch size for processing the column dimension in awq-clip.

      Default: 1024.

Returns:

A quantized ONNX model in ONNX ModelProto format.

Return type:

ModelProto
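Example

A minimal calibration-data-reader sketch, assuming the standard onnxruntime.quantization.CalibrationDataReader interface. The input name ("input_ids"), the sample shape, and the value range below are placeholders introduced for illustration and should be matched to the actual inputs of the model being calibrated.

import numpy as np
from onnxruntime.quantization import CalibrationDataReader


class RandomCalibrationDataReader(CalibrationDataReader):
    """Yields a fixed number of randomly generated model inputs for calibration."""

    def __init__(self, num_samples=32):
        # Pre-generate random calibration samples; each sample is a dict
        # mapping input names to numpy arrays. The input name and shape
        # here are placeholders for a real model's inputs.
        self._samples = iter(
            {"input_ids": np.random.randint(0, 32000, size=(1, 128), dtype=np.int64)}
            for _ in range(num_samples)
        )

    def get_next(self):
        # Return the next input dict, or None once calibration data is exhausted.
        return next(self._samples, None)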
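A usage sketch of quantize, assuming the function is importable as shown below; the import path, model file names, and parameter values are illustrative rather than prescriptive.

import onnx
from modelopt.onnx.quantization.int4 import quantize

quantized_model = quantize(
    "model.onnx",                                        # base ONNX model
    calibration_method="awq_lite",                       # AWQ scaling + INT4 quantization
    calibration_data_reader=RandomCalibrationDataReader(),
    calibration_eps=["cpu"],                             # use the CPU EP for calibration
    nodes_to_exclude=["/lm_head"],                       # keep the lm_head node unquantized
    awqlite_alpha_step=0.1,                              # kwarg: Alpha search step for awq-lite
)

# The return value is an onnx.ModelProto and can be saved like any ONNX model.
onnx.save(quantized_model, "model_int4.onnx")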