int4

Performs INT4 weight-only quantization (WoQ) on an ONNX model and returns the ONNX ModelProto.

Classes

AWQClipHelper

AWQ calibration helper class.

AWQLiteHelper

AWQ Lite calibration helper class.

Functions

dq_tensor

Dequantizes w with scale factors s.

find_scales

Find scale factors for w via s = max(w.block(block_size)) / 7.

get_act_scale

Get scale tensors for inputs.

get_act_to_weight_map_and_act_to_wa_pack_map

Returns subgraph-related maps keyed by activation name.

get_parent_child_nodes_map

Get mapping of parent nodes to their MatMul/Gemm nodes with quantizable weights.

get_scale

Get AWQ lite scales as described by 's' in the paper.

get_weight_scale

Get scale tensors for weights.

get_x_w_mean_for_subgraph

Returns the x-mean and w-mean for a subgraph.

quant_tensor

Quantize a tensor using the given block size and alpha.

quantize

Applies INT4 Weight-Only-Quantization (WoQ) to an ONNX model.

quantize_rtn

Quantizes onnx_model using the RTN (Round-to-Nearest) algorithm.

rtn

Quantizes w with scale factors s via Round-to-Nearest.

run_awq_scale_search_per_node

Iterates over each quantizable node to perform scale search.

run_awq_scale_search_per_subgraph

Iterates over each quantizable subgraph/sibling group to perform scale search.

class AWQClipHelper

Bases: object

AWQ calibration helper class.

__init__(w, block_size, **kwargs)

Initializes AWQClipHelper with a module weight.

Parameters:

block_size (int) –

alpha_step = 0.05
min_alpha = 0.5
update_best_params()

Updates the loss dictionary.
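Given the class constants above (alpha_step = 0.05, min_alpha = 0.5), the clip search presumably sweeps candidate alphas over a fixed grid. A sketch of such a grid follows; the grid construction, including whether the endpoint 1.0 is part of the sweep, is an assumption:

    import numpy as np

    # Candidate clipping ratios implied by min_alpha = 0.5 and alpha_step = 0.05.
    alphas = np.arange(0.5, 1.0 + 1e-9, 0.05)  # [0.5, 0.55, ..., 1.0]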

class AWQLiteHelper

Bases: object

AWQ Lite calibration helper class.

__init__(x, w, block_size, **kwargs)

Initializes AWQLiteHelper with a module's input activations and weight.

Parameters:

block_size (int) –

alpha_step = 0.1
update_best_params()

Updates best-alpha and best-scale.

dq_tensor(w, s, block_size, zp=None)

Dequantizes w with scale factors s.

Parameters:
  • w (ndarray) –

  • s (ndarray) –

  • block_size (int) –

  • zp (ndarray) –

Return type:

ndarray
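A minimal numpy sketch of block-wise dequantization, assuming the blocking convention described under quantize_rtn below (blocks along the first, Cin dimension); the helper name and shapes are illustrative, not the module's implementation:

    import numpy as np

    def dq_tensor_sketch(w_q, s, block_size, zp=None):
        """Expand per-block scales to w's shape and multiply back."""
        # w_q: (Cin, Cout) quantized weights; s: (Cin // block_size, Cout) scales
        s_full = np.repeat(s, block_size, axis=0)  # one scale row per weight row
        if zp is not None:
            w_q = w_q - np.repeat(zp, block_size, axis=0)  # undo zero-point shift
        return w_q.astype(np.float32) * s_full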

find_scales(w, block_size, alpha=1.0, use_zero_point=False)

Find scale factors for w via s = max(w.block(block_size)) / 7.

Parameters:
  • w (ndarray) –

  • block_size (int) –

  • alpha (float) –

  • use_zero_point (bool) –
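A sketch of the formula in the summary above, reading its max as max(|w|) over each block (the usual convention for symmetric quantization) and applying alpha as a multiplier; both readings are assumptions:

    import numpy as np

    def find_scales_sketch(w, block_size, alpha=1.0):
        """Per-block INT4 scales: s = max(|block|) / 7."""
        cin, cout = w.shape  # assumes Cin is divisible by block_size
        blocks = w.reshape(cin // block_size, block_size, cout)
        w_amax = np.abs(blocks).max(axis=1)  # (Cin // block_size, Cout)
        return w_amax * alpha / 7.0  # 7 = positive extreme of the INT4 range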

get_act_scale(x)

Get scale tensors for inputs.

get_act_to_weight_map_and_act_to_wa_pack_map(wa_pack)

Returns subgraph-related maps keyed by activation name.

This method returns two maps: (a) activation name to the input node's weight dimensions, and (b) activation name to the wa_pack indices that share that activation name.

Parameters:

wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –

get_parent_child_nodes_map(graph, wa_pack)

Get mapping of parent nodes to their MatMul/Gemm nodes with quantizable weights.

Parameters:
  • graph (GraphProto) –

  • wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –

get_scale(x_max, w_max, alpha, reduce_across_tp=False)

Get AWQ lite scales as described by ‘s’ in the paper.
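A hedged numpy sketch of the AWQ paper's formula, s = x_max**alpha / w_max**(1 - alpha); the eps clamp and final normalization follow the reference AWQ implementation and may differ from this module's:

    import numpy as np

    def get_scale_sketch(x_max, w_max, alpha, eps=1e-4):
        """Per-channel AWQ scale balancing activation vs. weight magnitudes."""
        s = np.power(x_max, alpha) / np.maximum(np.power(w_max, 1.0 - alpha), eps)
        # Normalize so the scales stay O(1), as the reference AWQ code does.
        return s / np.sqrt(s.max() * s.min())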

get_weight_scale(weight, block_size=None)

Get scale tensors for weights.

get_x_w_mean_for_subgraph(wa_pack, wa_pack_idx_list, augmented_onnx_path, x, block_size)

Returns the x-mean and w-mean for a subgraph.

Parameters:

wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –

quant_tensor(w, block_size, alpha=1.0, use_zero_point=False)

Quantize a tensor using the given block size and alpha, and return the quantized tensor.

Parameters:
  • w (ndarray) –

  • block_size (int) –

  • alpha (float) –

  • use_zero_point (bool) –
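Conceptually, quant_tensor composes scale search with round-to-nearest. A symmetric-only (use_zero_point=False) sketch, under the same assumptions as the find_scales and rtn sketches:

    import numpy as np

    def quant_tensor_sketch(w, block_size, alpha=1.0):
        """Find per-block scales, then round onto the INT4 grid [-8, 7]."""
        cin, cout = w.shape
        blocks = w.reshape(cin // block_size, block_size, cout)
        s = np.maximum(np.abs(blocks).max(axis=1) * alpha / 7.0, 1e-8)  # avoid /0
        w_q = np.clip(np.rint(w / np.repeat(s, block_size, axis=0)), -8, 7)
        return w_q.astype(np.int8), s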

quantize(onnx_path, calibration_method='awq_lite', calibration_data_reader=None, calibration_eps=['cuda:0', 'dml:0', 'cpu'], use_external_data_format=True, use_zero_point=False, block_size=None, nodes_to_exclude=['/lm_head'], **kwargs)

Applies INT4 Weight-Only-Quantization (WoQ) to an ONNX model.

Currently, only quantization of MatMul nodes is supported.

Parameters:
  • onnx_path (str | ModelProto) – Input ONNX model (base model)

  • calibration_method (str) –

    It determines the quantization algorithm. A few important algorithms are:

    • awq_lite: Applies AWQ scaling (Alpha search) followed by INT4 quantization.

    • awq_clip: Executes weight clipping and INT4 quantization.

  • calibration_data_reader (CalibrationDataReader) – It can be assigned a list of model inputs. If it is None, then a randomly generated model input will be used for calibration in the AWQ implementation.

  • calibration_eps (List[str]) –

    It denotes the ONNX Execution Providers (EPs) to use for base-model calibration. This list of EPs is passed to the session-creation API of onnxruntime (ORT) to perform base-model calibration.

    Note

    Make sure that the ORT package for the chosen calibration EPs is set up properly, along with its dependencies.

  • use_external_data_format (bool) – If True, save tensors to external file(s) for quantized model.

  • use_zero_point (bool) – If True, enables zero-point based quantization.

  • block_size (int | None) – Block size for INT4 quantization. Defaults to 128.

  • nodes_to_exclude (List[str] | None) –

    List of node names (or substrings of node names) denoting the nodes to exclude from quantization.

    Note

    By default, the lm-head node is NOT quantized.

  • kwargs (Any) –

    It denotes additional keyword arguments for INT4 quantization. These include:

    • awqlite_alpha_step (float): Step size for finding the best alpha in awq-lite. Range: [0, 1].

      Default: 0.1.

    • awqclip_alpha_step (float): Step size for finding the best alpha in awq-clip.

      Default: 0.05

    • awqclip_alpha_min (float): Minimum threshold for weight-clipping in awq-clip.

      Default: 0.5.

    • awqclip_bsz_col (int): Batch size for processing the column dimension in awq-clip.

      Default: 1024.

Return type:

ModelProto

Returns: A quantized ONNX model in ONNX ModelProto format.
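A usage sketch built from the signature above; the module path, file names, and the choice of kwargs are assumptions:

    import onnx
    from modelopt.onnx.quantization.int4 import quantize  # module path assumed

    # Calibrate with random inputs (calibration_data_reader=None) using AWQ-lite;
    # the default nodes_to_exclude already skips the lm-head.
    quantized_model = quantize(
        "model.onnx",                    # hypothetical input path
        calibration_method="awq_lite",
        calibration_eps=["cuda:0", "cpu"],
        block_size=128,
        awqlite_alpha_step=0.1,          # kwarg documented above
    )

    onnx.save(quantized_model, "model.int4.onnx", save_as_external_data=True)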

quantize_rtn(onnx_model, gemm_io_type, block_size, dq_only=False)

Quantizes onnx_model using the RTN (Round-to-Nearest) algorithm.

This algorithm computes scale factors as s = max(abs(block)) / 8 for each block. The quantized weights are computed via Q(w) = round_to_even(w / s), where round_to_even rounds ties to the nearest even integer (e.g., 1.5 and 2.5 both round to 2).

The first dimension (0) is always selected to block over, because blocking must run over the Cin dimension and, in ONNX, weights always appear on the right-hand side of the product (i.e. y = x @ W).

Parameters:
  • onnx_model (ModelProto) –

  • gemm_io_type (TensorProto.DataType) –

  • block_size (int) –

  • dq_only (bool) –

Return type:

ModelProto
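A usage sketch; treating gemm_io_type as an ONNX TensorProto data-type enum follows the parameter note above, and the module path and file name are assumptions:

    import onnx
    from onnx import TensorProto
    from modelopt.onnx.quantization.int4 import quantize_rtn  # module path assumed

    model = onnx.load("model.onnx")  # hypothetical path
    quantized = quantize_rtn(model, TensorProto.FLOAT16, block_size=128)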

rtn(w, s, block_size, zp=None)

Quantizes w with scale factors s via Round-to-Nearest.

Ties are broken by rounding to the nearest even number.

Parameters:
  • w (ndarray) –

  • s (ndarray) –

  • block_size (int) –

  • zp (ndarray) –

Return type:

ndarray
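numpy's rint implements exactly this ties-to-even rounding, so a faithful sketch is short; the block layout and the unsigned range used with a zero-point are assumptions consistent with the notes above:

    import numpy as np

    def rtn_sketch(w, s, block_size, zp=None):
        """Round-to-nearest (ties to even, as np.rint does) onto the INT4 grid."""
        s_full = np.repeat(s, block_size, axis=0)  # expand block scales to w's shape
        q = np.rint(w / s_full)
        if zp is None:
            return np.clip(q, -8, 7).astype(np.int8)  # symmetric INT4 range
        # With a zero-point, the grid is unsigned UINT4; the range is an assumption.
        zp_full = np.repeat(zp, block_size, axis=0)
        return np.clip(q + zp_full, 0, 15).astype(np.int8)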

run_awq_scale_search_per_node(wa_pack, augmented_onnx_path, block_size, use_zero_point, session, awq_lite, inputs, tqdm_msg_append_str, enable_weight_clipping, enable_fast_path_using_high_sysram, output_data, clip_alphas, **kwargs)

Iterates over each quantizable node to perform scale search.

Parameters:
  • wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –

  • kwargs (Any) –

run_awq_scale_search_per_subgraph(wa_pack, act_to_wa_pack_map, act_to_quant_nodes_weight_shape_map, augmented_onnx_path, block_size, use_zero_point, session, awq_lite, inputs, tqdm_msg_append_str, **kwargs)

Iterates over each quantizable subgraph/sibling group to perform scale search.

Parameters:
  • wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –

  • kwargs (Any) –