int4

Performs INT4 weight-only quantization (WoQ) on an ONNX model and returns the quantized ONNX ModelProto.

Classes

AWQClipHelper

AWQ calibration helper class.

AWQLiteHelper

AWQ Lite calibration helper class.

Functions

dq_tensor

Dequantizes w with scale factors s.

find_scales

Find scale factors for w via s = max(w.block(block_size)) / 7.

get_act_scale

Get scale tensors for inputs.

get_scale

Get AWQ lite scales as described by 's' in the paper.

get_weight_scale

Get scale tensors for weights.

quant_tensor

Quantize a tensor blockwise using the given block size and alpha.

quantize

Applies INT4 WoQ (Weight-Only-Quantization) to an ONNX file.

quantize_awq_clip

Quantizes onnx_model using the Activation-aware Weight Quantization (AWQ) algorithm.

quantize_awq_lite

Quantizes onnx_model using the Activation-aware Weight Quantization (AWQ) algorithm.

quantize_rtn

Quantizes onnx_model using the RTN (Round-to-Nearest) algorithm.

rtn

Quantizes w with scale factors s via Round-to-Nearest.

class AWQClipHelper

Bases: object

AWQ calibration helper class.

__init__(w, block_size)

Initializes AWQClipHelper with a module weight.

Parameters:

block_size (int) –

alpha_step = 0.05
alphas = [0.5, 0.55, 0.6, 0.65, 0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 1.0]
min_alpha = 0.5
update_best_params()

Updates the loss dictionary.

class AWQLiteHelper

Bases: object

AWQ Lite calibration helper class.

__init__(x, w, block_size)

Initializes AWQLiteHelper with a module's input activations and weight.

Parameters:

block_size (int) –

alpha_step = 0.1
dq_tensor(w, s, block_size)

Dequantizes w with scale factors s.

Parameters:
  • w (ndarray) –

  • s (ndarray) –

  • block_size (int) –

Return type:

ndarray
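
For illustration, a minimal NumPy sketch of blockwise dequantization with this signature (assuming blocking along dimension 0, as the RTN description further below states; this is not the module's implementation):

    import numpy as np

    def dq_tensor_sketch(w_q: np.ndarray, s: np.ndarray, block_size: int) -> np.ndarray:
        """Dequantize blockwise: each block of `block_size` rows shares one scale per column."""
        rows, cols = w_q.shape
        w_blocked = w_q.reshape(rows // block_size, block_size, cols).astype(np.float32)
        # s holds one scale per (block, column); broadcast it over the block dimension.
        return (w_blocked * s.reshape(rows // block_size, 1, cols)).reshape(rows, cols)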

find_scales(w, block_size, alpha=1.0)

Find scale factors for w via s = max(w.block(block_size)) / 7.

Parameters:
  • w (ndarray) –

  • block_size (int) –

  • alpha (float) –

Return type:

ndarray
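
A minimal NumPy sketch of the blockwise scale computation described above. It takes the per-block maximum of absolute values and divides by 7 (the INT4 positive limit); treating the maximum as an absolute maximum is an assumption, and this is an illustration rather than the module's implementation:

    import numpy as np

    def find_scales_sketch(w: np.ndarray, block_size: int, alpha: float = 1.0) -> np.ndarray:
        """One scale per (block, column): s = max(|block|) / 7, optionally shrunk by alpha."""
        rows, cols = w.shape
        w_blocked = np.abs(w.reshape(rows // block_size, block_size, cols))
        s = w_blocked.max(axis=1) / 7.0      # shape: (rows // block_size, cols)
        return s * alpha                     # alpha < 1 clips the effective weight range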

get_act_scale(x)

Get scale tensors for inputs.
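
In the AWQ paper the activation statistic is the per-channel mean of absolute input values; a sketch under that assumption (not necessarily this module's exact computation):

    import numpy as np

    def get_act_scale_sketch(x: np.ndarray) -> np.ndarray:
        """Per-input-channel mean of |x|, flattening all leading (batch/token) dims."""
        return np.abs(x).reshape(-1, x.shape[-1]).mean(axis=0)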

get_scale(x_max, w_max, alpha, reduce_across_tp=False)

Get AWQ lite scales as described by ‘s’ in the paper.
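
The AWQ paper searches per-channel scales of the form s = x_max**alpha / w_max**(1 - alpha); a hedged sketch of that formulation (the clamping and normalization steps are assumptions borrowed from common AWQ implementations, not necessarily this module's):

    import numpy as np

    def get_scale_sketch(x_max: np.ndarray, w_max: np.ndarray, alpha: float) -> np.ndarray:
        """s grows with activation magnitude and shrinks with weight magnitude."""
        s = (np.power(x_max, alpha) / np.power(w_max, 1.0 - alpha)).clip(min=1e-4)
        # Normalize so the scales are centered around 1 (common in AWQ implementations).
        return s / np.sqrt(s.max() * s.min())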

get_weight_scale(weight, block_size=None)

Get scale tensors for weights.

quant_tensor(w, block_size, alpha=1.0)

Quantize a tensor blockwise using the given block size and alpha, and return the quantized tensor.

Parameters:
  • w (ndarray) –

  • block_size (int) –

  • alpha (float) –
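
Conceptually, quant_tensor composes the helpers documented here; a hedged sketch of that composition (the real function's exact return values are not documented above and are an assumption):

    def quant_tensor_sketch(w, block_size, alpha=1.0):
        """Find blockwise scales, then round-to-nearest quantize with them."""
        s = find_scales_sketch(w, block_size, alpha)
        w_q = rtn_sketch(w, s, block_size)   # see the rtn sketch further below
        return w_q, s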

quantize(onnx_path, calibration_method='awq_clip', calibration_data_reader=None, use_external_data_format=True)

Applies INT4 WoQ (Weight-Only-Quantization) to an ONNX file.

Currently only GEMM quantization is supported.

Parameters:
  • onnx_path (str) –

  • calibration_method (str) –

  • calibration_data_reader (CalibrationDataReader) –

  • use_external_data_format (bool) –

Return type:

ModelProto
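
A usage sketch for the top-level entry point. The import path (assumed to be modelopt.onnx.quantization.int4), the model path, and my_reader are placeholders; my_reader stands for any CalibrationDataReader, such as the one sketched after quantize_awq_lite below:

    import onnx
    from modelopt.onnx.quantization.int4 import quantize  # assumed import path

    quantized = quantize(
        "model.onnx",                       # placeholder ONNX file
        calibration_method="awq_clip",      # the documented default
        calibration_data_reader=my_reader,  # placeholder CalibrationDataReader instance
        use_external_data_format=True,
    )
    onnx.save(quantized, "model.int4.onnx")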

quantize_awq_clip(onnx_model, data_reader, use_external_data_format, force_fp16=False)

Quantizes onnx_model using the Activation-aware Weight Quantization (AWQ) algorithm.

Parameters:
  • onnx_model (ModelProto) –

  • data_reader (CalibrationDataReader) –

  • use_external_data_format (bool) –

  • force_fp16 (bool) –

Return type:

ModelProto
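
The clip search implied by AWQClipHelper can be pictured as follows: for each candidate alpha, the weight range is clipped to alpha times its block maximum, quantized, and the reconstruction error against calibration activations is measured, keeping the best alpha. A heavily simplified, hypothetical sketch built from the helper sketches in this section (not the module's implementation):

    import numpy as np

    def clip_search_sketch(x: np.ndarray, w: np.ndarray, block_size: int,
                           alphas=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)) -> float:
        """Return the alpha whose clipped quantization best reconstructs x @ w."""
        ref = x @ w
        best_alpha, best_err = 1.0, np.inf
        for alpha in alphas:
            s = find_scales_sketch(w, block_size, alpha)
            # rtn_sketch is defined in the RTN sketch further below.
            w_dq = dq_tensor_sketch(rtn_sketch(w, s, block_size), s, block_size)
            err = np.mean((ref - x @ w_dq) ** 2)
            if err < best_err:
                best_alpha, best_err = alpha, err
        return best_alpha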

quantize_awq_lite(onnx_model, data_reader, use_external_data_format, force_fp16=False, enable_fast_path_using_high_sysram=False)

Quantizes onnx_model using the Activation-aware Weight Quantization (AWQ) algorithm.

Parameters:
  • onnx_model (ModelProto) –

  • data_reader (CalibrationDataReader) –

  • use_external_data_format (bool) –

  • force_fp16 (bool) –

  • enable_fast_path_using_high_sysram (bool) –

Return type:

ModelProto
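
Both AWQ entry points consume an onnxruntime-style CalibrationDataReader. A minimal sketch of one; the input name, shape, and random data are placeholders to be replaced with real calibration samples:

    import numpy as np
    from onnxruntime.quantization import CalibrationDataReader

    class RandomDataReader(CalibrationDataReader):
        """Feeds a handful of random samples; replace with real calibration data."""

        def __init__(self, input_name: str = "input", num_samples: int = 8):
            self._data = iter(
                [{input_name: np.random.rand(1, 3, 224, 224).astype(np.float32)}
                 for _ in range(num_samples)]
            )

        def get_next(self):
            return next(self._data, None)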

quantize_rtn(onnx_model, gemm_io_type, dq_only=False)

Quantizes onnx_model using the RTN (Round-to-Nearest) algorithm.

This algorithm computes a scale factor s = max(abs(block)) / 8 for each block. The quantized weights are then computed as Q(w) = round_to_even(w / s), where round_to_even rounds ties to the nearest even integer (e.g. both 1.5 and 2.5 round to 2).

The first dimension (0) is always selected to block over, because blocking must be done along the Cin dimension and, in ONNX, weights always appear on the right-hand side of the matmul (i.e. y = x @ W).

Parameters:
  • onnx_model (ModelProto) –

  • gemm_io_type (onnx.TensorProto.DataType) –

  • dq_only (bool) –

Return type:

ModelProto
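
A minimal NumPy sketch of the blockwise RTN scheme described above (blocking along dimension 0, scale = max(abs(block)) / 8, rounding half to even); this is an illustration, not the module's implementation:

    import numpy as np

    def rtn_scales_sketch(w: np.ndarray, block_size: int) -> np.ndarray:
        """s = max(abs(block)) / 8 for each block along dimension 0."""
        rows, cols = w.shape
        w_blocked = np.abs(w.reshape(rows // block_size, block_size, cols))
        return w_blocked.max(axis=1) / 8.0

    def rtn_sketch(w: np.ndarray, s: np.ndarray, block_size: int) -> np.ndarray:
        """Q(w) = round_to_even(w / s), applied blockwise along dimension 0."""
        rows, cols = w.shape
        w_blocked = w.reshape(rows // block_size, block_size, cols)
        q = np.rint(w_blocked / s.reshape(rows // block_size, 1, cols))  # np.rint rounds ties to even
        return np.clip(q, -8, 7).reshape(rows, cols)                     # clamp to the INT4 range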

rtn(w, s, block_size)

Quantizes w with scale factors s via Round-to-Nearest.

Ties are broken by rounding to the nearest even number.

Parameters:
  • w (ndarray) –

  • s (ndarray) –

  • block_size (int) –

Return type:

ndarray
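
Round-half-to-even is NumPy's default rounding mode, so the tie behavior mentioned above can be checked directly:

    import numpy as np

    print(np.rint([0.5, 1.5, 2.5, 3.5]))   # -> [0. 2. 2. 4.]  (ties go to the even neighbor)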