int4
Performs INT4 weight-only quantization (WoQ) on an ONNX model and returns the ONNX ModelProto.
Classes
- AWQClipHelper — AWQ calibration helper class.
- AWQLiteHelper — AWQ Lite calibration helper class.
Functions
- dq_tensor — Dequantizes w with scale factors s.
- find_scales — Finds scale factors for w via s = max(w.block(block_size)) / 7.
- get_act_scale — Gets scale tensors for inputs.
- get_act_to_weight_map_and_act_to_wa_pack_map — Returns subgraph-related maps keyed by activation name.
- get_scale — Gets AWQ lite scales as described by 's' in the paper.
- get_weight_scale — Gets scale tensors for weights.
- get_x_w_mean_for_subgraph — Returns x-mean and w-mean.
- quant_tensor — Quantizes a tensor and returns the quantized tensor.
- quantize — Applies INT4 WoQ (Weight-Only Quantization) to an ONNX file.
- quantize_rtn — Quantizes onnx_model using the RTN (Round-to-Nearest) algorithm.
- rtn — Quantizes w with scale factors s via Round-to-Nearest.
- run_awq_scale_search_per_node — Iterates over each quantizable node for scale search.
- run_awq_scale_search_per_subgraph — Iterates over each quantizable subgraph/sibling group for scale search.
- class AWQClipHelper
Bases:
object
AWQ calibration helper class.
- __init__(w, block_size, **kwargs)
Initializes AWQClipHelper with a module weight.
- Parameters:
block_size (int) –
- alpha_step = 0.05
- min_alpha = 0.5
- update_best_params()
Updates the loss dictionary.
- class AWQLiteHelper
Bases:
object
AWQ Lite calibration helper class.
- __init__(x, w, block_size, **kwargs)
Initializes AWQLiteHelper with a module weight.
- Parameters:
block_size (int) –
- alpha_step = 0.1
- update_best_params()
Updates best-alpha and best-scale.
- dq_tensor(w, s, block_size, zp=None)
Dequantizes w with scale factors s.
- Parameters:
w (ndarray) –
s (ndarray) –
block_size (int) –
zp (ndarray) –
- Return type:
ndarray
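A minimal NumPy sketch of the blockwise dequantization that `dq_tensor` describes: each block of `block_size` consecutive values along axis 0 shares one scale (and optional zero-point). The reshaping layout used here is an assumption; the library may arrange blocks differently.

```python
import numpy as np

def dq_tensor_sketch(q, s, block_size, zp=None):
    """Dequantize q blockwise: each block of `block_size` rows shares
    one scale per column (and an optional zero-point)."""
    n_blocks = q.shape[0] // block_size
    blocks = q.reshape(n_blocks, block_size, *q.shape[1:]).astype(np.float32)
    if zp is not None:
        blocks = blocks - zp[:, None]           # shift by per-block zero-point
    return (blocks * s[:, None]).reshape(q.shape)

q = np.array([[-8, 7], [4, -2], [1, 0], [3, -5]], dtype=np.int8)
s = np.array([[0.5, 0.25], [0.1, 0.2]], dtype=np.float32)  # one scale per 2-row block per column
w = dq_tensor_sketch(q, s, block_size=2)
```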
- find_scales(w, block_size, alpha=1.0, use_zero_point=False)
Find scale factors for w via s = max(w.block(block_size)) / 7.
- Parameters:
w (ndarray) –
block_size (int) –
alpha (float) –
use_zero_point (bool) –
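Following the formula in the summary, a NumPy sketch of per-block scale computation for signed INT4 (values in [-8, 7], hence the division by 7); the block layout and the role of `alpha` as a shrink factor are assumptions:

```python
import numpy as np

def find_scales_sketch(w, block_size, alpha=1.0):
    """Per-block scale for signed INT4: s = max(|block|) / 7, optionally
    shrunk by alpha (clip-style search trades range for resolution)."""
    n_blocks = w.shape[0] // block_size
    blocks = w.reshape(n_blocks, block_size, -1)
    return np.abs(blocks).max(axis=1) * alpha / 7.0

w = np.array([[7.0, -3.5], [-14.0, 1.0], [2.1, 0.7], [-0.7, -2.8]])
s = find_scales_sketch(w, block_size=2)
```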
- get_act_scale(x)
Get scale tensors for inputs.
- get_act_to_weight_map_and_act_to_wa_pack_map(wa_pack)
Method to return subgraph related maps based on activation-name as key.
Returns two maps: (a) activation name to the input node's weight dimensions, and (b) activation name to the wa_pack indices that share that activation name.
- Parameters:
wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –
- get_scale(x_max, w_max, alpha, reduce_across_tp=False)
Get AWQ lite scales as described by ‘s’ in the paper.
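In the AWQ paper, the per-channel scale interpolates between activation and weight magnitudes via the exponent alpha, roughly s = x_max^alpha / w_max^(1-alpha). A rough NumPy sketch follows; the normalization step is a common choice in AWQ implementations, not confirmed for this one:

```python
import numpy as np

def get_scale_sketch(x_max, w_max, alpha):
    """AWQ-style per-channel scale: balance activation magnitude against
    weight magnitude via the interpolation exponent alpha in [0, 1]."""
    s = np.power(x_max, alpha) / np.power(w_max, 1.0 - alpha)
    # Normalize so scales stay centered around 1 (a common AWQ choice).
    return s / np.sqrt(s.max() * s.min())

x_max = np.array([4.0, 1.0, 9.0])   # per-channel activation maxima
w_max = np.array([1.0, 4.0, 1.0])   # per-channel weight maxima
s = get_scale_sketch(x_max, w_max, alpha=0.5)
```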
- get_weight_scale(weight, block_size=None)
Get scale tensors for weights.
- get_x_w_mean_for_subgraph(wa_pack, wa_pack_idx_list, augmented_onnx_path, x, block_size)
This method returns x-mean and w-mean.
- Parameters:
wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –
- quant_tensor(w, block_size, alpha=1.0, use_zero_point=False)
Quantizes a tensor using the given block size and alpha, and returns the quantized tensor.
- Parameters:
w (ndarray) –
block_size (int) –
alpha (float) –
use_zero_point (bool) –
- quantize(onnx_path, calibration_method='awq_clip', calibration_data_reader=None, calibration_eps=['cuda:0', 'cpu'], use_external_data_format=True, use_zero_point=False, block_size=None, nodes_to_exclude=['/lm_head'], **kwargs)
Applies INT4 WoQ (Weight-Only-Quantization) to an ONNX file.
Currently only GEMM quantization is supported.
- use_zero_point:
Use zero-point based quantization, if True.
- block_size:
Block size parameter for int4 quantization.
- kwargs:
Additional keyword arguments for INT4 quantization, including:
- awqlite_alpha_step (float): Alpha step for AWQ lite, in the range [0, 1].
- awqclip_alpha_step (float): Step size used to search for the best alpha in AWQ clip.
- awqclip_alpha_min (float): Minimum alpha for AWQ clip; the search range is [awqclip_alpha_min, 1].
- awqclip_bsz_col (int): Batch size for processing the column dimension in AWQ clip.
- Parameters:
onnx_path (str | ModelProto) –
calibration_method (str) –
calibration_data_reader (CalibrationDataReader) –
calibration_eps (List[str]) –
use_external_data_format (bool) –
use_zero_point (bool) –
block_size (int | None) –
nodes_to_exclude (List[str] | None) –
kwargs (Any) –
- Return type:
ModelProto
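A usage sketch for `quantize`. The import path, the `"rtn"` calibration method name, and the file names are all assumptions made for illustration; RTN is shown because it needs no calibration data reader.

```python
# Hypothetical usage sketch; import path and method name "rtn" are assumed.
import onnx
from modelopt.onnx.quantization.int4 import quantize

quantized = quantize(
    "model.onnx",
    calibration_method="rtn",       # assumed; no calibration data needed for RTN
    block_size=128,
    nodes_to_exclude=["/lm_head"],  # documented default exclusion
)
onnx.save(quantized, "model_int4.onnx")
```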
- quantize_rtn(onnx_model, gemm_io_type, block_size, dq_only=False)
Quantizes onnx_model using the RTN (Round-to-Nearest) algorithm.
This algorithm computes a scale factor for each block as s = max(abs(block)) / 8. The quantized weights are computed via Q(w) = round_to_even(w / s), where round_to_even rounds ties to the nearest even integer (e.g., both 1.5 and 2.5 round to 2).
Blocking is always performed over the first dimension (0), because we must block over the Cin dimension, and in ONNX, weights are always plugged into the RHS (i.e., y = x @ W).
- Parameters:
onnx_model (ModelProto) –
gemm_io_type (onnx.TensorProto.DataType) –
block_size (int) –
dq_only (bool) –
- Return type:
ModelProto
- rtn(w, s, block_size, zp=None)
Quantizes w with scale factors s via Round-to-Nearest.
Ties are broken by rounding to the nearest even number.
- Parameters:
w (ndarray) –
s (ndarray) –
block_size (int) –
zp (ndarray) –
- Return type:
ndarray
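NumPy's `np.round` already implements round-half-to-even, matching the tie-breaking described above, so a blockwise RTN sketch is short (block layout assumed as in the earlier sketches):

```python
import numpy as np

def rtn_sketch(w, s, block_size):
    """Quantize w blockwise to signed INT4 with round-to-nearest-even."""
    n_blocks = w.shape[0] // block_size
    blocks = w.reshape(n_blocks, block_size, -1)
    q = np.round(blocks / s[:, None])          # np.round rounds ties to even
    return np.clip(q, -8, 7).reshape(w.shape).astype(np.int8)

w = np.array([[1.5, -2.5], [0.5, 3.5], [10.0, -0.25], [0.75, 0.25]])
s = np.array([[1.0, 1.0], [1.0, 0.5]])
q = rtn_sketch(w, s, block_size=2)
```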
- run_awq_scale_search_per_node(wa_pack, augmented_onnx_path, block_size, use_zero_point, session, awq_lite, inputs, tqdm_msg_append_str, enable_weight_clipping, enable_fast_path_using_high_sysram, output_data, clip_alphas, **kwargs)
Method that iterates over each quantizable node for scale search.
- Parameters:
wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –
kwargs (Any) –
- run_awq_scale_search_per_subgraph(wa_pack, augmented_onnx_path, block_size, use_zero_point, session, awq_lite, inputs, tqdm_msg_append_str, **kwargs)
Method that iterates over each quantizable subgraph/siblings for scale search.
- Parameters:
wa_pack (List[Tuple[Tensor, Tensor, bool, int]]) –
kwargs (Any) –