quant_utils
Provides some basic utilities that can be used in quantize() methods.
Functions
- compute_e8m0: Computes the e8m0 value for the weight tensor.
- dq_tensor: Dequantizes w with scale factors s.
- find_scales: Finds scale factors for w via s = max(w.block(block_size)) / 7.
- get_amax: Returns the amax of the weight tensor along the specified axis for a given block size.
- get_num_bits: Determines the number of bits for quantization from precision_info.
- get_weights_scaling_factor: Returns the quantized per-block weight scaling factor.
- get_weights_scaling_factor_2: Returns the per-tensor weight scaling factor.
- pack_float32_to_4bit_cpp_based: Converts an array of float32 values to a 4-bit data type, packing every two consecutive elements into a byte.
- pack_float32_to_4bit_optimized: Converts an array of float32 values to a 4-bit data type, packing every two consecutive elements into a byte.
- pack_weights_to_int4: Converts ONNX model weights from high precision to INT4 precision.
- quant_tensor: Quantizes a tensor and returns the quantized weights, scales, and zero points.
- quantize: Converts a tensor to a quantized format based on NVFP4 quantization.
- reshape_scales_for_per_channel_nodes: Updates the scale map for per-channel nodes.
- rtn: Quantizes w with scale factors s via Round-to-Nearest.
- update_block_size: Updates the block size for quantization.
- compute_e8m0(amax, weight_shape, quant_axis, block_size)
Computes the e8m0 value for the weight tensor.
- Parameters:
amax (ndarray) – The amax of the weight tensor.
weight_shape (tuple[int, ...]) – The shape of the weight tensor.
quant_axis (int) – The axis along which to compute the e8m0 value.
block_size (int) – The block size.
- Returns:
The e8m0 value for the weight tensor.
- Return type:
ndarray
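As a rough illustration, an E8M0 value is just a biased power-of-two exponent (8 exponent bits, no mantissa, bias 127, as in the OCP MX formats). A minimal sketch, assuming the per-block amax has already been computed and skipping the weight reshaping that compute_e8m0 performs; the helper name is hypothetical:

```python
import numpy as np

def e8m0_from_amax(amax):
    """Hypothetical helper: encode each amax as a biased power-of-two
    exponent (E8M0: 8 exponent bits, no mantissa, bias 127)."""
    # Round the exponent up so the scale covers the full amax,
    # and clamp tiny values to the smallest representable exponent.
    exp = np.ceil(np.log2(np.maximum(amax, 2.0 ** -127)))
    return np.clip(exp + 127, 0, 255).astype(np.uint8)
```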
- dq_tensor(w, s, block_size, quantize_axis=0, zp=None)
Dequantizes w with scale factors s.
- Parameters:
w (ndarray)
s (ndarray)
block_size (int)
quantize_axis (int)
zp (ndarray)
- Return type:
ndarray
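A minimal sketch of block-wise dequantization, assuming a 2-D weight with scales laid out one row per block along axis 0 (names and layout here are illustrative, not the actual implementation):

```python
import numpy as np

def dq_tensor_sketch(w, s, block_size, zp=None):
    """Expand per-block scales back to element granularity along axis 0,
    then invert the quantization: w_dq = (w_q - zp) * s."""
    s_full = np.repeat(s, block_size, axis=0)
    if zp is not None:
        w = w - np.repeat(zp, block_size, axis=0)
    return w * s_full
```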
- find_scales(w, block_size, quantize_axis=0, alpha=1.0, use_zero_point=False, num_bits=4)
Find scale factors for w via s = max(w.block(block_size)) / 7.
- Parameters:
w (ndarray)
block_size (int)
quantize_axis (int)
alpha (float)
use_zero_point (bool)
num_bits (int)
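The "/ 7" in the summary is the largest positive signed 4-bit value (2^(num_bits-1) - 1). A sketch of the symmetric case for a 2-D weight blocked along axis 0, illustrative only:

```python
import numpy as np

def find_scales_sketch(w, block_size, alpha=1.0, num_bits=4):
    """Per-block symmetric scales: amax of each block divided by the
    largest positive quantized value (7 for signed 4-bit)."""
    qmax = 2 ** (num_bits - 1) - 1
    rows, cols = w.shape
    # Group rows into blocks of block_size, take the per-block amax.
    blocks = np.abs(w.reshape(rows // block_size, block_size, cols))
    return alpha * blocks.max(axis=1) / qmax
```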
- get_amax(weight, quant_axis, block_size)
Returns the amax of the weight tensor along the specified axis for a given block size.
Only 2D and 3D tensors are supported.
- Parameters:
weight (ndarray) – The weight tensor.
quant_axis (int) – The axis to quantize.
block_size (int) – The block size.
- Returns:
The amax of the weight tensor.
- Return type:
ndarray
- get_num_bits(precision_info=None, name=None)
Determine the number of bits for quantization from precision_info.
- Parameters:
precision_info (dict[str, int] | None) – Optional dictionary mapping tensor names to number of bits.
name (str | None) – Name of the tensor.
- Returns:
Number of bits to use for quantization. Defaults to 4 if not specified.
- Return type:
int
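The lookup itself is simple; a sketch of the documented behavior (fall back to 4 bits when the tensor is not listed):

```python
def get_num_bits_sketch(precision_info=None, name=None):
    """Return the configured per-tensor bit width, defaulting to 4."""
    if precision_info is not None and name in precision_info:
        return precision_info[name]
    return 4
```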
- get_weights_scaling_factor(input, block_size, weights_scaling_factor_2)
Returns quantized per block weight scaling factor.
- Parameters:
input (ndarray)
block_size (int)
weights_scaling_factor_2 (float32)
- get_weights_scaling_factor_2(input)
Returns per tensor weight scaling factor.
- Parameters:
input (ndarray)
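For NVFP4, a common convention (an assumption here, not confirmed by this page) is that the per-tensor factor maps the global amax into the range representable by an FP8-stored per-block scale times the FP4 maximum, i.e. amax / (448 * 6):

```python
import numpy as np

def per_tensor_scale_sketch(w):
    """Hypothetical NVFP4 per-tensor factor: global amax divided by
    (FP8 E4M3 max = 448) * (FP4 E2M1 max = 6)."""
    return np.float32(np.abs(w).max() / (448.0 * 6.0))
```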
- pack_float32_to_4bit_cpp_based(array, signed)
Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into a byte.
This is an optimized version of the pack_float32_to_4bit() utility in the ONNX helper file. The main optimization is implementing the round-and-pack logic in C++, which should be faster.
- Parameters:
array (ndarray | Sequence) – array of float to convert and pack
signed (bool) – Whether the 4 bit variant is signed or unsigned
- Returns:
Packed array with size ceil(array.size/2) (single dimension).
- Return type:
ndarray
- pack_float32_to_4bit_optimized(array, signed)
Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into a byte.
This is an optimized version of the pack_float32_to_4bit() utility in the ONNX helper file. The optimizations mainly move common code out of the per-element function calls and loops, making it per-input-array instead of per-input-element; the remaining logic is largely unchanged.
- Parameters:
array (ndarray | Sequence) – array of float to convert and pack
signed (bool) – Whether the 4 bit variant is signed or unsigned
- Returns:
Packed array with size ceil(array.size/2) (single dimension).
- Return type:
ndarray
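Both pack_float32_to_4bit variants implement the same round-and-pack contract; a pure-NumPy sketch of that contract (not the C++ or optimized code paths):

```python
import numpy as np

def pack_4bit_sketch(array, signed):
    """Round each float to the nearest 4-bit integer, clip to range,
    and pack pairs of nibbles into bytes (low nibble first)."""
    arr = np.asarray(array, dtype=np.float32).ravel()
    lo, hi = (-8, 7) if signed else (0, 15)
    q = np.clip(np.rint(arr), lo, hi).astype(np.int8)
    if q.size % 2:                      # pad odd-length input with 0
        q = np.append(q, np.int8(0))
    nibbles = q.astype(np.uint8) & 0x0F
    return nibbles[0::2] | (nibbles[1::2] << 4)
```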
- pack_weights_to_int4(weight)
Converts ONNX model weights from high precision to INT4 precision.
- Parameters:
weight (ndarray)
- Return type:
ndarray
- quant_tensor(w, block_size, quantize_axis=0, alpha=1.0, use_zero_point=False, num_bits=4)
Quantize a tensor using the given block size, alpha, bit width, and optional zero point, and return the quantized tensor.
- Returns:
- A tuple containing:
wq: The quantized weight tensor (np.ndarray)
scale: The scale factors used for quantization (np.ndarray)
zp: The zero-point values (np.ndarray or None if not using zero-point)
- Return type:
tuple
- Parameters:
w (ndarray)
block_size (int)
quantize_axis (int)
alpha (float)
use_zero_point (bool)
num_bits (int)
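Tying scale search and rounding together, a symmetric end-to-end sketch for a 2-D weight blocked along axis 0 (no zero point; assumes no all-zero blocks so the scale is never zero):

```python
import numpy as np

def quant_tensor_sketch(w, block_size, num_bits=4):
    """Compute per-block symmetric scales, then round-to-nearest quantize."""
    qmax = 2 ** (num_bits - 1) - 1
    rows, cols = w.shape
    amax = np.abs(w.reshape(rows // block_size, block_size, cols)).max(axis=1)
    scale = amax / qmax
    wq = np.clip(np.rint(w / np.repeat(scale, block_size, axis=0)),
                 -qmax - 1, qmax)
    return wq, scale, None  # zp is None for symmetric quantization
```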
- quantize(input, block_size, weights_scaling_factor, weights_scaling_factor_2)
Converts a tensor to a quantized format based on NVFP4 quantization.
- Parameters:
input (ndarray)
block_size (int)
weights_scaling_factor (ndarray)
weights_scaling_factor_2 (ndarray)
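A sketch of the two-level scaling idea behind NVFP4 (illustrative only; a real kernel would also round and cast the quotient to FP4):

```python
import numpy as np

def nvfp4_quantize_sketch(x, block_size, block_scale, scale_2):
    """Effective per-element scale is the per-block scale times the
    per-tensor factor; the quotient would then be cast to FP4."""
    eff = np.repeat(block_scale, block_size, axis=0) * scale_2
    return x / eff
```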
- reshape_scales_for_per_channel_nodes(scales_map, block_size, precision_info=None)
Update the scale map for per-channel nodes. For per-channel quantization the scale needs to be 1-D.
- Parameters:
scales_map (dict[str, ndarray])
block_size (int)
precision_info (dict[str, int] | None)
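A sketch of the per-channel case, assuming block_size == -1 signals per-channel quantization (as documented for update_block_size) so each scale is flattened to 1-D:

```python
import numpy as np

def reshape_per_channel_sketch(scales_map, block_size):
    """Flatten every scale to 1-D when quantizing per channel; ONNX
    DequantizeLinear expects a 1-D scale in per-axis mode."""
    if block_size == -1:  # -1 means per-channel quantization
        return {name: s.reshape(-1) for name, s in scales_map.items()}
    return scales_map
```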
- rtn(w, s, block_size, quantize_axis=0, zp=None, num_bits=4)
Quantizes w with scale factors s via Round-to-Nearest.
Ties are broken by rounding to the nearest even number.
- Parameters:
w (ndarray)
s (ndarray)
block_size (int)
quantize_axis (int)
zp (ndarray)
num_bits (int)
- Return type:
ndarray
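The ties-to-even behavior matches NumPy's np.rint. A sketch for a 2-D weight blocked along axis 0 (layout is an assumption):

```python
import numpy as np

def rtn_sketch(w, s, block_size, zp=None, num_bits=4):
    """Round-to-nearest quantization; np.rint breaks ties to even."""
    qmax = 2 ** (num_bits - 1) - 1
    q = np.rint(w / np.repeat(s, block_size, axis=0))
    if zp is not None:
        q = q + np.repeat(zp, block_size, axis=0)
    return np.clip(q, -qmax - 1, qmax)
```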
- update_block_size(num_bits, block_size, quantize_axis=0, w=None)
Update the block size for quantization.
- Parameters:
num_bits (int) – Number of bits for quantization.
block_size (int) – Current block size. If -1, per-channel quantization is used.
quantize_axis (int) – Axis along which to quantize.
w (ndarray) – Weight tensor to be quantized.
- Returns:
Updated block size.
- Return type:
int