quant_utils
Provides basic utilities for use in quantize() methods.
Functions
get_weights_scaling_factor | Returns the quantized per-block weight scaling factor.
get_weights_scaling_factor_2 | Returns the per-tensor weight scaling factor.
pack_float32_to_4bit_cpp_based | Converts an array of float32 values to a 4-bit data type and packs every two consecutive elements into one byte.
pack_float32_to_4bit_optimized | Converts an array of float32 values to a 4-bit data type and packs every two consecutive elements into one byte.
quantize | Converts a tensor to a quantized format based on NVFP4 quantization.
- get_weights_scaling_factor(input, block_size, weights_scaling_factor_2)
Returns the quantized per-block weight scaling factor.
- Parameters:
input (ndarray) –
block_size (int) –
weights_scaling_factor_2 (float32) –
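The entry above does not spell out how the per-block factor is computed. The following is a minimal sketch under stated assumptions, not the library's implementation: it assumes a 2-D weight matrix, the common NVFP4 convention that FP4 E2M1 values have a maximum magnitude of 6, and that each block's scale is expressed relative to the per-tensor scale. The helper name `per_block_scale_sketch` is hypothetical, and the final cast of the scale to a low-precision format is omitted.

```python
import numpy as np

E2M1_MAX = 6.0  # assumed: largest magnitude representable in FP4 E2M1


def per_block_scale_sketch(weights, block_size, per_tensor_scale):
    """Hypothetical helper: per-block scale relative to a per-tensor scale."""
    rows, cols = weights.shape
    if block_size <= 0 or cols % block_size != 0:
        raise ValueError("block_size must be positive and divide the last dimension")
    # Group the last dimension into blocks of `block_size` elements
    blocks = weights.reshape(rows, cols // block_size, block_size)
    block_amax = np.max(np.abs(blocks), axis=-1)
    # Map each block's largest magnitude onto the FP4 range, then
    # express the result relative to the per-tensor scale
    scaled = (block_amax / E2M1_MAX) / per_tensor_scale
    # All-zero blocks get a neutral scale so later divisions stay finite
    scaled[block_amax == 0] = 1.0
    return scaled


w = np.array([[0.5, -3.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]], dtype=np.float32)
scales = per_block_scale_sketch(w, block_size=2, per_tensor_scale=0.25)
print(scales.shape)  # (2, 2): one scale per block of two elements
```

With `block_size=2`, the first block of the first row has a maximum magnitude of 3.0, which gives 3.0 / 6.0 / 0.25 = 2.0 under these assumptions.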
- get_weights_scaling_factor_2(input)
Returns the per-tensor weight scaling factor.
- Parameters:
input (ndarray) –
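As an illustration of what a per-tensor factor can look like, here is a minimal sketch. It assumes the common NVFP4 convention in which the global scale maps the tensor's largest magnitude onto the product of the FP4 E2M1 maximum (6) and the FP8 E4M3 maximum (448); the library's exact formula is not stated in this section, and `per_tensor_scale_sketch` is a hypothetical name.

```python
import numpy as np

E2M1_MAX = 6.0    # assumed: largest FP4 E2M1 magnitude
E4M3_MAX = 448.0  # assumed: largest FP8 E4M3 magnitude


def per_tensor_scale_sketch(weights):
    """Hypothetical helper: one global scale derived from the tensor's amax."""
    return np.max(np.abs(weights)) / (E2M1_MAX * E4M3_MAX)


w = np.array([[0.5, -3.0], [1.0, 0.0]], dtype=np.float32)
tensor_scale = per_tensor_scale_sketch(w)
print(float(tensor_scale))  # roughly 3.0 / 2688
```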
- pack_float32_to_4bit_cpp_based(array, signed)
Converts an array of float32 values to a 4-bit data type and packs every two consecutive elements into one byte.
This is an optimized version of the pack_float32_to_4bit() utility in the ONNX helper file. The main optimization is implementing the round_and_pack logic in C++, which is expected to be faster.
- Parameters:
array (ndarray | Sequence) – Array of floats to convert and pack
signed (bool) – Whether the 4-bit variant is signed or unsigned
- Returns:
One-dimensional packed array of size ceil(array.size / 2).
- Return type:
ndarray
- pack_float32_to_4bit_optimized(array, signed)
Converts an array of float32 values to a 4-bit data type and packs every two consecutive elements into one byte.
This is an optimized version of the pack_float32_to_4bit() utility in the ONNX helper file. The optimizations mainly move common code out of per-element function calls and loops, so that it runs once per input array rather than once per input element. The remaining logic is largely unchanged.
- Parameters:
array (ndarray | Sequence) – Array of floats to convert and pack
signed (bool) – Whether the 4-bit variant is signed or unsigned
- Returns:
One-dimensional packed array of size ceil(array.size / 2).
- Return type:
ndarray
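The per-array approach described for this function can be sketched as follows. This is an illustration under assumptions, not the library's code: it assumes values are rounded to the nearest integer and clipped to [-8, 7] (signed) or [0, 15] (unsigned), that the first element of each pair occupies the low nibble, and that odd-length input is padded with a zero. The name `pack_4bit_sketch` is hypothetical.

```python
import numpy as np


def pack_4bit_sketch(array, signed):
    """Hypothetical vectorized packer: two 4-bit values per output byte."""
    flat = np.asarray(array, dtype=np.float32).ravel()
    if flat.size % 2 == 1:
        # Pad odd-length input so elements can be paired
        flat = np.append(flat, np.float32(0))
    lo, hi = (-8, 7) if signed else (0, 15)
    # Round and clip once for the whole array rather than once per element
    ints = np.clip(np.rint(flat), lo, hi).astype(np.int8)
    # Keep the low 4 bits (two's complement for negative values)
    nibbles = ints.astype(np.uint8) & 0x0F
    low, high = nibbles[0::2], nibbles[1::2]
    # Assumed layout: first element in the low nibble, second in the high nibble
    return (high << 4) | low


packed_signed = pack_4bit_sketch([1.0, 2.0, -1.0], signed=True)
packed_unsigned = pack_4bit_sketch([15.7, 3.2], signed=False)
print(packed_signed.tolist())    # [33, 15], i.e. 0x21 and 0x0F
print(packed_unsigned.tolist())  # [63], i.e. 0x3F
```

Three input elements produce two output bytes, which matches the ceil(array.size / 2) size given in the Returns field.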
- quantize(input, block_size, weights_scaling_factor, weights_scaling_factor_2)
Converts a tensor to a quantized format based on NVFP4 quantization.
- Parameters:
input (ndarray) –
block_size (int) –
weights_scaling_factor (ndarray) –
weights_scaling_factor_2 (ndarray) –
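To show how the block size and the two scaling factors could fit together, here is a minimal sketch of block-wise FP4 rounding. It is an assumption-laden illustration, not the library's quantize(): it assumes a 2-D input, that the effective scale of each block is the product of its per-block scale and the per-tensor scale, and that each scaled value is snapped to the nearest FP4 E2M1 magnitude. Ties go to the smaller magnitude here, which may differ from a hardware rounding mode, and the output is left as float32 rather than packed into 4-bit storage. `nvfp4_quantize_sketch` is a hypothetical name.

```python
import numpy as np

# Magnitudes representable in FP4 E2M1
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)


def nvfp4_quantize_sketch(weights, block_size, block_scale, per_tensor_scale):
    """Hypothetical helper: scale each block, then snap to E2M1 values."""
    rows, cols = weights.shape
    blocks = weights.reshape(rows, cols // block_size, block_size)
    # Assumed: effective scale = per-block scale * per-tensor scale
    scale = (block_scale * per_tensor_scale)[..., np.newaxis]
    scaled = blocks / scale
    # Distance from each magnitude to every representable value
    dist = np.abs(np.abs(scaled)[..., np.newaxis] - E2M1_VALUES)
    nearest = E2M1_VALUES[np.argmin(dist, axis=-1)]
    # Restore the sign and the original 2-D shape
    return (np.sign(scaled) * nearest).reshape(rows, cols)


w = np.array([[0.5, -3.0, 1.0, 0.0]], dtype=np.float32)
block_scale = np.array([[2.0, 1.0]], dtype=np.float32)  # illustrative values
q = nvfp4_quantize_sketch(w, block_size=2, block_scale=block_scale,
                          per_tensor_scale=np.float32(0.25))
print(q.tolist())  # [[1.0, -6.0, 4.0, 0.0]]
```

Multiplying each quantized value by its block's effective scale (0.5 for the first block, 0.25 for the second) recovers the original weights in this example, because every scaled value lands exactly on a representable magnitude.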