quant_utils

Provides some basic utilities that can be used in quantize() methods.

Functions

compute_e8m0

Computes the e8m0 value for the weight tensor.

dq_tensor

Dequantizes w with scale factors s.

find_scales

Find scale factors for w via s = max(w.block(block_size)) / 7.

get_amax

Returns the amax of the weight tensor along the specified axis for a given block size.

get_layer_axis

Get the quantization axis for a specific layer from layer_info.

get_layer_block_size

Get the block size for a specific layer from layer_info.

get_num_bits

Determine the number of bits to use for quantization from layer_info.

get_weights_scaling_factor

Returns the quantized per-block weight scaling factor.

get_weights_scaling_factor_2

Returns the per-tensor weight scaling factor.

pack_float32_to_4bit_cpp_based

Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into a byte.

pack_float32_to_4bit_optimized

Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into a byte.

pack_weights_to_int4

Converts ONNX model weights from high precision to INT4 precision.

quant_tensor

Quantize a tensor using the given block size, alpha, and bit width.

quantize

Converts a tensor to a quantized format based on NVFP4 quantization.

reshape_scales_for_per_channel_nodes

Update the scale map for per-channel nodes.

rtn

Quantizes w with scale factors s via Round-to-Nearest.

update_block_size

Update the block size for quantization.

compute_e8m0(amax, weight_shape, quant_axis, block_size)

Computes the e8m0 value for the weight tensor.

Parameters:
  • amax (ndarray) – The amax of the weight tensor.

  • weight_shape (tuple[int, ...]) – The shape of the weight tensor.

  • quant_axis (int) – The axis to compute the e8m0 value.

  • block_size (int) – The block size.

Returns:

The e8m0 value for the weight tensor.

Return type:

ndarray
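The e8m0 encoding stores a power-of-two scale as a biased 8-bit exponent (bias 127, no mantissa). A minimal sketch of the per-block exponent computation, assuming the scale maps each block's amax onto a destination-format maximum; the choice of 6.0 (the E2M1 maximum) and the helper name are illustrative assumptions:

    import numpy as np

    def compute_e8m0_sketch(amax, dest_max=6.0):
        # Pick the power-of-two exponent e with amax / 2**e <= dest_max,
        # then store it biased by 127 as an unsigned byte (e8m0).
        safe = np.maximum(amax, np.finfo(np.float32).tiny)
        e = np.ceil(np.log2(safe / dest_max))
        return np.clip(e + 127.0, 0.0, 255.0).astype(np.uint8)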

dq_tensor(w, s, block_size, quantize_axis=0, zp=None)

Dequantizes w with scale factors s.

Parameters:
  • w (ndarray)

  • s (ndarray)

  • block_size (int)

  • quantize_axis (int)

  • zp (ndarray)

Return type:

ndarray
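A minimal sketch of the block dequantization, assuming a 2D tensor quantized along axis 0 in the symmetric (zp=None) case; dq_sketch is a hypothetical name:

    import numpy as np

    def dq_sketch(wq, s, block_size):
        # wq: (k, n) quantized values; s: (k // block_size, n) block scales.
        # Each scale expands over `block_size` consecutive rows.
        k, n = wq.shape
        blocks = wq.astype(np.float32).reshape(k // block_size, block_size, n)
        return (blocks * s[:, None, :]).reshape(k, n)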

find_scales(w, block_size, quantize_axis=0, alpha=1.0, use_zero_point=False, num_bits=4)

Find scale factors for w via s = max(w.block(block_size)) / 7, where 7 is the maximum of the default signed 4-bit range (2^(num_bits - 1) - 1 in general).

Parameters:
  • w (ndarray)

  • block_size (int)

  • quantize_axis (int)

  • alpha (float)

  • use_zero_point (bool)

  • num_bits (int)
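A minimal sketch of the scale search for the symmetric (use_zero_point=False) case on a 2D tensor with quantize_axis=0; find_scales_sketch is a hypothetical name:

    import numpy as np

    def find_scales_sketch(w, block_size, alpha=1.0, num_bits=4):
        # One scale per block of `block_size` rows, mapping each block's
        # amax onto the signed integer maximum (7 in the 4-bit case).
        q_max = 2 ** (num_bits - 1) - 1
        k, n = w.shape
        amax = np.abs(w.reshape(k // block_size, block_size, n)).max(axis=1)
        return amax * alpha / q_max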

get_amax(weight, quant_axis, block_size)

Returns the amax of the weight tensor along the specified axis for a given block size.

Only 2D and 3D tensors are supported.

Parameters:
  • weight (ndarray) – The weight tensor.

  • quant_axis (int) – The axis to quantize.

  • block_size (int) – The block size.

Returns:

The amax of the weight tensor.

Return type:

ndarray
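For the 2D case, the blockwise amax is equivalent to the following worked example:

    import numpy as np

    w = np.array([[1.0, -4.0],
                  [-3.0, 2.0],
                  [0.5, -0.5],
                  [8.0, 1.0]], dtype=np.float32)
    # Blockwise amax along axis 0 with block_size=2: one entry per block
    # of two rows in each column.
    print(np.abs(w.reshape(2, 2, 2)).max(axis=1))   # [[3. 4.]
                                                    #  [8. 1.]]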

get_layer_axis(layer_info=None, name=None, default_axis=None)

Get the quantization axis for a specific layer from layer_info.

Parameters:
  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to layer configuration.

  • name (str | None) – Name of the tensor.

  • default_axis (int | None) – Default axis if not specified. Defaults to None.

Returns:

Quantization axis to use.

Return type:

int

get_layer_block_size(layer_info=None, name=None, default_block_size=None)

Get the block size for a specific layer from layer_info.

Parameters:
  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to layer configuration.

  • name (str | None) – Name of the tensor.

  • default_block_size (int | None) – Default block size if not specified. Defaults to None.

Returns:

Block size to use for quantization.

Return type:

int

get_num_bits(layer_info=None, name=None)

Determine the number of bits to use for quantization from layer_info.

Parameters:
  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to layer configuration dict.

  • name (str | None) – Name of the tensor.

Returns:

Number of bits to use for quantization. Defaults to 4 if not specified.

Return type:

int
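The three lookups above share the layer_info layout; a hypothetical example of the mapping and the fallback behavior (the key names "axis", "block_size", and "num_bits" are assumptions for illustration):

    # Hypothetical per-layer overrides; key names are illustrative.
    layer_info = {
        "decoder.layers.0.fc1.weight": {"axis": 1, "block_size": 64, "num_bits": 8},
    }

    # Conceptually, each lookup falls back to a default when the tensor
    # is absent from layer_info:
    name = "decoder.layers.0.fc1.weight"
    axis = layer_info.get(name, {}).get("axis", 0)                # -> 1
    block_size = layer_info.get(name, {}).get("block_size", 128)  # -> 64
    num_bits = layer_info.get(name, {}).get("num_bits", 4)        # -> 8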

get_weights_scaling_factor(input, block_size, weights_scaling_factor_2)

Returns the quantized per-block weight scaling factor.

Parameters:
  • input (ndarray)

  • block_size (int)

  • weights_scaling_factor_2 (float32)
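A minimal sketch of a per-block scaling factor along the last axis, expressed relative to the per-tensor factor; the E2M1 maximum of 6.0 and this exact composition follow common NVFP4 recipes and are assumptions here:

    import numpy as np

    def per_block_scale_sketch(w, block_size, scale_2):
        # One scale per block of `block_size` elements along the last axis;
        # 6.0 is the E2M1 (FP4) maximum, an assumed constant.
        blocks = w.reshape(*w.shape[:-1], -1, block_size)
        amax = np.abs(blocks).max(axis=-1)
        return (amax / 6.0) / scale_2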

get_weights_scaling_factor_2(input)

Returns the per-tensor weight scaling factor.

Parameters:

input (ndarray)
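A minimal sketch of a per-tensor factor; the constants 448 (E4M3 maximum) and 6 (E2M1 maximum) follow common NVFP4 recipes and are assumptions, not confirmed by this reference:

    import numpy as np

    def per_tensor_scale_sketch(x):
        # Map the global amax onto the product of the per-block scale
        # range (E4M3 max 448) and the FP4 value range (E2M1 max 6).
        return np.float32(np.abs(x).max() / (448.0 * 6.0))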

pack_float32_to_4bit_cpp_based(array, signed)

Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into a byte.

This is an optimized version of the pack_float32_to_4bit() utility in the ONNX helper file. The main optimization is implementing the round-and-pack logic in C++, which is expected to be faster.

Parameters:
  • array (ndarray | Sequence) – array of float to convert and pack

  • signed (bool) – Whether the 4 bit variant is signed or unsigned

Returns:

Packed array with size ceil(array.size/2) (single dimension).

Return type:

ndarray

pack_float32_to_4bit_optimized(array, signed)

Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into a byte.

This is an optimized version of the pack_float32_to_4bit() utility in the ONNX helper file. The main optimization is moving common code out of per-element function calls and loops so it runs once per input array rather than once per element; the remaining logic is largely unchanged.

Parameters:
  • array (ndarray | Sequence) – array of float to convert and pack

  • signed (bool) – Whether the 4 bit variant is signed or unsigned

Returns:

Packed array with size ceil(array.size/2) (single dimension).

Return type:

ndarray
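Both packers implement the same round-and-pack contract. A pure-NumPy sketch of that behavior, assuming the low-nibble-first layout of the ONNX helper; pack_4bit_sketch is a hypothetical name:

    import numpy as np

    def pack_4bit_sketch(array, signed=True):
        # Round to the nearest 4-bit value, then pack element 2i into the
        # low nibble and element 2i+1 into the high nibble of byte i.
        lo, hi = (-8, 7) if signed else (0, 15)
        q = np.clip(np.rint(np.asarray(array, dtype=np.float32)), lo, hi)
        nibbles = q.astype(np.int8).ravel().astype(np.uint8) & 0x0F
        if nibbles.size % 2:                     # pad odd-length input
            nibbles = np.append(nibbles, np.uint8(0))
        return nibbles[0::2] | (nibbles[1::2] << 4)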

pack_weights_to_int4(weight)

Converts ONNX model weights from high precision to INT4 precision.

Parameters:

weight (ndarray)

Return type:

ndarray

quant_tensor(w, block_size, quantize_axis=0, alpha=1.0, use_zero_point=False, num_bits=4)

Quantize a tensor using the given block size, alpha, and bit width, and return the quantized tensor together with its scale factors and optional zero-points.

Parameters:
  • w (ndarray)

  • block_size (int)

  • quantize_axis (int)

  • alpha (float)

  • use_zero_point (bool)

  • num_bits (int)

Returns:

A tuple containing:
  • wq: The quantized weight tensor (np.ndarray)

  • scale: The scale factors used for quantization (np.ndarray)

  • zp: The zero-point values (np.ndarray, or None if zero-point is not used)

Return type:

tuple
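A hypothetical round trip built from the find_scales_sketch and dq_sketch examples above; composing them this way is an assumption about quant_tensor's behavior rather than a statement of it:

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 32)).astype(np.float32)

    s = find_scales_sketch(w, block_size=32)        # (2, 32) block scales
    blocks = w.reshape(2, 32, 32)
    wq = np.clip(np.rint(blocks / s[:, None, :]), -8, 7).reshape(64, 32)
    zp = None                                       # symmetric case

    w_dq = dq_sketch(wq, s, block_size=32)
    print(float(np.abs(w - w_dq).max()))            # quantization error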

quantize(input, block_size, weights_scaling_factor, weights_scaling_factor_2)

Converts a tensor to a quantized format based on NVFP4 quantization.

Parameters:
  • input (ndarray)

  • block_size (int)

  • weights_scaling_factor (ndarray)

  • weights_scaling_factor_2 (ndarray)
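The FP4 cast at the heart of NVFP4 quantization can be illustrated as rounding to the nearest E2M1 value after the scales are applied; the value set below is the standard E2M1 table, but the rounding details are an assumption:

    import numpy as np

    E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
                           dtype=np.float32)

    def cast_to_fp4_sketch(x):
        # Round each magnitude to the nearest representable E2M1 value,
        # keeping the sign; tie handling may differ from the library's.
        idx = np.abs(np.abs(x)[..., None] - E2M1_VALUES).argmin(axis=-1)
        return np.sign(x) * E2M1_VALUES[idx]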

reshape_scales_for_per_channel_nodes(scales_map, block_size, layer_info=None)

Update the scale map for per-channel nodes. For per-channel quantization the scale must be 1D.

Parameters:
  • scales_map (dict[str, np.ndarray]) – Dictionary mapping weight names to scale arrays.

  • block_size (int) – Block size used for quantization.

  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to layer configuration dict.

Returns:

Updated scales map.

Return type:

dict[str, np.ndarray]
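For example, a per-channel scale produced with a trailing singleton block dimension would be flattened to the 1D layout per-channel quantization requires (shapes here are illustrative):

    import numpy as np

    s = np.ones((128, 1), dtype=np.float32)   # one scale per output channel
    s_1d = s.reshape(-1)                      # shape (128,)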

rtn(w, s, block_size, quantize_axis=0, zp=None, num_bits=4)

Quantizes w with scale factors s via Round-to-Nearest.

Ties are broken by rounding to the nearest even number.

Parameters:
  • w (ndarray)

  • s (ndarray)

  • block_size (int)

  • quantize_axis (int)

  • zp (ndarray)

  • num_bits (int)

Return type:

ndarray
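np.rint implements exactly this round-half-to-even tie-breaking:

    import numpy as np

    # Ties at .5 round to the nearest even integer:
    print(np.rint(np.array([0.5, 1.5, 2.5, 3.5])))   # [0. 2. 2. 4.]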

update_block_size(block_size, layer_info=None, name=None, quantize_axis=0, w=None)

Update the block size for quantization.

Parameters:
  • block_size (int) – Current block size.

  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to layer configuration dict.

  • name (str | None) – Name of the tensor.

  • quantize_axis (int) – Axis along which to quantize.

  • w (np.ndarray) – Weight tensor to be quantized.

Returns:

Updated block size.

Return type:

int
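A hypothetical sketch of the kind of logic such a helper often implements; the per-layer override and the convention that a non-positive block size expands to the full axis length (per-channel) are assumptions, not confirmed by this reference:

    import numpy as np

    def update_block_size_sketch(block_size, layer_info=None, name=None,
                                 quantize_axis=0, w=None):
        # Per-layer override first; the "block_size" key name is assumed.
        if layer_info and name in layer_info:
            block_size = layer_info[name].get("block_size", block_size)
        # Assumed convention: -1/0 means the full axis (per-channel).
        if block_size in (-1, 0) and w is not None:
            block_size = w.shape[quantize_axis]
        return block_size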