quant_utils

Provides some basic utilities that can be used in quantize() methods.

Functions

compute_e8m0

Computes the e8m0 value for the weight tensor.

dq_tensor

Dequantizes w with scale factors s.

find_scales

Find scale factors for w via s = max(w.block(block_size)) / 7.

get_amax

Returns the amax of the weight tensor along the specified axis for a given block size.

get_layer_axis

Get the quantization axis for a specific layer from layer_info.

get_layer_block_size

Get the block size for a specific layer from layer_info.

get_num_bits

Determine the number of bits to use for quantization from layer_info.

get_weights_scaling_factor

Returns the quantized per-block weight scaling factor.

get_weights_scaling_factor_2

Returns the per-tensor weight scaling factor.

pack_float32_to_4bit_cpp_based

Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into a byte.

pack_float32_to_4bit_optimized

Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into a byte.

pack_weights_to_int4

Converts ONNX model weights from high precision to INT4 precision.

quant_tensor

Quantize a tensor using the given block size, alpha, and bit width.

quantize

Converts a tensor to a quantized format based on NVFP4 quantization.

reshape_scales_for_per_channel_nodes

Update the scale map for per-channel nodes.

rtn

Quantizes w with scale factors s via Round-to-Nearest.

update_block_size

Update the block size for quantization.

compute_e8m0(amax, weight_shape, quant_axis, block_size)

Computes the e8m0 value for the weight tensor.

Parameters:
  • amax (ndarray) – The amax of the weight tensor.

  • weight_shape (tuple[int, ...]) – The shape of the weight tensor.

  • quant_axis (int) – The axis to compute the e8m0 value.

  • block_size (int) – The block size.

Returns:

The e8m0 value for the weight tensor.

Return type:

ndarray
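The e8m0 encoding stores a power-of-two scale as a biased 8-bit exponent (bias 127, no mantissa). A minimal sketch of the per-block exponent computation, assuming the scale maps each block's amax onto a destination-format maximum; the choice of 6.0 (the E2M1 maximum) and the helper name are illustrative assumptions:

    import numpy as np

    def compute_e8m0_sketch(amax, dest_max=6.0):
        # Pick the power-of-two exponent e with amax / 2**e <= dest_max,
        # then store it biased by 127 as an unsigned byte (e8m0).
        safe = np.maximum(amax, np.finfo(np.float32).tiny)
        e = np.ceil(np.log2(safe / dest_max))
        return np.clip(e + 127.0, 0.0, 255.0).astype(np.uint8)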

dq_tensor(w, s, block_size, quantize_axis=0, zp=None)

Dequantizes w with scale factors s.

Parameters:
  • w (ndarray)

  • s (ndarray)

  • block_size (int)

  • quantize_axis (int)

  • zp (ndarray)

Return type:

ndarray
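A minimal sketch of the block dequantization, assuming a 2D tensor quantized along axis 0 in the symmetric (zp=None) case; dq_sketch is a hypothetical name:

    import numpy as np

    def dq_sketch(wq, s, block_size):
        # wq: (k, n) quantized values; s: (k // block_size, n) block scales.
        # Each scale expands over `block_size` consecutive rows.
        k, n = wq.shape
        blocks = wq.astype(np.float32).reshape(k // block_size, block_size, n)
        return (blocks * s[:, None, :]).reshape(k, n)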

find_scales(w, block_size, quantize_axis=0, alpha=1.0, use_zero_point=False, num_bits=4)

Find scale factors for w via s = max(w.block(block_size)) / 7, where 7 is the maximum of the default signed 4-bit range (2^(num_bits - 1) - 1 in general).

Parameters:
  • w (ndarray)

  • block_size (int)

  • quantize_axis (int)

  • alpha (float)

  • use_zero_point (bool)

  • num_bits (int)
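A minimal sketch of the scale search for the symmetric (use_zero_point=False) case on a 2D tensor with quantize_axis=0; find_scales_sketch is a hypothetical name:

    import numpy as np

    def find_scales_sketch(w, block_size, alpha=1.0, num_bits=4):
        # One scale per block of `block_size` rows, mapping each block's
        # amax onto the signed integer maximum (7 in the 4-bit case).
        q_max = 2 ** (num_bits - 1) - 1
        k, n = w.shape
        amax = np.abs(w.reshape(k // block_size, block_size, n)).max(axis=1)
        return amax * alpha / q_max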

get_amax(weight, quant_axis, block_size)

Returns the amax of the weight tensor along the specified axis for a given block size.

Only 2D and 3D tensors are supported.

Parameters:
  • weight (ndarray) – The weight tensor.

  • quant_axis (int) – The axis to quantize.

  • block_size (int) – The block size.

Returns:

The amax of the weight tensor.

Return type:

ndarray
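For the 2D case, the blockwise amax is equivalent to the following worked example:

    import numpy as np

    w = np.array([[1.0, -4.0],
                  [-3.0, 2.0],
                  [0.5, -0.5],
                  [8.0, 1.0]], dtype=np.float32)
    # Blockwise amax along axis 0 with block_size=2: one entry per block
    # of two rows in each column.
    print(np.abs(w.reshape(2, 2, 2)).max(axis=1))   # [[3. 4.]
                                                    #  [8. 1.]]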

get_layer_axis(layer_info=None, name=None, default_axis=None)

Get the quantization axis for a specific layer from layer_info.

Parameters:
  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to layer configuration.

  • name (str | None) – Name of the tensor.

  • default_axis (int | None) – Default axis if not specified. Defaults to None.

Returns:

Quantization axis to use.

Return type:

int

get_layer_block_size(layer_info=None, name=None, default_block_size=None)

Get the block size for a specific layer from layer_info.

Parameters:
  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to layer configuration.

  • name (str | None) – Name of the tensor.

  • default_block_size (int | None) – Default block size if not specified. Defaults to None.

Returns:

Block size to use for quantization.

Return type:

int

get_num_bits(layer_info=None, name=None)

Determine the number of bits to use for quantization from layer_info.

Parameters:
  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to layer configuration dict.

  • name (str | None) – Name of the tensor.

Returns:

Number of bits to use for quantization. Defaults to 4 if not specified.

Return type:

int
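The three lookups above share the layer_info layout; a hypothetical example of the mapping and the fallback behavior (the key names "axis", "block_size", and "num_bits" are assumptions for illustration):

    # Hypothetical per-layer overrides; key names are illustrative.
    layer_info = {
        "decoder.layers.0.fc1.weight": {"axis": 1, "block_size": 64, "num_bits": 8},
    }

    # Conceptually, each lookup falls back to a default when the tensor
    # is absent from layer_info:
    name = "decoder.layers.0.fc1.weight"
    axis = layer_info.get(name, {}).get("axis", 0)                # -> 1
    block_size = layer_info.get(name, {}).get("block_size", 128)  # -> 64
    num_bits = layer_info.get(name, {}).get("num_bits", 4)        # -> 8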

get_weights_scaling_factor(input, block_size, weights_scaling_factor_2)

Returns the quantized per-block weight scaling factor.

Parameters:
  • input (ndarray)

  • block_size (int)

  • weights_scaling_factor_2 (float32)
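A minimal sketch of a per-block scaling factor along the last axis, expressed relative to the per-tensor factor; the E2M1 maximum of 6.0 and this exact composition follow common NVFP4 recipes and are assumptions here:

    import numpy as np

    def per_block_scale_sketch(w, block_size, scale_2):
        # One scale per block of `block_size` elements along the last axis;
        # 6.0 is the E2M1 (FP4) maximum, an assumed constant.
        blocks = w.reshape(*w.shape[:-1], -1, block_size)
        amax = np.abs(blocks).max(axis=-1)
        return (amax / 6.0) / scale_2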

get_weights_scaling_factor_2(input)

Returns the per-tensor weight scaling factor.

Parameters:

input (ndarray)
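A minimal sketch of a per-tensor factor; the constants 448 (E4M3 maximum) and 6 (E2M1 maximum) follow common NVFP4 recipes and are assumptions, not confirmed by this reference:

    import numpy as np

    def per_tensor_scale_sketch(x):
        # Map the global amax onto the product of the per-block scale
        # range (E4M3 max 448) and the FP4 value range (E2M1 max 6).
        return np.float32(np.abs(x).max() / (448.0 * 6.0))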

pack_float32_to_4bit_cpp_based(array, signed)

Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into a byte.

This is an optimized version of the pack_float32_to_4bit() utility in the ONNX helper file. The main optimization is implementing the round-and-pack logic in C++, which is expected to be faster.

Parameters:
  • array (ndarray | Sequence) – array of float to convert and pack

  • signed (bool) – Whether the 4 bit variant is signed or unsigned

Returns:

Packed array with size ceil(array.size/2) (single dimension).

Return type:

ndarray

pack_float32_to_4bit_optimized(array, signed)

Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into a byte.

This is an optimized version of the pack_float32_to_4bit() utility in the ONNX helper file. The main optimization is moving common code out of per-element function calls and loops so it runs once per input array rather than once per element; the remaining logic is largely unchanged.

Parameters:
  • array (ndarray | Sequence) – array of float to convert and pack

  • signed (bool) – Whether the 4 bit variant is signed or unsigned

Returns:

Packed array with size ceil(array.size/2) (single dimension).

Return type:

ndarray
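Both packers implement the same round-and-pack contract. A pure-NumPy sketch of that behavior, assuming the low-nibble-first layout of the ONNX helper; pack_4bit_sketch is a hypothetical name:

    import numpy as np

    def pack_4bit_sketch(array, signed=True):
        # Round to the nearest 4-bit value, then pack element 2i into the
        # low nibble and element 2i+1 into the high nibble of byte i.
        lo, hi = (-8, 7) if signed else (0, 15)
        q = np.clip(np.rint(np.asarray(array, dtype=np.float32)), lo, hi)
        nibbles = q.astype(np.int8).ravel().astype(np.uint8) & 0x0F
        if nibbles.size % 2:                     # pad odd-length input
            nibbles = np.append(nibbles, np.uint8(0))
        return nibbles[0::2] | (nibbles[1::2] << 4)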

pack_weights_to_int4(weight)

Converts ONNX model weights from high precision to INT4 precision.

Parameters:

weight (ndarray)

Return type:

ndarray

quant_tensor(w, block_size, quantize_axis=0, alpha=1.0, use_zero_point=False, num_bits=4)

Quantize a tensor using the given block size, alpha, and bit width, and return the quantized tensor together with its scale factors and optional zero-points.

Parameters:
  • w (ndarray)

  • block_size (int)

  • quantize_axis (int)

  • alpha (float)

  • use_zero_point (bool)

  • num_bits (int)

Returns:

A tuple containing:
  • wq: The quantized weight tensor (np.ndarray)

  • scale: The scale factors used for quantization (np.ndarray)

  • zp: The zero-point values (np.ndarray, or None if zero-point is not used)

Return type:

tuple
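A hypothetical round trip built from the find_scales_sketch and dq_sketch examples above; composing them this way is an assumption about quant_tensor's behavior rather than a statement of it:

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 32)).astype(np.float32)

    s = find_scales_sketch(w, block_size=32)        # (2, 32) block scales
    blocks = w.reshape(2, 32, 32)
    wq = np.clip(np.rint(blocks / s[:, None, :]), -8, 7).reshape(64, 32)
    zp = None                                       # symmetric case

    w_dq = dq_sketch(wq, s, block_size=32)
    print(float(np.abs(w - w_dq).max()))            # quantization error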

quantize(input, block_size, weights_scaling_factor, weights_scaling_factor_2)

Converts a tensor to a quantized format based on NVFP4 quantization.

Parameters:
  • input (ndarray)

  • block_size (int)

  • weights_scaling_factor (ndarray)

  • weights_scaling_factor_2 (ndarray)
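The FP4 cast at the heart of NVFP4 quantization can be illustrated as rounding to the nearest E2M1 value after the scales are applied; the value set below is the standard E2M1 table, but the rounding details are an assumption:

    import numpy as np

    E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0],
                           dtype=np.float32)

    def cast_to_fp4_sketch(x):
        # Round each magnitude to the nearest representable E2M1 value,
        # keeping the sign; tie handling may differ from the library's.
        idx = np.abs(np.abs(x)[..., None] - E2M1_VALUES).argmin(axis=-1)
        return np.sign(x) * E2M1_VALUES[idx]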

reshape_scales_for_per_channel_nodes(scales_map, block_size, layer_info=None)

Update the scale map for per-channel nodes. For per-channel quantization the scale must be 1D.

Parameters:
  • scales_map (dict[str, np.ndarray]) – Dictionary mapping weight names to scale arrays.

  • block_size (int) – Block size used for quantization.

  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to layer configuration dict.

Returns:

Updated scales map.

Return type:

dict[str, np.ndarray]
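For example, a per-channel scale produced with a trailing singleton block dimension would be flattened to the 1D layout per-channel quantization requires (shapes here are illustrative):

    import numpy as np

    s = np.ones((128, 1), dtype=np.float32)   # one scale per output channel
    s_1d = s.reshape(-1)                      # shape (128,)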

rtn(w, s, block_size, quantize_axis=0, zp=None, num_bits=4)

Quantizes w with scale factors s via Round-to-Nearest.

Ties are broken by rounding to the nearest even number.

Parameters:
  • w (ndarray)

  • s (ndarray)

  • block_size (int)

  • quantize_axis (int)

  • zp (ndarray)

  • num_bits (int)

Return type:

ndarray
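np.rint implements exactly this round-half-to-even tie-breaking:

    import numpy as np

    # Ties at .5 round to the nearest even integer:
    print(np.rint(np.array([0.5, 1.5, 2.5, 3.5])))   # [0. 2. 2. 4.]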

update_block_size(block_size, layer_info=None, name=None, quantize_axis=0, w=None)

Update the block size for quantization.

Parameters:
  • block_size (int) – Current block size.

  • layer_info (dict[str, dict] | None) – Optional dictionary mapping tensor names to layer configuration dict.

  • name (str | None) – Name of the tensor.

  • quantize_axis (int) – Axis along which to quantize.

  • w (np.ndarray) – Weight tensor to be quantized.

Returns:

Updated block size.

Return type:

int
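A hypothetical sketch of the kind of logic such a helper often implements; the per-layer override and the convention that a non-positive block size expands to the full axis length (per-channel) are assumptions, not confirmed by this reference:

    import numpy as np

    def update_block_size_sketch(block_size, layer_info=None, name=None,
                                 quantize_axis=0, w=None):
        # Per-layer override first; the "block_size" key name is assumed.
        if layer_info and name in layer_info:
            block_size = layer_info[name].get("block_size", block_size)
        # Assumed convention: -1/0 means the full axis (per-channel).
        if block_size in (-1, 0) and w is not None:
            block_size = w.shape[quantize_axis]
        return block_size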