quant_utils

Provides basic utilities for use in quantize() methods.

Functions

pack_float32_to_4bit_cpp_based

Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into one byte.

pack_float32_to_4bit_optimized

Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into one byte.

pack_float32_to_4bit_cpp_based(array, signed)

Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into one byte.

This is an optimized version of the pack_float32_to_4bit() utility in the ONNX helper file. The main optimization is that the round_and_pack logic is implemented in C++, which is expected to be faster.

Parameters:
  • array (ndarray | Sequence) – Array of float32 values to convert and pack.

  • signed (bool) – Whether the 4-bit type is signed or unsigned.

Returns:

One-dimensional packed array of size ceil(array.size / 2).

Return type:

ndarray

pack_float32_to_4bit_optimized(array, signed)

Convert an array of float32 values to a 4-bit data type and pack every two consecutive elements into one byte.

This is an optimized version of the pack_float32_to_4bit() utility in the ONNX helper file. The optimizations mainly consist of moving common code out of the per-element function calls and loops, so that it runs once per input array instead of once per input element. The remaining logic is largely unchanged.

Parameters:
  • array (ndarray | Sequence) – Array of float32 values to convert and pack.

  • signed (bool) – Whether the 4-bit type is signed or unsigned.

Returns:

One-dimensional packed array of size ceil(array.size / 2).

Return type:

ndarray
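
To illustrate the per-array approach described for pack_float32_to_4bit_optimized, here is a minimal NumPy sketch. It is not this module's implementation: the function name pack_4bit_sketch is hypothetical, and the sketch assumes the behaviour of the ONNX helper as commonly described, namely clipping to [-8, 7] (signed) or [0, 15] (unsigned), rounding to the nearest integer, padding odd-length input with a zero, and storing the first element of each pair in the low nibble.

```python
import numpy as np


def pack_4bit_sketch(array, signed):
    """Hypothetical illustration of float32 -> packed 4-bit conversion.

    Assumptions (not confirmed by this module's documentation):
    values are rounded then clipped to the 4-bit range, odd-length
    input is zero-padded, and the first element of each pair is
    stored in the low nibble of the output byte.
    """
    # Flatten once for the whole array instead of handling each element.
    flat = np.asarray(array, dtype=np.float32).ravel()

    # Pad with a zero so that elements can be paired up.
    if flat.size % 2 == 1:
        flat = np.append(flat, np.float32(0))

    # Round and clip the entire array in one vectorized step.
    lo, hi = (-8, 7) if signed else (0, 15)
    quantized = np.clip(np.rint(flat), lo, hi).astype(np.int16)

    # Keep only the low 4 bits (two's complement for negative values).
    nibbles = (quantized & 0x0F).astype(np.uint8)

    # Even-indexed elements go to the low nibble, odd-indexed to the high nibble.
    return nibbles[0::2] | (nibbles[1::2] << 4)


packed = pack_4bit_sketch([1.0, 2.0, 3.0], signed=False)
print(packed.shape, packed.dtype)
```

The result has ceil(array.size / 2) elements of type uint8, matching the return description above. Every step operates on the whole array, which is the kind of per-array restructuring the optimized function describes.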