tensor_quant

Basic tensor quantization functions.

Classes

FakeAffineTensorQuantFunction

Fake version of affine quantization.

FakeTensorQuantFunction

Fake version of TensorQuantFunction that uses a CUDA extension.

LegacyFakeTensorQuantFunction

Fake version of TensorQuantFunction.

QuantDescriptor

alias of ScaledQuantDescriptor

ScaledE4M3Function

E4M3fy input with scale.

ScaledQuantDescriptor

Supportive descriptor of quantization.

TensorQuantFunction

A universal tensor quantization function.

Functions

scaled_e4m3_abstract

Register an abstract implementation for scaled_e4m3.

class FakeAffineTensorQuantFunction

Bases: Function

Fake version of affine quantization.

gemmlowp-style scale+shift quantization. See https://github.com/google/gemmlowp/blob/master/doc/quantization.md for details.

We DO NOT recommend affine quantization of weights, for performance reasons. There may be value in affine-quantizing activations, since the shift can be cancelled by the bias and comes with no performance penalty. This functionality is added for experimental purposes only.

static backward(ctx, grad_outputs)

Implements straight through estimation with clipping.

Parameters:
  • ctx – Pytorch convention.

  • grad_outputs – A tensor of gradients of the outputs.

Returns:

A tensor of gradient

Return type:

grad_inputs

static forward(ctx, inputs, min_range, max_range, num_bits=8)

As this will only be applied to activations with per-tensor granularity, broadcasting is not needed.

Parameters:
  • ctx – Pytorch convention.

  • inputs – A Tensor of type float32.

  • min_range – A float.

  • max_range – A float.

  • num_bits – An integer. Default 8.

Returns:

A Tensor of type output_dtype

Return type:

outputs
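As a concrete sketch of the scale+shift scheme above, here is a minimal NumPy illustration (not the library's implementation; the helper name fake_affine_quant is hypothetical):

```python
import numpy as np

def fake_affine_quant(x, min_range, max_range, num_bits=8):
    """Quantize-dequantize with an asymmetric (affine) scheme.

    Sketch of gemmlowp-style quantization: real = scale * (q - zero_point).
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (max_range - min_range) / (qmax - qmin)
    zero_point = np.round(qmin - min_range / scale)
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale  # dequantized ("fake quant") output

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
y = fake_affine_quant(x, min_range=-1.0, max_range=1.0)
```

With an 8-bit range, each value lands on the nearest of 256 levels between min_range and max_range, so the round trip stays within one quantization step of the input.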

class FakeTensorQuantFunction

Bases: Function

Fake version of TensorQuantFunction that uses a CUDA extension.

static backward(ctx, grad_outputs)

Implements straight through estimation with clipping.

static forward(ctx, inputs, amax, num_bits=8, unsigned=False, narrow_range=True)

Forward method.

static symbolic(g, inputs, amax, num_bits=8, unsigned=False, narrow_range=True)

ONNX symbolic function.

class LegacyFakeTensorQuantFunction

Bases: Function

Fake version of TensorQuantFunction.

See the comments of TensorQuantFunction; the arguments are the same.

static backward(ctx, grad_outputs)

Implements straight through estimation.

static forward(ctx, inputs, amax, num_bits=8, unsigned=False, narrow_range=True)

Forward method.

QuantDescriptor

alias of ScaledQuantDescriptor

class ScaledE4M3Function

Bases: Function

E4M3fy input with scale.

static backward(ctx, grad_outputs)

Implements straight through estimation with clipping.

static forward(ctx, inputs, amax, E, M)

Forward method.

static symbolic(g, inputs, amax=None, E=4, M=3)

ONNX symbolic function.
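For illustration, the E4M3 rounding step can be sketched as follows. This is a minimal NumPy sketch under assumed E4M3 (fn) semantics: bias-7 exponent, 3 mantissa bits, max finite value 448, smallest normal 2**-6, subnormal step 2**-9. It is not the library's kernel, which additionally applies a scale derived from amax:

```python
import numpy as np

def e4m3_round(x):
    """Round a value to the nearest FP8 E4M3 representable number (sketch)."""
    x = float(np.clip(x, -448.0, 448.0))  # saturate to the E4M3 range
    if x == 0.0:
        return 0.0
    e = int(np.floor(np.log2(abs(x))))
    e = max(e, -6)                 # subnormals share the minimum exponent
    step = 2.0 ** (e - 3)          # spacing between adjacent mantissa values
    return float(np.round(x / step) * step)
```

For example, 3.1 rounds to 3.0, since the representable values near it are 3.0 and 3.25.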

class ScaledQuantDescriptor

Bases: object

Supportive descriptor of quantization.

Describes how a tensor should be quantized. A QuantDescriptor and a tensor together define a quantized tensor.

Parameters:
  • num_bits

    An integer or a tuple of two integers. Specifically, num_bits can be:

    1. A positive integer argument for integer quantization. num_bits specifies the number of bits used for integer quantization.

    2. A constant integer tuple (E, M) for floating-point quantization emulating NVIDIA's FPx quantization. E is the number of exponent bits and M is the number of mantissa bits. Supported FPx quantizations: FP8 with (E=4, M=3).

    Default: 8.

  • name – Seems a nice thing to have

  • fake_quant – A boolean. If True, use fake quantization mode. Default True.

  • axis – None, an int, or a tuple of ints. The specified axis/axes will each have their own amax for computing the scaling factor. If None (the default), a per-tensor scale is used. Must be in the range [-rank(input_tensor), rank(input_tensor)). E.g., for a KCRS weight tensor, quant_axis=(0) yields per-channel scaling.

  • block_sizes

    None or a dictionary. The dictionary specifies block quantization parameters: the keys are the axes for block quantization, and the values are the block sizes for quantization along the respective axes. Keys must be in the range [-rank(input_tensor), rank(input_tensor)]. Values, the block sizes for quantization, must be positive integers.

    In addition, there can be special string keys “type” and “scale_bits”. Key “type” should map to “dynamic” or “static”, where “dynamic” indicates dynamic block quantization and “static” indicates static calibrated block quantization. By default, the type is “static”. Key “scale_bits” specifies the quantization bits for the per-block quantization scale factor (i.e., a double quantization scheme). Key “scale_block_sizes” specifies the block size for double quantization. By default, the per-block quantization scale is not quantized.

    For example, block_sizes = {-1: 32} will quantize the last axis of the input tensor in blocks of size 32 with static calibration and block_sizes = {-1: 32, "type": "dynamic"} will perform dynamic block quantization. If None, block quantization is not performed. axis must be None when block_sizes is not None.

  • amax – A float or a list/ndarray of floats giving a user-specified absolute max range. If supplied, quant_axis is ignored and this is used to quantize. If learn_amax is True, it will be used to initialize the learnable amax.

  • learn_amax – A boolean. If True, learn amax.

  • scale_amax – A float. If supplied, amax is multiplied by scale_amax. Default None. Useful for quick experiments.

  • calib_method – A string, one of ["max", "histogram"], indicating which calibration to use. Except for the simple max calibration, all other methods are histogram-based.

  • unsigned – A boolean. If True, use an unsigned integer range.

  • narrow_range – A boolean. If True, a symmetric integer range for signed quantization is used.
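To make the axis parameter concrete, here is a minimal NumPy sketch (not the library API) of computing a per-channel amax and int8 scale for a KCRS weight tensor with axis 0:

```python
import numpy as np

# Hypothetical KCRS weight: 4 output channels, 2 input channels, 3x3 kernel
w = np.random.randn(4, 2, 3, 3).astype(np.float32)

# axis=(0,) keeps a separate amax per output channel (per-channel scaling);
# reduce the absolute max over all the other axes.
amax = np.abs(w).max(axis=(1, 2, 3), keepdims=True)   # shape (4, 1, 1, 1)
scale = 127.0 / amax                                  # int8, narrow range
q = np.clip(np.round(w * scale), -127, 127)           # quantized values
```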

Read-only properties: fake_quant, name, learn_amax, scale_amax, axis, calib_method, num_bits, amax, unsigned

__init__(num_bits=8, name=None, fake_quant=True, axis=None, block_sizes=None, amax=None, learn_amax=False, scale_amax=None, calib_method='max', unsigned=False, narrow_range=False, dynamic=False)

Initialize QuantDescriptor.

property amax

Return amax.

property axis

Return axis for quantization.

property block_sizes

Return block_sizes for quantization.

property calib_method

Return calibration method.

dict()

Serialize to dict.

The built-in __dict__ attribute returns all attributes, including those with default values and those with the protected prefix “_”. This method returns only the attributes whose values differ from the defaults and whose keys do not start with “_”. Constructing an instance from the dict returned by this method should yield exactly the same instance.

property dynamic

Returns True if the quantization is dynamic.

property fake_quant

Return True if fake quantization is used.

static get_block_quant_axes_and_sizes(block_sizes)

Return axes and sizes for block quantization.

Parameters:

block_sizes (dict) –
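A plausible reading of this helper, as a hypothetical re-implementation (the real function's exact behavior may differ): the integer keys of block_sizes are treated as quantization axes, while the special string keys described above are filtered out:

```python
def block_quant_axes_and_sizes(block_sizes):
    # Integer keys map axes to block sizes; string keys such as "type",
    # "scale_bits" and "scale_block_sizes" are special options, not axes.
    return {axis: size for axis, size in block_sizes.items()
            if isinstance(axis, int)}

block_quant_axes_and_sizes({-1: 32, "type": "dynamic"})  # → {-1: 32}
```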

property learn_amax

Return True if amax is learnable.

property name

Return name.

property narrow_range

Return True if symmetric integer range for signed quantization is used.

property num_bits

Return num_bits.

property scale_amax

Return scale_amax.

property unsigned

Return True if unsigned integer range is used.

class TensorQuantFunction

Bases: Function

A universal tensor quantization function.

Takes an input tensor and outputs a quantized tensor. The granularity of the scale can be interpreted from the shape of amax. output_dtype indicates whether the quantized value is stored as an integer or a float. Storing it as a float is useful because the PyTorch functions that consume the quantized value, e.g. Conv2D, may not accept integer input.

It uses 2^num_bits - 1 values instead of 2^num_bits, e.g., [-127, 127] instead of [-128, 127] for num_bits=8.
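A minimal NumPy sketch of this scheme, assuming the default narrow range (the helper name tensor_quant here is illustrative, not the library op):

```python
import numpy as np

def tensor_quant(x, amax, num_bits=8):
    """Narrow-range symmetric quantization sketch (NumPy, not the CUDA op)."""
    bound = 2 ** (num_bits - 1) - 1        # 127 for num_bits=8
    scale = bound / amax
    outputs = np.clip(np.round(x * scale), -bound, bound)
    return outputs, scale                  # outputs / scale dequantizes

q, s = tensor_quant(np.array([-2.0, -0.6, 0.3, 1.0]), amax=1.0)
# q is [-127, -76, 38, 127]: the -2.0 input saturates at -127
```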

static backward(ctx, grad_outputs, grad_scale)

Implements straight through estimation with clipping.

For -amax <= input <= amax the gradient passes straight through, otherwise the gradient is zero.

Parameters:
  • ctx – A Context object with saved tensors from forward.

  • grad_outputs – A tensor of gradient of outputs.

  • grad_scale – A tensor of gradient of scale.

Returns:

A tensor of gradient.

Return type:

grad_inputs
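The clipping rule above can be sketched as follows (NumPy, illustrative only; the actual op works through PyTorch autograd):

```python
import numpy as np

def ste_backward(inputs, amax, grad_outputs):
    # Straight-through estimator: pass the gradient through where
    # -amax <= input <= amax, and zero it outside the clipped range.
    mask = np.abs(inputs) <= amax
    return grad_outputs * mask

g = ste_backward(np.array([-2.0, -0.5, 0.8, 3.0]), 1.0, np.ones(4))
# g is [0, 1, 1, 0]: gradients survive only inside [-amax, amax]
```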

static forward(ctx, inputs, amax, num_bits=8, unsigned=False, narrow_range=True)

Forward method.

Following TensorFlow convention, the max value is passed in and used to compute the scale, instead of passing the scale directly, though passing the scale directly might be more natural to use.

Parameters:
  • ctx – A Context object to store tensors for backward.

  • inputs – A Tensor of type float32.

  • amax – A Tensor of type float32. Inputs will be quantized within the range [-amax, amax]; amax will be broadcast to the inputs tensor.

  • num_bits – An integer used to calculate the scaling factor, scale = (2^(num_bits-1) - 1) / max. Effectively, it indicates how many integer bits are used to represent the value. Default 8.

  • output_dtype – A type of Tensor. torch.int32 or torch.float32.

  • unsigned – A boolean. Use unsigned integer range. E.g. [0, 255] for num_bits=8. Default False.

  • narrow_range – A boolean. Use a symmetric integer range for signed quantization, e.g. [-127, 127] instead of [-128, 127] for num_bits=8. Default True.

Returns:

A Tensor of type output_dtype, and scale, a Tensor of type float32; outputs / scale dequantizes the outputs tensor.

Return type:

outputs

Raises:

ValueError

static symbolic(g, inputs, amax, num_bits=8, unsigned=False, narrow_range=True)

ONNX symbolic function.

scaled_e4m3_abstract(input, amax)

Register an abstract implementation for scaled_e4m3.

This abstract function returns an empty tensor with the same shape and dtype as the input.

Parameters:
  • input (Tensor) –

  • amax (Tensor) –

Return type:

Tensor