tensor_quant
Basic tensor quantization functions.
Classes

DynamicBlockQuantizationFunction | Dynamic block quantization functional.
FakeAffineTensorQuantFunction | Fake version of affine quantization.
FakeTensorQuantFunction | Fake version of TensorQuantFunction that uses the CUDA extension.
LegacyFakeTensorQuantFunction | Fake version of TensorQuantFunction.
ScaledE4M3Function | E4M3fy input with scale.
TensorQuantFunction | A universal tensor quantization function.
Functions

fake_quant_impl | Implementation of fake-quantizing the input according to the number of bits.
quantize_op_abstract | Register an abstract implementation for quantizing a tensor.
scaled_e4m3_impl | Implementation of fake-quantizing the input to FP8.
- class DynamicBlockQuantizationFunction
Bases:
Function
Dynamic block quantization functional.
- static backward(ctx, grad_outputs)
Implements straight through estimation with clipping.
- static forward(ctx, inputs, block_size, amax, num_bits, scale_bits)
Forward method.
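A minimal standalone sketch of the clipped straight-through estimation used by backward above, written as a plain torch.autograd.Function with hypothetical names and none of this module's actual kernels:

    import torch

    class _ClippedSTESketch(torch.autograd.Function):
        """Hypothetical sketch: fake-quantize in forward, clipped STE in backward."""

        @staticmethod
        def forward(ctx, inputs, amax):
            ctx.save_for_backward(inputs, amax)
            scale = 127.0 / amax                              # 8-bit symmetric scale
            return torch.clamp(torch.round(inputs * scale), -127, 127) / scale

        @staticmethod
        def backward(ctx, grad_outputs):
            inputs, amax = ctx.saved_tensors
            # Pass the gradient straight through inside [-amax, amax]; zero it outside.
            mask = (inputs.abs() <= amax).to(grad_outputs.dtype)
            return grad_outputs * mask, None                  # no gradient w.r.t. amax here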
- class FakeAffineTensorQuantFunction
Bases:
Function
Fake version of affine quantization.
gemmlowp style scale+shift quantization. See more details in https://github.com/google/gemmlowp/blob/master/doc/quantization.md.
We DO NOT recommend affine quantization of weights for performance reasons. There may be value in affine-quantizing activations, since the shift can be cancelled by the bias and comes with no performance penalty. This functionality is only added for experimental purposes; a standalone sketch of the scale+shift math follows this class.
- static backward(ctx, grad_outputs)
Implements straight through estimation with clipping.
- Parameters:
ctx – Pytorch convention.
grad_outputs – A tensor of gradient of outputs.
- Returns:
A tensor of gradients.
- Return type:
grad_inputs
- static forward(ctx, inputs, min_range, max_range, num_bits=8)
As this will only be applied to activations with per-tensor granularity, broadcasting is not needed.
- Parameters:
ctx – Pytorch convention.
inputs – A Tensor of type float32.
min_range – A float.
max_range – A float.
num_bits – An integer. Number of bits used for quantization. Default 8.
- Returns:
A Tensor of type output_dtype
- Return type:
outputs
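A minimal numeric sketch of the gemmlowp-style scale+shift scheme referenced above (standalone math only, not this module's implementation):

    import torch

    def affine_fake_quant_sketch(x, min_range, max_range, num_bits=8):
        # gemmlowp-style scale+shift: real_value = scale * (quantized_value - zero_point)
        qmin, qmax = 0, 2 ** num_bits - 1
        scale = (max_range - min_range) / (qmax - qmin)
        zero_point = round(-min_range / scale)
        q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
        return (q - zero_point) * scale           # dequantize back to float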
- class FakeTensorQuantFunction
Bases:
Function
Fake version of TensorQuantFunction that uses the CUDA extension.
- static backward(ctx, grad_outputs)
Implements straight through estimation with clipping.
- static forward(ctx, inputs, amax, num_bits=8, unsigned=False, narrow_range=True, trt_high_precision_dtype='Float')
Forward method.
- static symbolic(g, inputs, amax, num_bits=8, unsigned=False, narrow_range=True, trt_high_precision_dtype='Float')
ONNX symbolic function.
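A hedged usage sketch for this class, assuming it has been imported from this module (and that the CUDA extension is built, per the description above). As a torch.autograd.Function, it is invoked through .apply with positional arguments matching forward:

    import torch
    # from <this module> import FakeTensorQuantFunction   (exact import path not shown here)

    x = torch.randn(4, 8)
    amax = x.abs().max()                   # per-tensor amax; amax's shape sets the granularity
    # Positional arguments mirror forward(): inputs, amax, num_bits, unsigned, narrow_range.
    y = FakeTensorQuantFunction.apply(x, amax, 8, False, True)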
- class LegacyFakeTensorQuantFunction
Bases:
Function
Fake version of TensorQuantFunction.
See the comments of TensorQuantFunction; the arguments are the same.
- static backward(ctx, grad_outputs)
Implements straight through estimation.
- static forward(ctx, inputs, amax, num_bits=8, unsigned=False, narrow_range=True)
Forward method.
- class ScaledE4M3Function
Bases:
Function
E4M3fy input with scale.
- static backward(ctx, grad_outputs)
Implements straight through estimation with clipping.
- static forward(ctx, inputs, amax, E, M, trt_high_precision_dtype='Float')
Forward method.
- static symbolic(g, inputs, amax=None, E=4, M=3, trt_high_precision_dtype='Float')
ONNX symbolic function.
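As a point of reference, a standalone sketch of scaled E4M3 fake quantization using PyTorch's native float8 dtype (assumes torch.float8_e4m3fn is available; this is not this module's fused kernel). 448 is the largest finite value of the E4M3 format used here:

    import torch

    def scaled_e4m3_sketch(x, amax):
        E4M3_MAX = 448.0                          # largest finite E4M3 (e4m3fn) value
        scale = E4M3_MAX / amax                   # map [-amax, amax] onto the E4M3 range
        x_fp8 = torch.clamp(x * scale, -E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
        return x_fp8.to(x.dtype) / scale          # round-trip back and rescale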
- class TensorQuantFunction
Bases:
Function
A universal tensor quantization function.
Takes an input tensor and outputs a quantized tensor. The granularity of the scale can be interpreted from the shape of amax. output_dtype indicates whether the quantized value is stored as an integer or a float; the reason to store it in float is that the PyTorch functions consuming the quantized value (e.g. Conv2d) may not accept integer inputs.
It uses 2^num_bits - 1 values instead of 2^num_bits, e.g., for num_bits=8, it uses [-127, 127] instead of [-128, 127]. A standalone sketch of this arithmetic follows this class.
- static backward(ctx, grad_outputs, grad_scale)
Implements straight through estimation with clipping.
For -amax <= input <= amax the gradient passes straight through, otherwise the gradient is zero.
- Parameters:
ctx – A Context object with saved tensors from forward.
grad_outputs – A tensor of gradient of outputs.
grad_scale – A tensor of gradient of scale.
- Returns:
A tensor of gradients.
- Return type:
grad_inputs
- static forward(ctx, inputs, amax, num_bits=8, unsigned=False, narrow_range=True, trt_high_precision_dtype='Float')
Forward method.
Following the TensorFlow convention, the max value is passed in and used to compute the scale instead of passing the scale in directly, even though passing the scale directly may be more natural to use.
- Parameters:
ctx – A Context object to store tensors for backward.
inputs – A Tensor of type float32.
amax – A Tensor of type float32. Inputs will be quantized within the range [-amax, amax]; amax will be broadcast to the inputs tensor.
num_bits – An integer used to calculate the scaling factor: scale = (2^(num_bits - 1) - 1) / max. Effectively, it indicates how many integer bits are used to represent the value. Default 8.
output_dtype – A type of Tensor. torch.int32 or torch.float32.
unsigned – A boolean. Use unsigned integer range. E.g. [0, 255] for num_bits=8. Default False.
narrow_range – A boolean. Use symmetric integer range for signed quantization E.g. [-127,127] instead of [-128,127] for num_bits=8. Default True.
- Returns:
outputs – A Tensor of type output_dtype. scale – A Tensor of type float32; outputs / scale will dequantize the outputs tensor.
- Return type:
(outputs, scale)
- Raises:
ValueError –
- static symbolic(g, inputs, amax, num_bits=8, unsigned=False, narrow_range=True, trt_high_precision_dtype='Float')
ONNX symbolic function.
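A minimal sketch of the arithmetic documented above (not the class's actual implementation): the scale is derived from amax, and outputs / scale recovers the dequantized tensor:

    import torch

    def tensor_quant_sketch(inputs, amax, num_bits=8, unsigned=False, narrow_range=True):
        if unsigned:
            min_bound, max_bound = 0, 2 ** num_bits - 1        # e.g. [0, 255]
        else:
            max_bound = 2 ** (num_bits - 1) - 1                # e.g. 127 for num_bits=8
            min_bound = -max_bound if narrow_range else -max_bound - 1
        scale = max_bound / amax                               # broadcast against inputs
        outputs = torch.clamp(torch.round(inputs * scale), min_bound, max_bound)
        return outputs, scale                                  # outputs / scale dequantizes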
- fake_quant_impl(inputs, amax, num_bits=8, unsigned=False, narrow_range=True)
Implementation of fake-quantizing the input according to the number of bits.
- Parameters:
inputs (Tensor) –
amax (Tensor) –
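A hedged usage example, assuming fake_quant_impl is imported from this module; fake quantization quantizes and immediately dequantizes, so the output stays in floating point:

    import torch
    # from <this module> import fake_quant_impl   (exact import path not shown here)

    x = torch.randn(2, 3)
    y = fake_quant_impl(x, x.abs().max())  # defaults: num_bits=8, signed, narrow range
    # y keeps x's shape and dtype; values are snapped to the 8-bit quantization grid.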
- quantize_op_abstract(input, amax, num_bits=8, exponent_bits=0, unsigned=False, narrow_range=True)
Register an abstract implementation for quantizing a tensor.
This abstract function returns an empty tensor with the same shape and dtype.
- Parameters:
input (Tensor) –
amax (Tensor) –
num_bits (int) –
exponent_bits (int) –
unsigned (bool) –
narrow_range (bool) –
- Return type:
Tensor
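Per the description above, an abstract (fake/meta) implementation only propagates shape and dtype and never reads real data. A minimal sketch of that pattern (a hypothetical standalone function, not the op registration itself):

    import torch
    from torch import Tensor

    def quantize_op_abstract_sketch(input: Tensor, amax: Tensor, num_bits: int = 8,
                                    exponent_bits: int = 0, unsigned: bool = False,
                                    narrow_range: bool = True) -> Tensor:
        # Shape/dtype propagation only; this is what tracing/export machinery needs.
        return torch.empty_like(input)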
- scaled_e4m3_impl(inputs, amax, disable_fused_kernel=False)
Implementation of fake-quantizing the input to FP8.
- Parameters:
inputs (Tensor) – Torch tensor.
amax (Tensor) – Absolute max range of the input tensor.
- Returns:
Input tensor fake-quantized to FP8.
- Return type:
Tensor
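A hedged usage example, assuming scaled_e4m3_impl is imported from this module; passing disable_fused_kernel=True is our assumption for environments where the fused CUDA kernel is unavailable:

    import torch
    # from <this module> import scaled_e4m3_impl   (exact import path not shown here)

    x = torch.randn(16, 32)
    y = scaled_e4m3_impl(x, x.abs().max(), disable_fused_kernel=True)
    # y matches x's shape and dtype, with values snapped to the scaled E4M3 grid.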