quant_module

Base class for quantization modules.

Classes

QuantInputBase

Base class for modules where the input is quantized.

QuantLinearConvBase

Base class for quantized linear modules.

QuantModule

A base class for quantized modules.

class QuantInputBase

Bases: QuantModule

Base class for modules where the input is quantized.

default_quant_desc_input = QuantizerAttributeConfig(enable=True, num_bits=8, axis=None, fake_quant=True, unsigned=False, narrow_range=False, learn_amax=False, type='static', block_sizes=None, bias=None, trt_high_precision_dtype='Float', calibrator='max', rotate=False, pass_through_bwd=False, backend=None, backend_extra_args=None)
default_quant_desc_output = QuantizerAttributeConfig(enable=True, num_bits=8, axis=None, fake_quant=True, unsigned=False, narrow_range=False, learn_amax=False, type='static', block_sizes=None, bias=None, trt_high_precision_dtype='Float', calibrator='max', rotate=False, pass_through_bwd=False, backend=None, backend_extra_args=None)
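
These class-level defaults can be replaced before converting a model, for example to change the bit width for all instances. The sketch below is illustrative only: it assumes QuantizerAttributeConfig accepts the keyword arguments shown in the reprs above, and the import paths are inferred from this module's name rather than confirmed API:

    # Illustrative sketch; import paths are assumptions inferred from the
    # package layout and may differ across versions.
    from modelopt.torch.quantization.config import QuantizerAttributeConfig
    from modelopt.torch.quantization.nn.modules.quant_module import QuantInputBase

    # Swap the 8-bit default for a 4-bit one. Keyword names mirror the
    # fields shown in the default reprs above.
    QuantInputBase.default_quant_desc_input = QuantizerAttributeConfig(
        enable=True,
        num_bits=4,
        fake_quant=True,
        calibrator="max",
    )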
forward(input, *args, **kwargs)

Quantize the input before calling the original forward method.

input_quantizer: TensorQuantizer
output_quantizer: TensorQuantizer
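
The forward contract can be pictured with a small, self-contained toy in plain PyTorch (not the library's implementation): the input passes through input_quantizer before the wrapped forward runs, and output_quantizer is applied afterwards when enabled:

    import torch
    import torch.nn as nn

    class FakeQuantStub(nn.Module):
        """Toy stand-in for TensorQuantizer: symmetric fake quantization."""

        def __init__(self, amax: float = 1.0, num_bits: int = 8):
            super().__init__()
            self.bound = 2 ** (num_bits - 1) - 1
            self.scale = amax / self.bound

        def forward(self, x):
            # Quantize-dequantize: snap to the integer grid but stay in float.
            q = torch.clamp(torch.round(x / self.scale), -self.bound, self.bound)
            return q * self.scale

    class ToyQuantInput(nn.Module):
        """Mimics QuantInputBase.forward: quantize input, then run the original forward."""

        def __init__(self, inner: nn.Module):
            super().__init__()
            self.inner = inner
            self.input_quantizer = FakeQuantStub()
            self.output_quantizer = FakeQuantStub()

        def forward(self, x):
            x = self.input_quantizer(x)  # quantize the input first
            return self.output_quantizer(self.inner(x))

    y = ToyQuantInput(nn.ReLU())(torch.randn(4))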
class QuantLinearConvBase

Bases: QuantInputBase

Base class for quantized linear modules.

Quantized linear modules are modules where both the input and the weight are quantized.

default_quant_desc_weight = QuantizerAttributeConfig(enable=True, num_bits=8, axis=None, fake_quant=True, unsigned=False, narrow_range=False, learn_amax=False, type='static', block_sizes=None, bias=None, trt_high_precision_dtype='Float', calibrator='max', rotate=False, pass_through_bwd=False, backend=None, backend_extra_args=None)
forward(input, *args, **kwargs)

Quantize the input and the weight before calling the original forward method.

quantize_weight()

Context in which self.weight is quantized.

weight_quantizer: TensorQuantizer | SequentialQuantizer
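
Both behaviors above, input-plus-weight quantization in forward and the quantize_weight() context, can be sketched with a self-contained toy (again, not the library code):

    import contextlib
    import torch
    import torch.nn as nn

    class ToyQuantLinear(nn.Linear):
        """Toy analogue of QuantLinearConvBase: input and weight fake-quantized."""

        def __init__(self, in_features: int, out_features: int):
            super().__init__(in_features, out_features)
            # Toy 8-bit grid; assumes values roughly in [-1, 1].
            self._fake_quant = lambda t: torch.round(t * 127) / 127

        @contextlib.contextmanager
        def quantize_weight(self):
            # Context in which self.weight is (fake-)quantized; the original
            # parameter is restored on exit.
            original = self.weight
            try:
                self.weight = nn.Parameter(self._fake_quant(original.detach()))
                yield
            finally:
                self.weight = original

        def forward(self, x):
            # Quantize the input and the weight before the original forward.
            with self.quantize_weight():
                return super().forward(self._fake_quant(x))

    out = ToyQuantLinear(8, 4)(torch.randn(2, 8))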
class QuantModule

Bases: DynamicModule

A base class for quantized modules.

The class also provides a parallel_state attribute that can be used to access the parallel state of the module.

classmethod convert(module, **setup_kwargs)

Convert the module to a dynamic module.

Parameters:
  • module (Module)

  • setup_kwargs (Any)

Return type:

QuantModule
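
A typical call, assuming a concrete subclass such as QuantLinear is available (the subclass name and import path here are assumptions for illustration, not confirmed by this page):

    import torch.nn as nn
    # ASSUMPTION: QuantLinear and its import path are illustrative.
    from modelopt.torch.quantization.nn import QuantLinear

    linear = nn.Linear(128, 64)
    # convert() wraps the plain module as a dynamic quantized module and
    # returns it typed as a QuantModule.
    qlinear = QuantLinear.convert(linear)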

fold_weight()

Fold the weight for faster eval.
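
Folding conceptually bakes the quantized values into the stored weight once, so evaluation no longer re-quantizes the weight on every forward. A toy illustration of the idea (not the library implementation):

    import torch

    def fold_weight_toy(module):
        """Bake the fake-quantized weight into the parameter for faster eval."""
        with torch.no_grad():
            module.weight.copy_(module.weight_quantizer(module.weight))
        # After folding, weight quantization becomes a no-op at eval time.
        module.weight_quantizer = torch.nn.Identity()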

modelopt_post_restore(prefix='')

Post-restore to correctly configure the TensorQuantizer states.

TensorQuantizer states are restored to their shape before saving; this method then completes their configuration.
  1. For non-sharded modules, this simply involves moving the TensorQuantizer states to the right device. This applies to regular PyTorch models and HuggingFace models.

  2. For sharded modules, the restored TensorQuantizer states could be incorrect, because parallelism such as TP might have changed between saving and restoring. The state shapes must then be re-calculated, so such modules should override this method and implement their own logic.

Parameters:

prefix (str)
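
For the sharded case in point 2, an override would recompute quantizer state shapes from the current parallel layout before the device move. A hypothetical skeleton (every name except modelopt_post_restore is illustrative):

    class ShardedQuantLinear:  # hypothetical; would derive from a QuantModule subclass
        def modelopt_post_restore(self, prefix: str = "") -> None:
            # 1. Recompute per-quantizer state shapes from the CURRENT
            #    tensor-parallel layout; it may differ from save time.
            # 2. Then move the states to the right device, as the base
            #    class does for non-sharded modules.
            ...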

property parallel_state: ParallelState | None

Return the parallel state of the quant module.