quant_utils
Utils for quantization, including scaling factor adjustments.
Functions
adjust_attn_amax_values | Adjusts the amax values for the attention layers.
all_items_same | Checks if all elements in the provided list are the same.
from_quantized_weight | Converts the quantized weight to the target torch_dtype format.
get_activation_scaling_factor | Returns the activation scaling factor.
get_kv_cache_dtype | Returns the kv_cache dtype.
get_kv_cache_scaling_factor | Returns the kv_cache scaling factor if output quantizer is set.
get_prequant_scaling_factor | Returns the prequant scaling factor.
get_quantization_format | Gets the quantization string.
get_scaling_factor | Returns scaling factor from the quantizer as torch.Tensor.
get_scaling_factor_from_weight | Calculates the weight scaling factor for a given group size.
get_weight_block_size | Returns the weight block size.
get_weight_scaling_factor | Returns the weight scaling factor.
get_weight_scaling_factor_2 | Returns the secondary weight scaling factor.
postprocess_state_dict | Filters out keys related to weight quantizers and updates KV cache related keys.
preprocess_linear_fusion | Preprocesses the quantized linears that we plan to fuse.
process_layer_quant_config | Processes per-layer quantization information for TRTLLM export to quant_cfg.json.
resmooth_and_get_scale | Resmooths weights from a single or multiple ranks and gets scaling factors and amax.
to_quantized_weight | Converts the weight to the quantized (packed) format.
- adjust_attn_amax_values(module)
Adjusts the amax values for the attention layers.
- all_items_same(item_list)
Checks if all elements in the provided list are the same.
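Example (a minimal sketch; the import path modelopt.torch.export.quant_utils is an assumption and may differ between releases):

```python
from modelopt.torch.export.quant_utils import all_items_same

# Useful when checking that every rank or expert reports the same quantization format.
assert all_items_same(["fp8", "fp8", "fp8"])
assert not all_items_same(["fp8", "int4_awq"])
```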
- from_quantized_weight(weight, weights_scaling_factor, quantization, torch_dtype)
Converts the quantized weight to the target torch_dtype format.
- Parameters:
weight (Tensor) –
weights_scaling_factor (Tensor) –
quantization (str) –
torch_dtype (torch.dtype) – The target dtype of the returned weight.
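Example (a hedged sketch of unpacking an INT8-quantized weight back to bfloat16; the import path, the "int8_sq" format string, and the per-channel scale layout are assumptions):

```python
import torch

from modelopt.torch.export.quant_utils import from_quantized_weight

# Stand-ins for a packed weight and its scaling factor as produced by to_quantized_weight.
packed_weight = torch.randint(-128, 128, (4096, 4096), dtype=torch.int8)
scale = torch.rand(4096, dtype=torch.float32)  # per-output-channel scale (assumed layout)

# Recover a floating-point weight, e.g. for inspection or re-export.
weight_bf16 = from_quantized_weight(packed_weight, scale, "int8_sq", torch.bfloat16)
```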
- get_activation_scaling_factor(module)
Returns the activation scaling factor.
- Parameters:
module (Module) –
- Return type:
Tensor
- get_kv_cache_dtype(modules)
Returns the kv_cache dtype.
If num_bits of output_quantizer is (4, 3), then returns FP8; if it is 8, returns int8; otherwise returns None.
- Parameters:
modules (Union[list[nn.Module], nn.Module]) – The module or list of modules to inspect.
- Returns:
The kv_cache dtype.
- Return type:
str
- get_kv_cache_scaling_factor(qkv_modules)
Returns the kv_cache scaling factor if the output quantizer is set; otherwise returns None.
- Parameters:
qkv_modules (list[Module]) –
- Return type:
Tensor
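Example (a sketch of inspecting KV-cache quantization for one attention block; the import path and the q_proj/k_proj/v_proj attribute names are assumptions, and the returned dtype strings are only indicative):

```python
from modelopt.torch.export.quant_utils import (
    get_kv_cache_dtype,
    get_kv_cache_scaling_factor,
)


def inspect_kv_cache(attn_module):
    """Report KV-cache dtype and scaling factor for one attention block."""
    # The projection attribute names below are illustrative and model-dependent.
    qkv = [attn_module.q_proj, attn_module.k_proj, attn_module.v_proj]
    kv_dtype = get_kv_cache_dtype(qkv)  # e.g. FP8, int8, or None
    kv_scale = get_kv_cache_scaling_factor(qkv) if kv_dtype is not None else None
    return kv_dtype, kv_scale
```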
- get_prequant_scaling_factor(module)
Returns the prequant scaling factor.
- Parameters:
module (Module) –
- Return type:
Tensor
- get_quantization_format(module)
Gets the quantization string.
Gets the quantization string by iterating through the module and its children. The first non-None quantization string is returned.
- Return type:
str | None
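Example (a sketch that maps every quantized submodule of a model to its detected format; the import path is an assumption):

```python
from modelopt.torch.export.quant_utils import get_quantization_format


def layer_quant_formats(model):
    """Map each named submodule to its quantization format, skipping unquantized ones."""
    return {
        name: fmt
        for name, module in model.named_modules()
        if (fmt := get_quantization_format(module)) is not None
    }
```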
- get_scaling_factor(quantizer)
Returns scaling factor from the quantizer as torch.Tensor.
- Parameters:
quantizer (TensorQuantizer) –
- Return type:
Tensor
- get_scaling_factor_from_weight(weight, group_size)
Calculate the weight scaling factor for a given group size.
- Return type:
Tensor
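Example (a sketch of deriving per-group scales directly from a full-precision weight; the group size of 128 is illustrative and the import path is an assumption):

```python
import torch

from modelopt.torch.export.quant_utils import get_scaling_factor_from_weight

weight = torch.randn(4096, 4096, dtype=torch.float16)

# 128 is a common block size for INT4 block quantization; adjust to your config.
scales = get_scaling_factor_from_weight(weight, group_size=128)
```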
- get_weight_block_size(module)
Returns the weight block size.
- Parameters:
module (Module) –
- Return type:
int
- get_weight_scaling_factor(module)
Returns the weight scaling factor.
- Parameters:
module (Module) –
- Return type:
Tensor
- get_weight_scaling_factor_2(module)
Returns the secondary weight scaling factor.
- Parameters:
module (Module) –
- Return type:
Tensor
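The module-level getters above (get_activation_scaling_factor, get_prequant_scaling_factor, get_weight_block_size, get_weight_scaling_factor, get_weight_scaling_factor_2) share one calling pattern: pass a quantized linear module and read back a tensor (or an int for the block size). A hedged sketch that collects them for export; the import path and the weight_quantizer attribute name are assumptions:

```python
from modelopt.torch.export.quant_utils import (
    get_activation_scaling_factor,
    get_prequant_scaling_factor,
    get_scaling_factor,
    get_weight_block_size,
    get_weight_scaling_factor,
    get_weight_scaling_factor_2,
)


def collect_export_scales(module):
    """Gather the scaling factors needed to export one quantized linear layer."""
    scales = {
        "activation": get_activation_scaling_factor(module),
        "prequant": get_prequant_scaling_factor(module),
        "weight": get_weight_scaling_factor(module),
        "weight_2": get_weight_scaling_factor_2(module),  # secondary scale, if the format uses one
        "block_size": get_weight_block_size(module),
    }
    # get_scaling_factor works on a TensorQuantizer directly; `weight_quantizer`
    # is the usual attribute name on quantized linears but is an assumption here.
    if hasattr(module, "weight_quantizer"):
        scales["weight_from_quantizer"] = get_scaling_factor(module.weight_quantizer)
    return scales
```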
- postprocess_state_dict(state_dict, maxbound, quantization)
Filters out keys related to weight quantizers and updates KV cache related keys.
- Parameters:
state_dict (dict) – The full model state_dict.
maxbound (float) – The maximum bound value for the output quantizer.
quantization (Optional[str]) – The KV cache quantization format.
- Returns:
Filtered state_dict without unnecessary keys like ‘_amax’ and non KV cache output quantizers.
- Return type:
dict
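Example (a sketch of producing an export-ready state_dict; 448.0 is the FP8 E4M3 maximum, and using it as maxbound together with the "fp8" format string is an assumption that only applies to FP8 KV caches):

```python
from modelopt.torch.export.quant_utils import postprocess_state_dict


def export_ready_state_dict(model):
    """Drop weight-quantizer bookkeeping keys and rewrite KV-cache keys for export."""
    # maxbound=448.0 matches FP8 E4M3; "fp8" as the KV-cache format string is assumed.
    return postprocess_state_dict(model.state_dict(), maxbound=448.0, quantization="fp8")
```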
- preprocess_linear_fusion(modules, resmooth_only=False)
Preprocess the quantized linears that we plan to fuse.
Use resmooth_only=True for MoE experts, as each individual expert is not fused.
- Parameters:
modules (list[Module]) –
resmooth_only (bool) –
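Example (a sketch of the two usage patterns; the q_proj/k_proj/v_proj attribute names and the import path are assumptions):

```python
from modelopt.torch.export.quant_utils import preprocess_linear_fusion


def prepare_attention_for_fusion(attn_module):
    """Align scales across q/k/v projections that will later be fused into one GEMM."""
    preprocess_linear_fusion([attn_module.q_proj, attn_module.k_proj, attn_module.v_proj])


def prepare_moe_expert_linears(expert_linears):
    """MoE experts are not fused with one another, so only resmooth their linears."""
    preprocess_linear_fusion(expert_linears, resmooth_only=True)
```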
- process_layer_quant_config(layer_config_dict)
Processes per layer quantization information for TRTLLM export to quant_cfg.json.
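Example (a sketch of serializing the processed config; it assumes the input dict is built elsewhere in the export pipeline and that the result is JSON-serializable):

```python
import json

from modelopt.torch.export.quant_utils import process_layer_quant_config


def write_quant_cfg(layer_config_dict, path="quant_cfg.json"):
    """Convert per-layer quantization info and write it to quant_cfg.json."""
    quant_cfg = process_layer_quant_config(layer_config_dict)
    with open(path, "w") as f:
        json.dump(quant_cfg, f, indent=4)
    return quant_cfg
```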
- resmooth_and_get_scale(merged_weights, pre_quant_scales, ranks, group_size, new_pre_quant_scale=None, quantization=None)
Resmooths weights from a single or multiple ranks and get scaling factors and amax.
- Parameters:
merged_weights (Tensor) – Merged weights from ranks.
pre_quant_scales (list[Tensor]) – List of pre-quantization scales for each rank.
ranks (int) – Number of ranks.
group_size (int) – Group size of the quantization block.
new_pre_quant_scale (optional) – If not provided, weights will be resmoothed using the average of pre_quant_scales.
quantization (str | None) –
- Returns:
weights – Resmoothed weights.
weight_scaling_factors – Resmoothed scaling factors.
avg_pre_quant_scale – Calculated average of the pre-quantization scales.
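Example (a sketch of resmoothing weights merged from two tensor-parallel ranks; per the parameters above, omitting new_pre_quant_scale makes the function average the provided pre_quant_scales; the shapes, concatenation layout, group size, and import path are assumptions):

```python
import torch

from modelopt.torch.export.quant_utils import resmooth_and_get_scale

ranks, out_features, in_features = 2, 4096, 4096

# Weights concatenated from two ranks, plus each rank's pre-quantization scale.
merged_weights = torch.randn(ranks * out_features, in_features, dtype=torch.float16)
pre_quant_scales = [torch.rand(in_features, dtype=torch.float16) for _ in range(ranks)]

# With new_pre_quant_scale omitted, the average of pre_quant_scales is used.
weights, weight_scaling_factors, avg_pre_quant_scale = resmooth_and_get_scale(
    merged_weights,
    pre_quant_scales,
    ranks,
    group_size=128,  # illustrative block size
)
```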
- to_quantized_weight(weight, weights_scaling_factor, quantization, weights_scaling_factor2=None, block_size=None)
Converts the weight to the quantized (packed) format.
- Parameters:
weight (Tensor) –
weights_scaling_factor (Tensor) –
quantization (str) –
weights_scaling_factor2 (Tensor | None) –
block_size (int | None) –
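Example (a sketch of packing a weight with per-group scales; the "int4_awq" format string and the pairing with get_scaling_factor_from_weight are assumptions, so substitute the constants your configuration actually uses):

```python
import torch

from modelopt.torch.export.quant_utils import (
    get_scaling_factor_from_weight,
    to_quantized_weight,
)

weight = torch.randn(4096, 4096, dtype=torch.float16)
block_size = 128  # illustrative group size

# Derive per-group scales from the weight, then pack it into the quantized format.
scales = get_scaling_factor_from_weight(weight, block_size)
packed = to_quantized_weight(weight, scales, "int4_awq", block_size=block_size)
```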