utils

Quantization utilities.

Functions

convert_quantization_axis_to_reduce_axis

Convert the quantization axis to the reduce axis.

export_torch_mode

Context manager enabling the export mode.

is_quantized

Check if a module is quantized.

is_quantized_column_parallel_linear

Check if a module is a quantized column parallel linear module.

is_quantized_linear

Check if a module is a quantized linear module.

is_quantized_row_parallel_linear

Check if a module is a quantized row parallel linear module.

reduce_amax

Compute the absolute maximum value of a tensor.

reduce_sum

Compute the sum of a tensor along specified axes.

replace_function

Replace a function with a new one within a context.

representative_weight_quantizer

Return the representative weight quantizer for weight_name on module.

update_quant_cfg_with_kv_cache_quant

Update the quant_cfg with the kv cache quant_cfg.

weight_attr_names

Get the weight parameter attribute names in a converted module (non-recursive).

convert_quantization_axis_to_reduce_axis(input, axis)

Convert the quantization axis to the reduce axis.

Parameters:
  • input (torch.Tensor) – The input tensor.

  • axis (int, tuple, list, or None) – The quantization axis. None means per-tensor quantization.

Returns:

The axis to reduce. None suggests all dimensions should be reduced.

Return type:

list or None

export_torch_mode()

Context manager enabling the export mode.

is_quantized(module)

Check if a module is quantized.

is_quantized_column_parallel_linear(module)

Check if a module is a quantized column parallel linear module.

is_quantized_linear(module)

Check if a module is a quantized linear module.

is_quantized_row_parallel_linear(module)

Check if a module is a quantized row parallel linear module.

reduce_amax(input, axis=None, keepdims=True, squeeze_scalar=True)

Compute the absolute maximum value of a tensor.

Reduces input along the dimensions given in axis. Unless keepdims is true, the rank of the tensor is reduced by 1 for each entry in axis. If keepdims is true, the reduced dimensions are retained with length 1.

Note

Gradient computation is disabled as this function is never meant for learning.

Parameters:
  • input – Input tensor

  • axis – The dimensions to reduce. None or int or tuple of ints. If None (the default), reduces all dimensions. Must be in the range [-rank(input_tensor), rank(input_tensor)).

  • keepdims – A boolean. If true, retains reduced dimensions with length 1. Defaults to True.

Returns:

The reduced tensor.
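A minimal sketch of these semantics (an assumption pieced together from the docstring, not the library source; the squeeze_scalar handling is omitted for brevity):

```python
import torch

def reduce_amax(input, axis=None, keepdims=True):
    # Gradients are disabled: amax statistics are calibration data,
    # not something to learn through.
    with torch.no_grad():
        if axis is None:
            return input.abs().max()  # global scalar amax
        return input.abs().amax(dim=axis, keepdim=keepdims)

x = torch.tensor([[1.0, -3.0], [2.0, -0.5]])
per_row = reduce_amax(x, axis=-1)  # tensor([[3.0], [2.0]])
```

Passing a tuple for axis reduces several dimensions at once, which is what per-channel quantization scales typically need.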

reduce_sum(input, axis=None, keepdims=True)

Compute the sum of a tensor along specified axes.

Reduces input along the dimensions given in axis. Unless keepdims is true, the rank of the tensor is reduced by 1 for each entry in axis. If keepdims is true, the reduced dimensions are retained with length 1.

Note

Gradient computation is disabled as this function is never meant for learning.

Parameters:
  • input – Input tensor

  • axis – The dimensions to reduce. None or int or tuple of ints. If None (the default), reduces all dimensions. Must be in the range [-rank(input_tensor), rank(input_tensor)).

  • keepdims – A boolean. If true, retains reduced dimensions with length 1. Defaults to True.

Returns:

The reduced tensor.

replace_function(package, name, new_func, og_func_cache_name=None)

Replace a function with a new one within a context.
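The idea can be sketched with contextlib. This is a hypothetical implementation inferred from the signature; the real utility may differ, for instance in how og_func_cache_name caches the original function.

```python
import contextlib
import math

@contextlib.contextmanager
def replace_function(package, name, new_func):
    # Swap in new_func for the duration of the with-block, then restore
    # the original even if the block raises.
    og_func = getattr(package, name)
    setattr(package, name, new_func)
    try:
        yield
    finally:
        setattr(package, name, og_func)

with replace_function(math, "sqrt", lambda x: -1.0):
    patched = math.sqrt(4.0)   # -1.0 while patched
restored = math.sqrt(4.0)      # 2.0 after restore
```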

representative_weight_quantizer(module, weight_name='weight')

Return the representative weight quantizer for weight_name on module.

Handles two layouts:

  • singular <name>_weight_quantizer — standard nn.Linear / _QuantLinear.

  • plural <name>_weight_quantizers (nn.ModuleList) — fused-experts modules (_QuantFusedExperts) hold one TensorQuantizer per expert. Per-expert formats are identical, so the first element is representative.

Returns None if no matching quantizer is found.

Parameters:
  • module (Module)

  • weight_name (str)
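A plain-Python sketch of the lookup over these two layouts. The attribute-naming scheme here is an assumption pieced together from the bullets above, and the SimpleNamespace fakes stand in for real quantized modules.

```python
from types import SimpleNamespace

def representative_weight_quantizer(module, weight_name="weight"):
    # Assumed naming: "weight" -> weight_quantizer(s); a custom name like
    # "gate_up_proj" -> gate_up_proj_weight_quantizer(s).
    prefix = "weight" if weight_name == "weight" else f"{weight_name}_weight"
    quantizer = getattr(module, f"{prefix}_quantizer", None)
    if quantizer is not None:
        return quantizer       # singular layout
    quantizers = getattr(module, f"{prefix}_quantizers", None)
    if quantizers:
        return quantizers[0]   # plural layout: first expert is representative
    return None                # no matching quantizer found

linear = SimpleNamespace(weight_quantizer="q")
experts = SimpleNamespace(gate_up_proj_weight_quantizers=["q0", "q1"])
```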

update_quant_cfg_with_kv_cache_quant(quant_cfg, kv_cache_quant_cfg)

Update the quant_cfg with the kv cache quant_cfg.

Parameters:
  • quant_cfg (dict[str, Any]) – The outer quantization config dict (with "quant_cfg" and "algorithm" keys).

  • kv_cache_quant_cfg (list[QuantizerCfgEntry]) – A list of QuantizerCfgEntry dicts for KV cache quantization, typically some_kv_cfg["quant_cfg"].

Returns:

A deep copy of quant_cfg with the KV cache entries appended to quant_cfg["quant_cfg"].

Return type:

dict[str, Any]
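Sketching the documented contract (the config entries below are made-up placeholders, not real QuantizerCfgEntry contents):

```python
import copy

def update_quant_cfg_with_kv_cache_quant(quant_cfg, kv_cache_quant_cfg):
    # Deep-copy first so the caller's config is never mutated in place,
    # then append the KV cache entries to the inner "quant_cfg" list.
    merged = copy.deepcopy(quant_cfg)
    merged["quant_cfg"] = list(merged["quant_cfg"]) + list(kv_cache_quant_cfg)
    return merged

base = {"quant_cfg": [{"pattern": "*weight_quantizer"}], "algorithm": "max"}
kv = [{"pattern": "*output_quantizer"}]
merged = update_quant_cfg_with_kv_cache_quant(base, kv)
```

Because of the deep copy, the caller's base config keeps its original entries untouched.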

weight_attr_names(module)

Get the weight parameter attribute names in a converted module (non-recursive).

Covers three layouts:

  • standard nn.Linear: weight + weight_quantizer.

  • custom per-weight quantizer (e.g. Llama4TextExperts with gate_up_proj + gate_up_proj_weight_quantizer).

  • fused-experts nn.ModuleList quantizers (_QuantFusedExperts with gate_up_proj + gate_up_proj_weight_quantizers plural list).

Parameters:

module (Module)

Return type:

Generator[str, None, None]