utils
Quantization utilities.
Classes
Base class for shared quantization state owned by a group parent.
Canonical shared weight global_amax for one fusible sibling group.
Functions
Convert the quantization axis to the reduce axis.
Context manager enabling the export mode.
Find fusible sibling groups by regex over module FQNs; capture groups define the key.
Check if a module is quantized.
Check if a module is a quantized column parallel linear module.
Check if a module is a quantized linear module.
Check if a module is a quantized row parallel linear module.
Yield shared quant states owned within model.
Compute the absolute maximum value of a tensor.
Compute the sum of a tensor along specified axes.
Replace a function with a new one within a context.
Return the representative weight quantizer for weight_name on module.
Update the quant_cfg with the kv cache quant_cfg.
Get the weight param attribute names in a converted module, non-recursive.
Bases: Module, ABC
Base class for shared quantization state owned by a group parent.
Subclasses define when and how their canonical state is initialized. Runtime states can override install_hooks() to compute/cache group-level values at the parent instead of doing the same work in every member.
Initialize an empty shared-state owner.
- Return type:
None
Create this state on each discovered group’s parent.
- Parameters:
model (Module)
patterns (Sequence[str] | None)
- Return type:
int
Return whether the managed buffer(s) are populated. This is the default finalize, which hook-produced states inherit.
A state whose value is produced during the forward pass (e.g. by the shared input-amax state’s parent hook) needs only this readiness gate, because the value already exists by the time finalize is called. States that produce their value on demand override it: the weight state aggregates the members’ _amax, and SVDQuant runs an SVD. populate() skips a state whose finalize returns False (uncalibrated, meta, or forward never ran).
- Return type:
bool
Install parent/member hooks for runtime shared computation.
- Return type:
None
Return linked member modules.
Return restore metadata for this state when present in model.
- Parameters:
model (Module)
- Return type:
dict[str, bool]
Finalize and sync every state of this type in model; return the count populated.
- Parameters:
model (Module)
- Return type:
int
Remove hooks installed by install_hooks().
- Return type:
None
Resolve the max-calibration config into grouping patterns for this state.
- Parameters:
shared_states (Mapping[str, Mapping[str, Sequence[str]]] | None)
- Return type:
list[str]
Re-attach states and rebuild member aliases from members’ restored buffers.
- Parameters:
model (Module)
patterns (Sequence[str] | None)
- Return type:
None
Rebuild the canonical buffer from members’ restored buffers and re-tie.
Used only on checkpoint restore: the state is non-persistent, so it is absent until rebuilt here from the members’ (persistent, just-loaded) buffers.
- Return type:
bool
Set the owning parent and linked member modules.
- Parameters:
parent (Module)
members (Sequence[Module])
- Return type:
None
Synchronize canonical state across distributed process groups.
- Parameters:
parallel_state (ParallelState | None)
- Return type:
None
Alias a member quantizer’s managed buffers to this state’s canonical buffers.
For each managed attr, point quantizer._buffers[attr] at the same tensor object as self.<attr> (register it if absent, else replace) so the member and the state share one storage, not a copy. Records the attr in the quantizer’s _shared_quant_tied_attrs so TensorQuantizer.__setattr__ rejects a later rebind. Returns whether anything was tied.
- Parameters:
quantizer (Module)
- Return type:
bool
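The aliasing described for this method can be illustrated without PyTorch. The following is a minimal sketch under stated assumptions: ToyQuantizer and ToyState are hypothetical stand-ins, a one-element list plays the role of a tensor, and the rebind guard only imitates the behaviour attributed to TensorQuantizer.__setattr__ above. It is not the library's implementation.

```python
class ToyQuantizer:
    """Hypothetical stand-in for a member quantizer with a _buffers dict."""

    def __init__(self):
        # Bypass our own __setattr__ guard while creating the internals.
        object.__setattr__(self, "_buffers", {})
        object.__setattr__(self, "_shared_quant_tied_attrs", set())

    def __setattr__(self, name, value):
        # Reject a rebind of any attribute that has been tied to shared state.
        if name in self._shared_quant_tied_attrs:
            raise AttributeError(f"{name} is tied to shared state")
        self._buffers[name] = value

    def __getattr__(self, name):
        # Called only when normal lookup fails; fall back to the buffer dict.
        buffers = object.__getattribute__(self, "_buffers")
        if name in buffers:
            return buffers[name]
        raise AttributeError(name)


class ToyState:
    """Hypothetical shared-state owner holding one canonical buffer."""

    managed_attrs = ("_global_amax",)

    def __init__(self):
        self._global_amax = [0.0]  # a mutable list stands in for a tensor

    def tie(self, quantizer):
        tied = False
        for attr in self.managed_attrs:
            # Point the member's buffer at the very same object (no copy).
            quantizer._buffers[attr] = getattr(self, attr)
            quantizer._shared_quant_tied_attrs.add(attr)
            tied = True
        return tied


state = ToyState()
q1, q2 = ToyQuantizer(), ToyQuantizer()
state.tie(q1)
state.tie(q2)

# An in-place update of the canonical storage is visible through every member.
state._global_amax[0] = 3.5
print(q1._global_amax is state._global_amax, q2._global_amax[0])
```

Because every member holds a reference to the same object, only in-place updates keep the group consistent; that is why a later rebind on a tied attribute has to be rejected.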
Tie all eligible member quantizers to the canonical state buffers.
- Return type:
None
Bases: SharedQuantState
Canonical shared weight global_amax for one fusible sibling group.
Initialize with an unset canonical global_amax buffer.
- Return type:
None
Set global_amax to the max over members’ calibrated _amax.
- Return type:
bool
Return the canonical shared global amax.
All-reduce (MAX) global_amax across the expert-parallel (EP) group, plus the tensor-parallel (TP) group defensively.
- Parameters:
parallel_state (ParallelState | None)
- Return type:
None
Tie one member quantizer to the shared _global_amax buffer when eligible.
- Parameters:
quantizer (Module)
- Return type:
bool
- convert_quantization_axis_to_reduce_axis(input, axis)
Convert the quantization axis to the reduce axis.
- Parameters:
input (torch.Tensor) – The input tensor.
axis (int, tuple, list or None) – The quantization axis. None means per-tensor quantization.
- Returns:
The axis to reduce. None suggests all dimensions should be reduced.
- Return type:
list
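One way to read this conversion: the quantization axis names the dimensions that keep their own scale, so the reduce axis is every other dimension. The sketch below illustrates that reading. The helper name quant_axis_to_reduce_axis is hypothetical, and it takes the tensor rank instead of a tensor so that it runs without PyTorch; the library function may differ in details such as validation.

```python
def quant_axis_to_reduce_axis(rank, axis):
    """Return the dims to reduce for a tensor of the given rank.

    `axis` lists the dims that keep their own scale; all others are reduced.
    """
    if axis is None:
        return None  # per-tensor quantization: reduce every dimension
    if isinstance(axis, int):
        axis = (axis,)
    keep = set()
    for a in axis:
        if not -rank <= a < rank:
            raise ValueError(f"axis {a} out of range for rank {rank}")
        keep.add(a % rank)  # normalize negative indices
    return [d for d in range(rank) if d not in keep]


print(quant_axis_to_reduce_axis(4, 0))       # per-output-channel on a 4-D weight
print(quant_axis_to_reduce_axis(3, (0, 1)))  # two kept dims on a 3-D tensor
print(quant_axis_to_reduce_axis(2, None))    # per-tensor
```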
- export_torch_mode()
Context manager enabling the export mode.
Find fusible sibling groups by regex over module FQNs; capture groups define the key.
Each pattern is re.fullmatch-ed against every quantized module’s fully-qualified name; modules whose match yields the same capture-group tuple form one group, parented at their lowest common ancestor (LCA). Granularity is set by what you capture:
- Capture the immediate parent -> per-parent grouping: q/k/v per attention block, and per-expert w1/w3 (each expert is the immediate parent), e.g. r"(.*)\.(?:w1|w3)$".
- Capture only a level above the expert index, leaving the index uncaptured -> one cross-expert group, e.g. r"(.*)\.experts\.\d+\.(?:w1|w3)$".
Roles to fuse together go in a non-capturing alternation (?:w1|w3) so they don’t split the key; what you wrap in (...) is the group boundary. Pass SHARED_PATTERNS for the standard q/k/v + gate/up groups, or override via MaxCalibConfig.shared_states. The caller selects which quantizer these groups apply to. Returns (parent, members) tuples; empty when no patterns are given.
- Parameters:
model (Module)
patterns (Sequence[str] | None)
target_quantizer_kind (str)
- Return type:
list[tuple[Module, list[Module]]]
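The effect of the capture groups is easiest to see on plain strings. The sketch below applies the two example patterns from the description to a made-up list of module names. The helpers group_by_pattern and common_parent are hypothetical, operate on names only, and use the longest shared dotted prefix as a stand-in for the LCA module; the first-match-wins rule is an assumption of the sketch.

```python
import re
from collections import defaultdict


def group_by_pattern(names, patterns):
    """Group dotted module names by (pattern, capture-group tuple)."""
    groups = defaultdict(list)
    for name in names:
        for pattern in patterns or ():
            match = re.fullmatch(pattern, name)
            if match:
                groups[(pattern, match.groups())].append(name)
                break  # first matching pattern wins (assumption of this sketch)
    return groups


def common_parent(names):
    """Longest shared dotted prefix, standing in for the LCA module."""
    prefix = []
    for parts in zip(*(n.split(".") for n in names)):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return ".".join(prefix)


# Made-up module names for a two-expert MoE block.
names = [
    "layers.0.mlp.experts.0.w1",
    "layers.0.mlp.experts.0.w3",
    "layers.0.mlp.experts.1.w1",
    "layers.0.mlp.experts.1.w3",
]

per_expert = group_by_pattern(names, [r"(.*)\.(?:w1|w3)$"])
cross_expert = group_by_pattern(names, [r"(.*)\.experts\.\d+\.(?:w1|w3)$"])

for label, groups in (("per-expert", per_expert), ("cross-expert", cross_expert)):
    for members in groups.values():
        print(label, common_parent(members), members)
```

With the first pattern the expert index is part of the captured prefix, so each expert forms its own group; with the second the index is matched but not captured, so all four modules share one key.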
- is_quantized(module)
Check if a module is quantized.
- is_quantized_column_parallel_linear(module)
Check if a module is a quantized column parallel linear module.
- is_quantized_linear(module)
Check if a module is a quantized linear module.
- is_quantized_row_parallel_linear(module)
Check if a module is a quantized row parallel linear module.
Yield shared quant states owned within model.
- Parameters:
model (Module)
state_cls (type[SharedQuantState])
- reduce_amax(input, axis=None, keepdims=True, squeeze_scalar=True)
Compute the absolute maximum value of a tensor.
Reduces input along the dimensions given in axis. Unless keepdims is true, the rank of the tensor is reduced by 1 for each entry in axis. If keepdims is true, the reduced dimensions are retained with length 1.
Note
Gradient computation is disabled as this function is never meant for learning.
- Parameters:
input – Input tensor
axis – The dimensions to reduce. None or int or tuple of ints. If None (the default), reduces all dimensions. Must be in the range [-rank(input), rank(input)).
keepdims – A boolean. If true, retains reduced dimensions with length 1. Default True
- Returns:
The reduced tensor.
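The axis and keepdims semantics can be checked on a small example. The sketch below is a dependency-free illustration on 2-D nested lists, not the library code: amax_2d is a hypothetical helper, it supports only axis values None, 0 and 1, and it does not model the squeeze_scalar argument.

```python
def amax_2d(rows, axis=None, keepdims=True):
    """Absolute maximum of a 2-D nested list along `axis` (None, 0, or 1)."""
    if axis is None:
        value = max(abs(x) for row in rows for x in row)
        return [[value]] if keepdims else value
    if axis == 0:
        # Reduce over rows: one value per column.
        out = [max(abs(x) for x in col) for col in zip(*rows)]
        return [out] if keepdims else out
    if axis == 1:
        # Reduce over columns: one value per row.
        out = [max(abs(x) for x in row) for row in rows]
        return [[v] for v in out] if keepdims else out
    raise ValueError("axis must be None, 0, or 1 in this sketch")


weights = [[1.0, -4.0, 2.0],
           [-3.0, 0.5, -0.25]]

print(amax_2d(weights))                         # all dims reduced, shape kept as 1x1
print(amax_2d(weights, axis=1))                 # one amax per row, shape 2x1
print(amax_2d(weights, axis=0, keepdims=False)) # one amax per column, rank reduced
```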
- reduce_sum(input, axis=None, keepdims=True)
Compute the sum of a tensor along specified axes.
Reduces input along the dimensions given in axis. Unless keepdims is true, the rank of the tensor is reduced by 1 for each entry in axis. If keepdims is true, the reduced dimensions are retained with length 1.
Note
Gradient computation is disabled as this function is never meant for learning.
- Parameters:
input – Input tensor
axis – The dimensions to reduce. None or int or tuple of ints. If None (the default), reduces all dimensions. Must be in the range [-rank(input), rank(input)).
keepdims – A boolean. If true, retains reduced dimensions with length 1. Default True
- Returns:
The reduced tensor.
- replace_function(package, name, new_func, og_func_cache_name=None)
Replace a function with a new one within a context.
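The general pattern behind a scoped replacement like this is a context manager that swaps an attribute and restores it on exit. The sketch below shows that pattern only: swap_attr and toolbox are hypothetical names, and the og_func_cache_name argument of the documented function is deliberately not modeled because its behaviour is not described here.

```python
import types
from contextlib import contextmanager


@contextmanager
def swap_attr(package, name, new_func):
    """Temporarily replace package.<name> with new_func, restoring on exit."""
    original = getattr(package, name)
    setattr(package, name, new_func)
    try:
        yield original
    finally:
        # Restore the original even if the body raises.
        setattr(package, name, original)


# A throwaway namespace stands in for a real package or module.
toolbox = types.SimpleNamespace(scale=lambda x: x * 2)

with swap_attr(toolbox, "scale", lambda x: x * 10):
    inside = toolbox.scale(3)   # uses the replacement
outside = toolbox.scale(3)      # original is back in place

print(inside, outside)
```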
- representative_weight_quantizer(module, weight_name='weight')
Return the representative weight quantizer for weight_name on module.
Handles two layouts:
- singular <name>_weight_quantizer: standard nn.Linear / _QuantLinear.
- plural <name>_weight_quantizers (nn.ModuleList): fused-experts modules (_QuantFusedExperts) hold one TensorQuantizer per expert. Per-expert formats are identical, so the first element is representative.
Returns None if no matching quantizer is found.
- Parameters:
module (Module)
weight_name (str)
- update_quant_cfg_with_kv_cache_quant(quant_cfg, kv_cache_quant_cfg)
Update the quant_cfg with the kv cache quant_cfg.
- Parameters:
quant_cfg (dict[str, Any]) – The outer quantization config dict (with "quant_cfg" and "algorithm" keys).
kv_cache_quant_cfg (list[QuantizerCfgEntry]) – A list of QuantizerCfgEntry dicts for KV cache quantization, typically some_kv_cfg["quant_cfg"].
- Returns:
A deep copy of quant_cfg with the KV cache entries appended to quant_cfg["quant_cfg"].
- Return type:
dict[str, Any]
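The documented behaviour (deep copy, then append) can be sketched in a few lines. This is an illustration only: merge_kv_cache_cfg is a hypothetical helper, and the entry contents are made-up placeholders, not real QuantizerCfgEntry fields.

```python
import copy


def merge_kv_cache_cfg(quant_cfg, kv_cache_quant_cfg):
    """Return a deep copy of quant_cfg with the KV cache entries appended."""
    merged = copy.deepcopy(quant_cfg)
    # Copy the entries too, so the result shares nothing with either input.
    merged["quant_cfg"].extend(copy.deepcopy(kv_cache_quant_cfg))
    return merged


# Placeholder entries; the real entry schema is not shown in this reference.
base = {"quant_cfg": [{"name": "weights"}], "algorithm": "max"}
kv = [{"name": "kv_cache"}]

merged = merge_kv_cache_cfg(base, kv)
print(len(base["quant_cfg"]), len(merged["quant_cfg"]))
```

Returning a copy leaves the caller's original config untouched, so the same base config can be combined with different KV cache settings.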
- weight_attr_names(module)
Get the weight param attribute names in a converted module, non-recursive.
Covers three layouts:
- standard nn.Linear: weight + weight_quantizer.
- custom per-weight quantizer (e.g. Llama4TextExperts with gate_up_proj + gate_up_proj_weight_quantizer).
- fused-experts nn.ModuleList quantizers (_QuantFusedExperts with gate_up_proj + gate_up_proj_weight_quantizers plural list).
- Parameters:
module (Module)
- Return type:
Generator[str, None, None]