watersic_kv
WaterSIC KV-cache quantization algorithm.
Classes

| WaterSICKVHelper | Hook-based helper that captures Q/K activations and runs WaterSIC quantisation. |
| WaterSICKVState | Per-layer quantisation state produced by WaterSICKVHelper.quantize(). |
- ModeloptConfig WaterSICKVCalibConfig
Bases: QuantizeAlgorithmConfig

Configuration for WaterSIC KV-cache quantization.
WaterSIC (Water-filling Successive Interference Cancellation) is a rate-adaptive quantization method for KV-cache compression. It applies the ZSIC algorithm with optional KL-aware importance weighting and LMMSE shrinkage correction to minimize attention-output distortion at a target bits-per-element budget.
Reference: “WaterSIC: Water-filling Successive Interference Cancellation for KV-Cache Quantization” (2024).
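The source does not spell out the exact ZSIC rate-allocation rule, but the "water-filling" part of the name suggests the classic reverse water-filling pattern: columns with larger variance receive more bits, with a water level chosen by binary search to hit the mean bit budget. The sketch below is a generic illustration of that rule, not the library's implementation; `waterfill_rates` and its rate formula are assumptions for exposition.

```python
import numpy as np

def waterfill_rates(variances, target_rate, iters=60):
    """Reverse water-filling: allocate more bits to higher-variance columns.

    Hypothetical sketch -- the actual ZSIC allocation in WaterSIC may
    differ.  Uses the rate-distortion rule
    R_i = max(0, 0.5 * log2(var_i / theta)), with the water level theta
    chosen by binary search so the mean rate hits `target_rate`.
    """
    variances = np.asarray(variances, dtype=float)
    lo, hi = 1e-12, variances.max()
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        rates = np.maximum(0.0, 0.5 * np.log2(variances / theta))
        if rates.mean() > target_rate:
            lo = theta  # rates too high -> raise the water level
        else:
            hi = theta
    return rates

# High-variance columns get ~3.7 bits, the near-zero column gets 0.
rates = waterfill_rates([4.0, 1.0, 0.25, 0.01], target_rate=2.0)
```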
- Default config (JSON):
{ "method": "watersic_kv", "moe_calib_experts_ratio": null, "use_sequential": false, "target_rate": 2.0, "kl_aware": false, "importance_clip": 50.0, "use_lmmse": true, "n_rescaler_iters": 0, "sample_frac": null }
- field importance_clip: float
Maximum ratio by which a single token’s importance weight may exceed the mean weight. Clips extreme outlier tokens to prevent them from dominating the Hessian estimate.
- Constraints:
gt = 0.0
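The clipping rule described above is a one-liner; here is a minimal sketch (the helper name `clip_importance` is ours, not the library's). Note that the cap is relative to the mean weight, so a small `importance_clip` is used below to make the effect visible:

```python
import numpy as np

def clip_importance(weights, importance_clip=50.0):
    """Clip per-token importance weights at `importance_clip` x the mean.

    Hypothetical sketch of the documented rule: no token's weight may
    exceed the mean weight by more than this ratio, so a single outlier
    token cannot dominate the Hessian estimate.
    """
    w = np.asarray(weights, dtype=float)
    return np.minimum(w, importance_clip * w.mean())

# Outlier token (weight 500) is capped at 2 x mean = 251.5.
w = clip_importance([1.0, 1.0, 1.0, 500.0], importance_clip=2.0)
```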
- field kl_aware: bool
When True, per-token importance weights derived from the attention distribution are folded into the Hessian so that tokens with higher attention mass receive tighter quantization.
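"Folded into the Hessian" most plausibly means a weighted Gram matrix: instead of H = XᵀX over calibration tokens, each token row is scaled by its importance weight. This is a sketch under that assumption; the function name and the weighting scheme are illustrative, not the library's code.

```python
import numpy as np

def weighted_hessian(X, token_weights):
    """Importance-weighted Hessian estimate H = X^T diag(w) X.

    Hypothetical sketch of kl_aware weighting: tokens with higher
    attention mass (larger w_i) contribute more to H, so the quantizer
    is pulled toward preserving them more tightly.
    """
    X = np.asarray(X, dtype=float)               # (n_tokens, dim)
    w = np.asarray(token_weights, dtype=float)   # (n_tokens,)
    return X.T @ (w[:, None] * X)                # == X.T @ diag(w) @ X

# Third token counted twice as heavily as the first two.
H = weighted_hessian([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 1.0, 2.0])
```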
- field method: Literal['watersic_kv']
Fixed identifier for the WaterSIC KV-cache calibration method.
- field n_rescaler_iters: int
Number of coordinate-descent iterations for the diagonal rescaler that adjusts per-column scale factors after LMMSE. Set to 0 to disable the rescaler (faster but slightly higher distortion).
- Constraints:
ge = 0
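For a purely diagonal rescaler, the per-column scale minimising the reconstruction error has a closed form. The sketch below shows that closed form; the real rescaler reportedly alternates coordinate-descent passes (presumably interleaved with other updates), whereas with a diagonal-only model the columns decouple, so a single pass already converges. Names and structure here are illustrative.

```python
import numpy as np

def rescale_columns(X, X_hat, n_iters=1):
    """Per-column scale correction after LMMSE.

    Hypothetical sketch: for each column j, choose the scalar s_j
    minimising ||X[:, j] - s_j * X_hat[:, j]||^2, i.e.
    s_j = <x_j, xhat_j> / <xhat_j, xhat_j>.
    """
    X, X_hat = np.asarray(X, float), np.asarray(X_hat, float)
    s = np.ones(X.shape[1])
    for _ in range(n_iters):
        num = np.einsum("ij,ij->j", X, X_hat)
        den = np.einsum("ij,ij->j", X_hat, X_hat) + 1e-12
        s = num / den
    return s

# Reconstruction is a factor 2 (resp. 3) too small per column.
s = rescale_columns([[2.0, 0.0], [0.0, 3.0]], [[1.0, 0.0], [0.0, 1.0]])
```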
- field sample_frac: float | None
If set, only this fraction of rows (KV heads) is used during the binary search for c; all rows are then quantized with the resulting c. This speeds up calibration on large KV caches at a small accuracy cost.
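The subsampling itself is just a random row selection; a minimal sketch (helper name and seeding are ours):

```python
import numpy as np

def subsample_rows(X, sample_frac, seed=0):
    """Pick a random fraction of rows for the calibration-time search.

    Hypothetical sketch: the binary search for c runs on this subset;
    the c it finds is then applied when quantizing all rows.
    """
    X = np.asarray(X)
    if sample_frac is None:
        return X
    rng = np.random.default_rng(seed)
    k = max(1, int(round(sample_frac * X.shape[0])))
    idx = rng.choice(X.shape[0], size=k, replace=False)
    return X[idx]

# A quarter of 100 rows -> 25 rows fed to the search.
Y = subsample_rows(np.zeros((100, 4)), sample_frac=0.25)
```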
- field target_rate: float
Average number of bits per quantized KV-cache element. The binary search over the ZSIC damping parameter c is driven to hit this rate.
- Constraints:
gt = 0.0
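The only property the binary search needs is that the achieved rate is monotone in c (stronger damping gives coarser codes and fewer bits). The sketch below bisects geometrically over a wide range; the toy `rate_of_c` lambda stands in for a full quantize-and-measure pass and is purely illustrative.

```python
import numpy as np

def search_c(rate_of_c, target_rate, lo=1e-6, hi=1e6, iters=50):
    """Binary search for the ZSIC damping parameter c.

    Hypothetical sketch: assumes only that the achieved rate decreases
    monotonically in c.  Bisects in log space since c can span many
    orders of magnitude.
    """
    for _ in range(iters):
        c = np.sqrt(lo * hi)            # geometric midpoint
        if rate_of_c(c) > target_rate:
            lo = c                       # rate too high -> more damping
        else:
            hi = c
    return np.sqrt(lo * hi)

# Toy monotone rate model: rate(c) = 8 / (1 + c), so rate = 2 at c = 3.
c_star = search_c(lambda c: 8.0 / (1.0 + c), target_rate=2.0)
```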
- field use_lmmse: bool
When True, the LMMSE (Linear Minimum Mean-Squared Error) shrinkage correction is applied after ZSIC quantization to partially undo quantization bias and reduce reconstruction NMSE.
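Under the standard additive-noise model x̂ = x + n with independent quantization noise, the LMMSE shrinkage gain has the textbook closed form below. Whether WaterSIC estimates the variances this way is not stated in the source; the formula itself is the standard one.

```python
def lmmse_gain(signal_var, quant_noise_var):
    """Per-column LMMSE shrinkage gain.

    For x_hat = x + n with independent noise, the gain
    gamma = var(x) / (var(x) + var(n)) minimises E[(x - gamma*x_hat)^2].
    E.g. var(x)=4, var(n)=1: raw MSE is 1.0, while shrinkage gives
    (1-gamma)^2*4 + gamma^2*1 = 0.8, a strictly smaller error.
    """
    return signal_var / (signal_var + quant_noise_var)

gamma = lmmse_gain(4.0, 1.0)  # shrink reconstructions by factor 0.8
```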
- field use_sequential: bool
Must be False for WaterSIC. Unlike weight quantization, KV-cache quantization does not have progressive error accumulation between layers, so sequential calibration is not needed.
- class WaterSICKVHelper
Bases: object

Hook-based helper that captures Q/K activations and runs WaterSIC quantisation.
Usage:

    helper = WaterSICKVHelper(quant_attn_module, "layer.3")
    helper.setup()
    # ... run calibration forward passes ...
    state = helper.quantize(target_rate=4.0)
    helper.cleanup()
    helper.free()
- __init__(module, name, kl_aware=False, importance_clip=50.0)
Initialize helper for a single attention module.
- Parameters:
name (str)
kl_aware (bool)
importance_clip (float)
- cleanup()
Remove the instance-level override, restoring the class staticmethod.
- free()
Release collected calibration data.
- quantize(target_rate=4.0, use_lmmse=True, n_rescaler_iters=0, sample_frac=None)
Run WaterSIC quantisation on the collected key activations.
- Parameters:
target_rate (float) – Target coding rate in bits per element.
use_lmmse (bool) – Whether to apply LMMSE gain correction.
n_rescaler_iters (int) – Number of alternating rescaler iterations (0 = disable).
sample_frac (float) – Fraction of rows used by binary_search_c().
- Returns:
WaterSICKVState
- Return type:
WaterSICKVState
- setup()
Patch _quantized_attention on the module instance to capture Q/K.
- class WaterSICKVState
Bases: object

Per-layer quantisation state produced by WaterSICKVHelper.quantize().

- Z: Tensor
Integer code-book indices.
- __init__(Z, alpha, gamma, perm, rate)
- Parameters:
Z (Tensor)
alpha (Tensor)
gamma (Tensor)
perm (Tensor | None)
rate (float)
- Return type:
None
- alpha: Tensor
Per-column step sizes.
- gamma: Tensor
Per-column LMMSE gains.
- perm: Tensor | None
Column permutation (or None).
- rate: float
Achieved coding rate (bits per element).
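Putting the state's fields together, dequantization plausibly maps the integer codes back through the per-column step sizes and LMMSE gains and undoes the column permutation. The sketch below is our reading of the documented fields (illustrated with NumPy for self-containment; the real state holds torch Tensors), not the library's decoder:

```python
import numpy as np

def dequantize(Z, alpha, gamma, perm=None):
    """Reconstruct keys from WaterSICKVState-like fields.

    Hypothetical sketch: integer codes Z are scaled by per-column step
    sizes alpha and per-column LMMSE gains gamma; if columns were
    permuted at encode time, the inverse permutation restores the
    original order.
    """
    X_hat = Z.astype(float) * alpha * gamma      # per-column broadcast
    if perm is not None:
        inv = np.empty_like(perm)
        inv[perm] = np.arange(perm.size)         # invert the permutation
        X_hat = X_hat[:, inv]
    return X_hat

# Codes [2, 8] with steps [0.5, 0.25] decode to [1.0, 2.0];
# undoing the swap permutation yields [2.0, 1.0].
keys = dequantize(
    np.array([[2, 8]]),
    np.array([0.5, 0.25]),
    np.array([1.0, 1.0]),
    perm=np.array([1, 0]),
)
```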