watersic_kv

WaterSIC KV-cache quantization algorithm.

Classes

`WaterSICKVHelper`	Hook-based helper that captures Q/K activations and runs WaterSIC quantisation.
`WaterSICKVState`	Per-layer quantisation state produced by `WaterSICKVHelper.quantize()`.

ModeloptConfig WaterSICKVCalibConfig

Bases: QuantizeAlgorithmConfig

Configuration for WaterSIC KV-cache quantization.

WaterSIC (Water-filling Successive Interference Cancellation) is a rate-adaptive quantization method for KV-cache compression. It applies the ZSIC algorithm with optional KL-aware importance weighting and LMMSE shrinkage correction to minimize attention-output distortion at a target bits-per-element budget.

Reference: “WaterSIC: Water-filling Successive Interference Cancellation for KV-Cache Quantization” (2024).

Default config (JSON):

{
   "method": "watersic_kv",
   "moe_calib_experts_ratio": null,
   "use_sequential": false,
   "target_rate": 2.0,
   "kl_aware": false,
   "importance_clip": 50.0,
   "use_lmmse": true,
   "n_rescaler_iters": 0,
   "sample_frac": null
}
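
As a rough illustration of how these defaults interact with the documented field constraints, the sketch below merges user overrides into the default dictionary and checks the stated bounds. The helper `make_watersic_kv_config` is hypothetical and not part of the library; the real `WaterSICKVCalibConfig` performs its own validation.

```python
import json

# Defaults copied from the JSON shown above.
DEFAULTS = json.loads("""
{
   "method": "watersic_kv",
   "moe_calib_experts_ratio": null,
   "use_sequential": false,
   "target_rate": 2.0,
   "kl_aware": false,
   "importance_clip": 50.0,
   "use_lmmse": true,
   "n_rescaler_iters": 0,
   "sample_frac": null
}
""")


def make_watersic_kv_config(**overrides):
    """Hypothetical helper: merge overrides and check the documented constraints."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    cfg = {**DEFAULTS, **overrides}
    if cfg["method"] != "watersic_kv":
        raise ValueError("method is fixed to 'watersic_kv'")
    if cfg["use_sequential"]:
        raise ValueError("use_sequential must be False for WaterSIC")
    if not cfg["target_rate"] > 0.0:
        raise ValueError("target_rate must be > 0")
    if not cfg["importance_clip"] > 0.0:
        raise ValueError("importance_clip must be > 0")
    if cfg["n_rescaler_iters"] < 0:
        raise ValueError("n_rescaler_iters must be >= 0")
    return cfg


cfg = make_watersic_kv_config(target_rate=3.0, kl_aware=True)
```

Fields that are not overridden keep their default values, so `cfg["use_lmmse"]` stays `True` here.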

field importance_clip: float

Maximum ratio by which a single token’s importance weight may exceed the mean weight. Clips extreme outlier tokens to prevent them from dominating the Hessian estimate.

Constraints:

gt = 0.0
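
The clipping rule described for `importance_clip` can be sketched as follows. This is a minimal illustration in plain Python, assuming the cap is computed once as `importance_clip` times the mean of the unclipped weights; whether the library recomputes the mean after clipping is not stated here, and `clip_importance` is a hypothetical name.

```python
def clip_importance(weights, importance_clip=50.0):
    """Cap each weight at importance_clip times the mean weight (assumed rule)."""
    mean_w = sum(weights) / len(weights)
    cap = importance_clip * mean_w
    return [min(w, cap) for w in weights]


weights = [1.0, 1.0, 1.0, 97.0]  # mean weight is 25.0
clipped = clip_importance(weights, importance_clip=2.0)  # cap is 50.0
```

Only the outlier token is affected: its weight drops from 97.0 to the cap of 50.0, while the other weights are unchanged.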

field kl_aware: bool

When True, per-token importance weights derived from the attention distribution are folded into the Hessian so that tokens with higher attention mass receive tighter quantization.
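
To make the idea of folding importance weights into the Hessian concrete, here is a small sketch of a weighted second-moment matrix, H = Σᵢ wᵢ xᵢ xᵢᵀ / Σᵢ wᵢ. The function name, the normalization by the total weight, and the choice of which activation vectors play the role of xᵢ are all assumptions made for illustration, not the library's implementation.

```python
def weighted_hessian(rows, weights=None):
    """Return H = sum_i w_i * x_i x_i^T / sum_i w_i for row vectors x_i."""
    n, d = len(rows), len(rows[0])
    if weights is None:
        weights = [1.0] * n  # uniform weights, i.e. no importance weighting
    total = sum(weights)
    H = [[0.0] * d for _ in range(d)]
    for x, w in zip(rows, weights):
        for a in range(d):
            for b in range(d):
                H[a][b] += w * x[a] * x[b] / total
    return H


X = [[1.0, 0.0], [0.0, 2.0]]
H_uniform = weighted_hessian(X)
H_weighted = weighted_hessian(X, weights=[3.0, 1.0])
```

With the first token weighted three times as heavily, the first diagonal entry grows relative to the uniform case, so a Hessian-aware quantizer would penalize error along that direction more.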

field method: Literal['watersic_kv']

Fixed identifier for the WaterSIC KV-cache calibration method.

field n_rescaler_iters: int

Number of coordinate-descent iterations for the diagonal rescaler that adjusts per-column scale factors after LMMSE. Set to 0 to disable the rescaler (faster but slightly higher distortion).

Constraints:

ge = 0

field sample_frac: float | None

If set, only this fraction of rows (KV heads) is used during the binary search for c. All rows are then quantized with the resulting c. Speeds up calibration on large KV caches at a small accuracy cost.
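
A minimal sketch of the row subsampling implied by `sample_frac`, assuming rows are drawn uniformly without replacement and that `None` means "use every row". The helper name `sample_rows` and the seeding scheme are hypothetical.

```python
import random


def sample_rows(rows, sample_frac=None, seed=0):
    """Return the subset of rows used for the search (all rows if sample_frac is None)."""
    if sample_frac is None:
        return list(rows)
    k = max(1, int(len(rows) * sample_frac))  # keep at least one row
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(rows)), k))
    return [rows[i] for i in picked]


rows = [[float(i), float(i) + 0.5] for i in range(8)]
subset = sample_rows(rows, sample_frac=0.25)  # 2 of the 8 rows
everything = sample_rows(rows)                # sample_frac=None keeps all rows
```

In this picture the search for c would run on `subset` only, and the final quantization would still be applied to all of `rows`.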

field target_rate: float

Average number of bits per quantized KV-cache element. The binary search over the ZSIC damping parameter c is driven to hit this rate.

Constraints:

gt = 0.0
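
The rate-targeting search can be sketched with a toy model. The code below is an illustration only: `toy_rate` is an assumed stand-in in which c acts like a step size and each column with spread `sigma` costs `max(0, log2(sigma / c))` bits, and `search_c_for_rate` is a hypothetical bisection, not the library's `binary_search_c()`. The real search measures the rate of the actual ZSIC output instead.

```python
import math


def toy_rate(c, sigmas):
    """Toy rate model (an assumption): average bits per element at parameter c."""
    return sum(max(0.0, math.log2(s / c)) for s in sigmas) / len(sigmas)


def search_c_for_rate(sigmas, target_rate, lo=1e-6, hi=1e6, iters=60):
    """Bisect on log(c) until toy_rate(c) is close to target_rate."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the bracket
        if toy_rate(mid, sigmas) > target_rate:
            lo = mid  # rate too high: move towards larger c
        else:
            hi = mid  # rate too low or on target: move towards smaller c
    return math.sqrt(lo * hi)


sigmas = [4.0, 2.0, 1.0, 0.5]
c = search_c_for_rate(sigmas, target_rate=2.0)
achieved = toy_rate(c, sigmas)
```

The bisection works because the toy rate is non-increasing in c; the same bracketing idea applies to any rate measure that is monotone in the search parameter.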

field use_lmmse: bool

When True, the LMMSE (Linear Minimum Mean-Squared Error) shrinkage correction is applied after ZSIC quantization to partially undo quantization bias and reduce reconstruction NMSE.
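
For intuition, the scalar gain that minimizes the squared error between a column x and its scaled reconstruction γ·x̂ is γ = ⟨x, x̂⟩ / ⟨x̂, x̂⟩. The sketch below applies this to a single column with plain rounding standing in for ZSIC; the function names are hypothetical and the library's per-column computation may differ in detail.

```python
def lmmse_gain(x, x_hat):
    """Least-squares gain gamma minimizing sum((x - gamma * x_hat) ** 2)."""
    num = sum(a * b for a, b in zip(x, x_hat))
    den = sum(b * b for b in x_hat)
    return num / den if den > 0.0 else 1.0  # all-zero column: leave unscaled


def sq_err(x, y):
    """Sum of squared differences between two equally long sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


x = [0.6, -0.7, 1.6, 0.2]
x_hat = [float(round(v)) for v in x]  # unit-step rounding as a stand-in quantizer
gamma = lmmse_gain(x, x_hat)
corrected = [gamma * v for v in x_hat]
```

Here rounding inflates the magnitudes, so the gain comes out below 1 and the corrected reconstruction has a smaller squared error than the raw rounded values.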

field use_sequential: bool

Must be False for WaterSIC. Unlike weight quantization, KV-cache quantization does not have progressive error accumulation between layers, so sequential calibration is not needed.

class WaterSICKVHelper

Bases: object

Hook-based helper that captures Q/K activations and runs WaterSIC quantisation.

Usage:

helper = WaterSICKVHelper(quant_attn_module, "layer.3")
helper.setup()
# ... run calibration forward passes ...
state = helper.quantize(target_rate=4.0)
helper.cleanup()
helper.free()

__init__(module, name, kl_aware=False, importance_clip=50.0)

Initialize helper for a single attention module.

Parameters:

module – Attention module to instrument.
name (str)
kl_aware (bool)
importance_clip (float)

cleanup(): Remove the instance-level override, restoring the class staticmethod.

free(): Release collected calibration data.

quantize(target_rate=4.0, use_lmmse=True, n_rescaler_iters=0, sample_frac=None)

Run WaterSIC quantisation on the collected key activations.

Parameters:

target_rate (float) – Target coding rate in bits per element.
use_lmmse (bool) – Whether to apply LMMSE gain correction.
n_rescaler_iters (int) – Number of alternating rescaler iterations (0 = disable).
sample_frac (float | None) – Fraction of rows used by binary_search_c(); None uses all rows.

Returns:

Per-layer quantisation state.

Return type:

WaterSICKVState

setup(): Patch _quantized_attention on the module instance to capture Q/K.

class WaterSICKVState

Bases: object

Per-layer quantisation state produced by WaterSICKVHelper.quantize().

Z: Tensor: Integer code-book indices.

__init__(Z, alpha, gamma, perm, rate)

Parameters:

Z (Tensor)
alpha (Tensor)
gamma (Tensor)
perm (Tensor | None)
rate (float)

Return type:

None

alpha: Tensor: Per-column step sizes.

gamma: Tensor: Per-column LMMSE gains.

perm: Tensor | None: Column permutation (or None).

rate: float: Achieved coding rate (bits per element).
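
As a rough picture of how these fields might combine, the sketch below rebuilds approximate values from a state under two loudly stated assumptions: each element is reconstructed as `gamma[j] * alpha[j] * Z[i][j]`, and `perm[j]` names the original column position of stored column `j`. Neither assumption is confirmed by this page, and `dequantize` is a hypothetical helper using nested lists in place of tensors.

```python
def dequantize(Z, alpha, gamma, perm=None):
    """Assumed reconstruction: x_hat[i][j] = gamma[j] * alpha[j] * Z[i][j]."""
    rows = [[g * a * z for z, a, g in zip(row, alpha, gamma)] for row in Z]
    if perm is None:
        return rows
    # Assumption: perm[j] is the original column index of stored column j.
    out = [[0.0] * len(perm) for _ in rows]
    for i, row in enumerate(rows):
        for j, dst in enumerate(perm):
            out[i][dst] = row[j]
    return out


Z = [[2, -1], [0, 3]]   # integer codes
alpha = [0.5, 0.25]     # per-column step sizes
gamma = [1.0, 0.8]      # per-column LMMSE gains
x_hat = dequantize(Z, alpha, gamma, perm=[1, 0])
```

With `perm=[1, 0]` the two stored columns are swapped back into their assumed original order after scaling.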