watersic_kv

WaterSIC KV-cache quantization algorithm.

Classes

WaterSICKVHelper

Hook-based helper that captures Q/K activations and runs WaterSIC quantisation.

WaterSICKVState

Per-layer quantisation state produced by WaterSICKVHelper.quantize().

ModeloptConfig WaterSICKVCalibConfig

Bases: QuantizeAlgorithmConfig

Configuration for WaterSIC KV-cache quantization.

WaterSIC (Water-filling Successive Interference Cancellation) is a rate-adaptive quantization method for KV-cache compression. It applies the ZSIC algorithm with optional KL-aware importance weighting and LMMSE shrinkage correction to minimize attention-output distortion at a target bits-per-element budget.

Reference: “WaterSIC: Water-filling Successive Interference Cancellation for KV-Cache Quantization” (2024).

Default config (JSON):

{
   "method": "watersic_kv",
   "moe_calib_experts_ratio": null,
   "use_sequential": false,
   "target_rate": 2.0,
   "kl_aware": false,
   "importance_clip": 50.0,
   "use_lmmse": true,
   "n_rescaler_iters": 0,
   "sample_frac": null
}
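As a rough illustration of how the defaults and the documented field constraints fit together, the sketch below merges overrides into a plain dictionary copied from the JSON above. `make_watersic_kv_config` is a hypothetical helper written for this example, not part of the library; the real `WaterSICKVCalibConfig` class performs its own validation.

```python
import json

# Default values copied from the JSON shown above.
DEFAULT_WATERSIC_KV_CONFIG = {
    "method": "watersic_kv",
    "moe_calib_experts_ratio": None,
    "use_sequential": False,
    "target_rate": 2.0,
    "kl_aware": False,
    "importance_clip": 50.0,
    "use_lmmse": True,
    "n_rescaler_iters": 0,
    "sample_frac": None,
}


def make_watersic_kv_config(**overrides):
    """Hypothetical helper: merge overrides and re-check the documented constraints."""
    unknown = set(overrides) - set(DEFAULT_WATERSIC_KV_CONFIG)
    if unknown:
        raise KeyError(f"unknown fields: {sorted(unknown)}")
    cfg = {**DEFAULT_WATERSIC_KV_CONFIG, **overrides}
    if cfg["target_rate"] <= 0.0:
        raise ValueError("target_rate must be > 0")
    if cfg["importance_clip"] <= 0.0:
        raise ValueError("importance_clip must be > 0")
    if cfg["n_rescaler_iters"] < 0:
        raise ValueError("n_rescaler_iters must be >= 0")
    if cfg["use_sequential"]:
        raise ValueError("use_sequential must be False for WaterSIC")
    return cfg


cfg = make_watersic_kv_config(target_rate=3.0, kl_aware=True)
print(json.dumps(cfg, indent=3))
```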

field importance_clip: float

Maximum ratio by which a single token’s importance weight may exceed the mean weight. Clips extreme outlier tokens to prevent them from dominating the Hessian estimate.

Constraints:
  • gt = 0.0
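The clipping rule described for importance_clip can be sketched as follows. This is a minimal pure-Python illustration that assumes the cap is computed from the mean of the unclipped weights; `clip_importance` is a hypothetical name and the library may compute the cap differently.

```python
def clip_importance(weights, importance_clip=50.0):
    """Cap each weight at importance_clip times the mean weight (assumed rule)."""
    mean_w = sum(weights) / len(weights)
    cap = importance_clip * mean_w
    return [min(w, cap) for w in weights]


# 99 ordinary tokens and one extreme outlier.
weights = [1.0] * 99 + [1000.0]
clipped = clip_importance(weights, importance_clip=50.0)
print(max(weights), max(clipped))
```

Only the outlier is changed; weights below the cap pass through untouched.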

field kl_aware: bool

When True, per-token importance weights derived from the attention distribution are folded into the Hessian so that tokens with higher attention mass receive tighter quantization.
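One way to picture the effect of importance weighting is a weighted second-moment matrix of the key vectors. The sketch below assumes the Hessian is of the form sum_t w_t x_t x_t^T normalised by the total weight; that form, and the name `weighted_hessian`, are assumptions made for illustration rather than the library's definition.

```python
def weighted_hessian(rows, weights):
    """Weighted second moment: sum_t w_t * x_t x_t^T / sum_t w_t (assumed form)."""
    d = len(rows[0])
    total = sum(weights)
    H = [[0.0] * d for _ in range(d)]
    for x, w in zip(rows, weights):
        for i in range(d):
            for j in range(d):
                H[i][j] += w * x[i] * x[j]
    return [[v / total for v in row] for row in H]


tokens = [[1.0, 0.0], [0.0, 1.0]]
uniform = weighted_hessian(tokens, [1.0, 1.0])   # kl_aware=False analogue
weighted = weighted_hessian(tokens, [3.0, 1.0])  # first token has more attention mass
print(uniform, weighted)
```

With the heavier weight on the first token, its direction dominates the matrix, which is the mechanism by which high-attention tokens would receive tighter quantization.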

field method: Literal['watersic_kv']

Fixed identifier for the WaterSIC KV-cache calibration method.

field n_rescaler_iters: int

Number of coordinate-descent iterations for the diagonal rescaler that adjusts per-column scale factors after LMMSE. Set to 0 to disable the rescaler (faster but slightly higher distortion).

Constraints:
  • ge = 0

field sample_frac: float | None

If set, only this fraction of rows (KV heads) is used during the binary search for c. All rows are then quantized with the value of c found. This speeds up calibration on large KV caches at a small cost in accuracy.
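A minimal sketch of the row subsampling step, assuming uniform random sampling without replacement and a valid range of (0, 1]. Both are assumptions, since the selection strategy is not documented here, and `subsample_rows` is a hypothetical helper.

```python
import random


def subsample_rows(rows, sample_frac=None, seed=0):
    """Return the rows used for the search; None keeps every row."""
    if sample_frac is None:
        return list(rows)
    if not 0.0 < sample_frac <= 1.0:
        raise ValueError("sample_frac must lie in (0, 1]")  # assumed valid range
    n_keep = max(1, round(sample_frac * len(rows)))
    rng = random.Random(seed)
    # Sort the indices so the subset keeps the original row order.
    idx = sorted(rng.sample(range(len(rows)), n_keep))
    return [rows[i] for i in idx]


all_rows = list(range(100))
search_rows = subsample_rows(all_rows, sample_frac=0.25)
print(len(search_rows), len(subsample_rows(all_rows)))
```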

field target_rate: float

Target average number of bits per quantized KV-cache element. A binary search over the ZSIC damping parameter c adjusts c until the achieved rate matches this target.

Constraints:
  • gt = 0.0
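The rate-targeting search can be sketched as a bisection on c. The example below does not run ZSIC: it substitutes a toy rate model, mean(max(0, log2(sigma_i / c))), chosen only because it decreases monotonically in c. `toy_rate` and `search_c` are hypothetical names, and the assumption that rate falls as c grows is ours.

```python
import math


def toy_rate(c, sigmas):
    """Stand-in rate model (not ZSIC): average bits per element for a given c."""
    return sum(max(0.0, math.log2(s / c)) for s in sigmas) / len(sigmas)


def search_c(target_rate, sigmas, lo=1e-6, hi=1e6, iters=60):
    """Bisect c on a log scale until toy_rate(c) meets target_rate."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint
        if toy_rate(mid, sigmas) > target_rate:
            lo = mid  # rate too high: c must grow
        else:
            hi = mid  # rate at or below target: c can shrink
    return hi


sigmas = [4.0, 2.0, 1.0, 0.5]  # made-up per-column scales
c = search_c(2.0, sigmas)
print(c, toy_rate(c, sigmas))
```

Bisecting on a log scale keeps the search well behaved when the plausible range of c spans many orders of magnitude.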

field use_lmmse: bool

When True, the LMMSE (Linear Minimum Mean-Squared Error) shrinkage correction is applied after ZSIC quantization to partially undo quantization bias and reduce reconstruction NMSE.
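To make the shrinkage idea concrete, the sketch below computes the least-squares scalar gain for one column, gamma = <x, q> / <q, q>, which minimises the squared error between x and gamma * q. Plain rounding stands in for ZSIC here, and `lmmse_gain` is a hypothetical name; the library's estimator may differ in detail.

```python
def lmmse_gain(x, q):
    """Scalar gain minimising sum((x_i - gamma * q_i)^2) for one column."""
    denom = sum(v * v for v in q)
    if denom == 0.0:
        return 1.0  # all-zero column: nothing to rescale
    return sum(a * b for a, b in zip(x, q)) / denom


def sq_error(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


x = [0.9, 1.8, -0.9]
q = [float(round(v)) for v in x]  # plain rounding as a stand-in quantizer
gamma = lmmse_gain(x, q)
before = sq_error(x, q)
after = sq_error(x, [gamma * v for v in q])
print(gamma, before, after)
```

Rounding pushes every value away from zero in this example, so the gain comes out below 1 and shrinks the reconstruction back toward the original values.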

field use_sequential: bool

Must be False for WaterSIC. Unlike weight quantization, KV-cache quantization does not have progressive error accumulation between layers, so sequential calibration is not needed.

class WaterSICKVHelper

Bases: object

Hook-based helper that captures Q/K activations and runs WaterSIC quantisation.

Usage:

helper = WaterSICKVHelper(quant_attn_module, "layer.3")
helper.setup()
# ... run calibration forward passes ...
state = helper.quantize(target_rate=4.0)
helper.cleanup()
helper.free()
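Because setup() patches the module instance, it can be worth making sure cleanup() and free() run even if calibration fails. The wrapper below is one possible pattern built only on the four methods documented here. `calibrate_layer` is a hypothetical function, and `RecordingHelper` is a stand-in used for demonstration, not the real WaterSICKVHelper.

```python
def calibrate_layer(helper, run_calibration, **quantize_kwargs):
    """Run setup -> calibration -> quantize, always cleaning up afterwards."""
    helper.setup()
    try:
        run_calibration()
        return helper.quantize(**quantize_kwargs)
    finally:
        helper.cleanup()
        helper.free()


class RecordingHelper:
    """Stand-in that only records call order; NOT the library class."""

    def __init__(self):
        self.calls = []

    def setup(self):
        self.calls.append("setup")

    def quantize(self, **kwargs):
        self.calls.append("quantize")
        return kwargs

    def cleanup(self):
        self.calls.append("cleanup")

    def free(self):
        self.calls.append("free")


demo = RecordingHelper()
result = calibrate_layer(demo, lambda: demo.calls.append("forward"), target_rate=4.0)
print(demo.calls)
```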
__init__(module, name, kl_aware=False, importance_clip=50.0)

Initialize helper for a single attention module.

Parameters:
  • name (str)

  • kl_aware (bool)

  • importance_clip (float)

cleanup()

Remove the instance-level override, restoring the class staticmethod.

free()

Release collected calibration data.

quantize(target_rate=4.0, use_lmmse=True, n_rescaler_iters=0, sample_frac=None)

Run WaterSIC quantisation on the collected key activations.

Parameters:
  • target_rate (float) – Target coding rate in bits per element.

  • use_lmmse (bool) – Whether to apply LMMSE gain correction.

  • n_rescaler_iters (int) – Number of alternating rescaler iterations (0 = disable).

  • sample_frac (float | None) – Fraction of rows used by binary_search_c(). None uses all rows.

Return type:

WaterSICKVState

setup()

Patch _quantized_attention on the module instance to capture Q/K.

class WaterSICKVState

Bases: object

Per-layer quantisation state produced by WaterSICKVHelper.quantize().

Z: Tensor

Integer code-book indices.

__init__(Z, alpha, gamma, perm, rate)
Parameters:
  • Z (Tensor)

  • alpha (Tensor)

  • gamma (Tensor)

  • perm (Tensor | None)

  • rate (float)

Return type:

None

alpha: Tensor

Per-column step sizes.

gamma: Tensor

Per-column LMMSE gains.

perm: Tensor | None

Column permutation (or None).

rate: float

Achieved coding rate (bits per element).
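The fields above suggest how an approximate key matrix might be rebuilt from a state. The sketch below is a naive per-column reconstruction, x_hat[:, j] = gamma[j] * alpha[j] * Z[:, j], followed by undoing the column permutation. Both the formula and the direction in which perm is applied are assumptions made for illustration; the library's actual dequantization may include further steps. Plain lists stand in for tensors, and `naive_dequantize` is a hypothetical name.

```python
def naive_dequantize(Z, alpha, gamma, perm=None):
    """Assumed reconstruction: scale integer codes per column, then un-permute."""
    n_cols = len(alpha)
    out = []
    for codes in Z:
        scaled = [gamma[j] * alpha[j] * codes[j] for j in range(n_cols)]
        if perm is None:
            out.append(scaled)
            continue
        # Assumption: permuted column j came from original column perm[j].
        row = [0.0] * n_cols
        for j, src in enumerate(perm):
            row[src] = scaled[j]
        out.append(row)
    return out


Z = [[1, -2], [0, 3]]   # integer codes
alpha = [0.5, 0.25]     # per-column step sizes
gamma = [1.0, 0.8]      # per-column gains
approx = naive_dequantize(Z, alpha, gamma, perm=[1, 0])
print(approx)
```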