config

Configuration for KV cache sparsity modes.

ModeloptConfig TriAttentionConfig

Bases: ModeloptBaseConfig

Configuration for TriAttention KV cache eviction.

TriAttention scores cached KV entries using a trigonometric model derived from pre-RoPE Q/K concentration. Calibration computes per-head frequency statistics; at runtime, the serving engine scores and evicts tokens periodically.

budget is the absolute token count to retain per head after each pruning round. Runtime compression follows slack/sawtooth semantics: grow until budget + prune_interval tokens are present, then evict back to budget.
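The sawtooth behaviour described above can be illustrated with a small simulation. This is a minimal sketch, not the serving engine's code: the function name `simulate_cache_sizes` is hypothetical, and it assumes one KV entry is appended per generated token and that eviction fires exactly when the cache reaches budget + prune_interval.

```python
def simulate_cache_sizes(budget: int, prune_interval: int, num_tokens: int) -> list[int]:
    """Return the per-head cache size observed at each step, before any eviction.

    Hypothetical helper: assumes one new KV entry per generated token and
    eviction exactly when the cache holds budget + prune_interval entries.
    """
    sizes = []
    cached = 0
    for _ in range(num_tokens):
        cached += 1            # a new token's KV entry is appended
        sizes.append(cached)   # size seen at this step (the sawtooth peak included)
        if cached == budget + prune_interval:
            cached = budget    # evict back down to the budget
    return sizes


sizes = simulate_cache_sizes(budget=4, prune_interval=2, num_tokens=10)
print(sizes)
```

After the initial fill, the cache size oscillates between budget + 1 and budget + prune_interval, so peak memory per head is bounded by budget + prune_interval entries under these assumptions.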

Default config (JSON):

{
   "budget": null,
   "prune_interval": 128,
   "window_size": 128,
   "pruning_mode": "per_head",
   "score_aggregation": "mean",
   "offset_max_length": 65536,
   "disable_mlr": false,
   "disable_trig": false,
   "calib_size": 100000
}

field budget: int | None

Number of KV tokens to retain per head after pruning. Runtime compression triggers once the cache has grown by prune_interval tokens beyond this budget.

field calib_size: int

Number of tokens used for calibration. The suggested range is 50K to 960K tokens; text from any domain can be used.

field disable_mlr: bool

If True, disable the extra magnitude linear regression (MLR) term.

field disable_trig: bool

If True, disable the trigonometric term and use only the additive (MLR) term for scoring.

field offset_max_length: int

Largest geometric offset used for scoring. Offsets are the powers of two [1, 2, 4, …, offset_max_length].
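The offset sequence can be reproduced with a short loop. This is a sketch under assumptions: `geometric_offsets` is a hypothetical helper, not part of the library, and if offset_max_length is not a power of two it stops at the largest power of two that does not exceed it.

```python
def geometric_offsets(offset_max_length: int) -> list[int]:
    """Return [1, 2, 4, ...] up to and including offset_max_length.

    Hypothetical helper: doubles the offset until it would exceed the limit.
    """
    offsets = []
    offset = 1
    while offset <= offset_max_length:
        offsets.append(offset)
        offset *= 2
    return offsets


offsets = geometric_offsets(65536)
print(len(offsets), offsets[:3], offsets[-1])
```

With the default of 65536 (2^16), this yields 17 offsets.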

field prune_interval: int

Re-score and evict every N generated tokens.

field pruning_mode: Literal['per_head', 'per_layer_per_head']

'per_head': independent budget per KV head. 'per_layer_per_head': budget allocated per layer and head.

field score_aggregation: Literal['mean', 'max']

How to aggregate scores across geometric offsets.
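The difference between the two aggregation options can be sketched as follows. The layout of the input (one list of per-token scores for each offset) and the function name `aggregate_scores` are assumptions made for illustration; the actual scoring kernel may organise its data differently.

```python
from statistics import fmean


def aggregate_scores(per_offset_scores: list[list[float]], mode: str = "mean") -> list[float]:
    """Collapse per-offset scores into one score per token.

    Hypothetical layout: per_offset_scores[i][t] is the score of token t
    evaluated at the i-th geometric offset.
    """
    if mode not in ("mean", "max"):
        raise ValueError(f"unsupported score_aggregation: {mode!r}")
    reducer = fmean if mode == "mean" else max
    # zip(*...) regroups the scores so each tuple holds one token's scores
    # across all offsets.
    return [reducer(token_scores) for token_scores in zip(*per_offset_scores)]


scores = [[1.0, 4.0], [3.0, 2.0]]  # two offsets, two tokens
print(aggregate_scores(scores, "mean"))
print(aggregate_scores(scores, "max"))
```

In this sketch, 'mean' favours tokens that score consistently across offsets, while 'max' keeps a token that scores highly at any single offset.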

field window_size: int

Number of most recent tokens always retained.
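One way to picture how the recent-token window interacts with score-based eviction is the sketch below. It is illustrative only: `select_kept_indices` is a hypothetical helper, and it assumes the protected window counts toward the budget, which the reference above does not state.

```python
def select_kept_indices(scores: list[float], budget: int, window_size: int) -> list[int]:
    """Pick which cached token positions survive one pruning round.

    Assumptions (not confirmed by the source): the last window_size tokens
    are always kept and count toward the budget; the remaining slots go to
    the highest-scoring older tokens.
    """
    n = len(scores)
    if n <= budget:
        return list(range(n))  # nothing to evict yet
    window_start = max(0, n - window_size)
    recent = list(range(window_start, n))      # always retained
    slots_left = max(0, budget - len(recent))  # slots for older tokens
    ranked_older = sorted(range(window_start), key=lambda i: scores[i], reverse=True)
    kept_older = sorted(ranked_older[:slots_left])  # restore positional order
    return kept_older + recent


kept = select_kept_indices([0.1, 0.9, 0.5, 0.2, 0.3, 0.0], budget=4, window_size=2)
print(kept)
```

Here positions 4 and 5 survive regardless of their scores because they fall inside the window, and the two remaining slots go to the best-scoring older positions.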