config
Configuration for KV cache sparsity modes.
- ModeloptConfig TriAttentionConfig
Bases: ModeloptBaseConfig
Configuration for TriAttention KV cache eviction.
TriAttention scores cached KV entries using a trigonometric model derived from pre-RoPE Q/K concentration. Calibration computes per-head frequency statistics; at runtime, the serving engine scores and evicts tokens periodically.
- Default config (JSON):
{
  "budget": 2048,
  "prune_interval": 128,
  "window_size": 128,
  "pruning_mode": "per_head",
  "score_aggregation": "mean",
  "offset_max_length": 65536,
  "disable_mlr": false,
  "disable_trig": false,
  "calib_size": 100000
}
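As a minimal sketch of how these defaults might be overridden, the snippet below works on a plain dictionary rather than on the real `TriAttentionConfig` class, whose import path and constructor are not shown here. The helper name `make_config` is hypothetical and exists only for illustration.

```python
import json

# Default values copied from the JSON shown above.
DEFAULT_TRIATTENTION_CONFIG = json.loads("""
{
  "budget": 2048,
  "prune_interval": 128,
  "window_size": 128,
  "pruning_mode": "per_head",
  "score_aggregation": "mean",
  "offset_max_length": 65536,
  "disable_mlr": false,
  "disable_trig": false,
  "calib_size": 100000
}
""")


def make_config(**overrides):
    """Hypothetical helper: merge overrides into the defaults.

    Unknown field names are rejected so that typos fail loudly.
    """
    unknown = set(overrides) - set(DEFAULT_TRIATTENTION_CONFIG)
    if unknown:
        raise KeyError(f"unknown TriAttention fields: {sorted(unknown)}")
    return {**DEFAULT_TRIATTENTION_CONFIG, **overrides}


# Tighter budget and max aggregation; every other field keeps its default.
cfg = make_config(budget=1024, score_aggregation="max")
print(cfg["budget"], cfg["score_aggregation"], cfg["prune_interval"])
```

Rejecting unknown keys is a design choice of this sketch, not a documented behavior of the library.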
- field budget: int
Number of KV tokens to retain per head after pruning.
- field calib_size: int
Number of tokens used for calibration (50K to 960K). The calibration text may come from any domain.
- field disable_mlr: bool
If True, disable the extra magnitude linear regression (MLR) term.
- field disable_trig: bool
If True, use only the additive (MLR) term for scoring.
- field offset_max_length: int
Largest geometric offset used for scoring. Offsets are powers of two: [1, 2, 4, …, offset_max_length].
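The offset list can be illustrated with a short sketch. The function name `geometric_offsets` is hypothetical, and the sketch assumes the offsets are successive powers of two; if `offset_max_length` is not itself a power of two, this version stops at the largest power of two below it, which may differ from what the library does.

```python
def geometric_offsets(offset_max_length: int) -> list[int]:
    """Return powers of two from 1 up to and including offset_max_length."""
    if offset_max_length < 1:
        raise ValueError("offset_max_length must be at least 1")
    offsets = []
    offset = 1
    while offset <= offset_max_length:
        offsets.append(offset)
        offset *= 2  # geometric progression with ratio 2
    return offsets


offsets = geometric_offsets(65536)
print(len(offsets), offsets[:4], offsets[-1])  # 17 [1, 2, 4, 8] 65536
```

With the default of 65536 this gives 17 offsets, so scoring cost grows logarithmically with the maximum offset rather than linearly.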
- field prune_interval: int
Re-score and evict every N generated tokens.
- field pruning_mode: Literal['per_head', 'per_layer_per_head']
'per_head': independent budget per KV head. 'per_layer_per_head': budget allocated per layer and head.
- field score_aggregation: Literal['mean', 'max']
How to aggregate scores across geometric offsets.
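To make the two aggregation modes concrete, the following sketch reduces one token's per-offset scores to a single value. The function name `aggregate_scores` and the sample numbers are illustrative assumptions, not part of the library.

```python
from statistics import fmean


def aggregate_scores(per_offset_scores, mode="mean"):
    """Collapse one token's per-offset scores into a single score."""
    if mode == "mean":
        return fmean(per_offset_scores)
    if mode == "max":
        return max(per_offset_scores)
    raise ValueError(f"unsupported score_aggregation: {mode!r}")


# Hypothetical scores for a single token at three offsets.
token_scores = [1.0, 3.0, 2.0]
print(aggregate_scores(token_scores, "mean"))  # 2.0
print(aggregate_scores(token_scores, "max"))   # 3.0
```

In this sketch, 'mean' favors tokens that score well at most offsets, while 'max' keeps a token that scores highly at any single offset.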
- field window_size: int
Number of most recent tokens always retained.
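The `budget`, `prune_interval`, and `window_size` fields work together at runtime. The sketch below shows one plausible reading for a single KV head, given already-computed scores. It assumes that the recent window counts toward the budget and that pruning triggers whenever the generated-token count is a multiple of `prune_interval`; neither detail is confirmed by this page, and both function names are hypothetical.

```python
def should_prune(num_generated: int, prune_interval: int) -> bool:
    """Assumed trigger: prune after every prune_interval generated tokens."""
    return num_generated > 0 and num_generated % prune_interval == 0


def select_kept_indices(scores, budget, window_size):
    """Return sorted cache positions to keep for one KV head.

    Assumption: the window_size most recent tokens are always kept and
    count toward the budget; remaining slots go to the highest-scoring
    older tokens.
    """
    n = len(scores)
    if n <= budget:
        return list(range(n))  # nothing to evict yet
    window = min(window_size, budget)
    recent_start = n - window
    recent = list(range(recent_start, n))
    # Rank older tokens by score, highest first, and keep the top slots.
    ranked_older = sorted(range(recent_start), key=lambda i: scores[i], reverse=True)
    kept_older = sorted(ranked_older[: budget - window])
    return kept_older + recent


scores = [0.9, 0.1, 0.5, 0.3, 0.2, 0.4]
kept = select_kept_indices(scores, budget=4, window_size=2)
print(kept)  # [0, 2, 4, 5]
```

Here positions 4 and 5 survive because they fall inside the recent window, and positions 0 and 2 fill the remaining two slots because they have the highest scores among the older tokens.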