config

Configuration classes for sparse attention optimization.

ModeloptConfig CalibrationConfig

Bases: ModeloptBaseConfig

Configuration for automatic threshold calibration using RULER dataset.

Calibration fits an exponential model to determine dynamic thresholds that achieve a target sparsity. The model learns parameters a and b per phase:

scale_factor = a * exp(b * target_sparsity)

At inference time, the threshold is computed as:

threshold = scale_factor / sequence_length

Key benefits:

- Target sparsity can be changed at runtime without recalibration
- Threshold automatically adapts to sequence length
- Supports independent prefill and decode phase calibration
- Exponential model provides a better fit (lower RMSE)
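The two formulas above compose into a single lookup at inference time. A minimal sketch, using hypothetical fitted values for a and b (the real values come out of calibration):

```python
import math

def calibrated_threshold(a: float, b: float, target_sparsity: float, seq_len: int) -> float:
    """Compute the dynamic skip threshold from fitted calibration parameters.

    a and b are the per-phase parameters learned during calibration.
    The exponential model maps target sparsity to a scale factor,
    which is then normalized by sequence length.
    """
    scale_factor = a * math.exp(b * target_sparsity)
    return scale_factor / seq_len

# Hypothetical fitted values (a=0.02, b=3.0), same target, two lengths:
t_short = calibrated_threshold(0.02, 3.0, target_sparsity=0.5, seq_len=4096)
t_long = calibrated_threshold(0.02, 3.0, target_sparsity=0.5, seq_len=32768)
# The threshold shrinks as the sequence grows, keeping sparsity on target.
assert t_long < t_short
```

This is why changing the target sparsity at runtime needs no recalibration: only the scale factor is recomputed, not a and b.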

Default config (JSON):

{
   "target_sparse_ratio": {
      "prefill": 0.5,
      "decode": 0.5
   },
   "samples": 24,
   "max_seqlen": 32768,
   "num_length_bins": 4,
   "chunk_size": 2048,
   "num_decode_tokens": 10,
   "threshold_trials": null,
   "cache_dir": null,
   "data_dir": null
}

field cache_dir: str | None

Directory to cache generated calibration samples. Caching avoids regenerating samples on repeated calibration runs.

field chunk_size: int

Chunk size for chunked prefill, used to avoid OOM on long sequences. When the sequence length exceeds chunk_size, prefill runs in chunks that carry state through the KV cache. Set to -1 to disable chunking (full prefill).
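The chunked-prefill control flow can be sketched as follows. This is an illustration only; the model call signature and cache handling here are hypothetical stand-ins, not this library's API:

```python
def chunked_prefill(model, input_ids, chunk_size):
    """Illustrative control flow for chunked prefill (names are hypothetical).

    When the prompt exceeds chunk_size, feed it in chunks and let the
    KV cache carry state between chunks, bounding peak activation memory.
    """
    if chunk_size == -1 or len(input_ids) <= chunk_size:
        return model(input_ids)  # full prefill in one shot
    past = None
    out = None
    for start in range(0, len(input_ids), chunk_size):
        chunk = input_ids[start:start + chunk_size]
        out = model(chunk, past_key_values=past)
        past = out.past_key_values  # reuse the cache for the next chunk
    return out
```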

field data_dir: str | None

Path to the RULER data directory (contains an 'essays' subdir with Paul Graham .txt files). Required for NIAH essay tasks when not using the repo layout. Set from the example script or CLI.

field max_seqlen: int

Maximum sequence length for calibration (length bins auto-generated as powers of 2).

field num_decode_tokens: int

Number of decode tokens to generate for decode phase calibration.

field num_length_bins: int

Number of length bins to generate (hidden parameter, default: 4).

field samples: int

Total number of RULER samples for calibration, distributed across length bins. The default (24) provides one sample per task per length bin (4 bins * 6 RULER tasks). Increase for more robust calibration.

field target_sparse_ratio: dict[str, float]

Target ratio of sparse attention blocks (0.0 to 1.0). Dict with 'prefill' and 'decode' keys for per-phase targets. Set a phase value to 0.0 to skip calibration for that phase.

field threshold_trials: list[float] | None

List of threshold values to test during calibration. If None, uses default: [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 2e-2, 5e-2, 1e-1, 2e-1, 3e-1, 5e-1, 7e-1]. Increasing the number of trials improves calibration accuracy but slows down calibration.
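Conceptually, the trial sweep picks the threshold whose measured sparsity lands closest to the target. A minimal sketch, where measure_sparsity is a hypothetical stand-in for running calibration samples through the model at a given threshold:

```python
# Default trial grid from the field description above.
DEFAULT_TRIALS = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3,
                  1e-2, 2e-2, 5e-2, 1e-1, 2e-1, 3e-1, 5e-1, 7e-1]

def pick_threshold(measure_sparsity, target, trials=DEFAULT_TRIALS):
    """Pick the trial threshold whose measured sparsity is closest to target.

    measure_sparsity(t) stands in for evaluating the sparse-block ratio
    on calibration samples at threshold t; the real calibration loop
    fits the exponential model on top of these measurements.
    """
    return min(trials, key=lambda t: abs(measure_sparsity(t) - target))
```

A denser trial grid narrows the gap between the best trial and the true target, which is the accuracy/runtime trade-off the field description mentions.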

ModeloptConfig FlashSkipSoftmaxConfig

Bases: SparseAttentionConfig

Configuration for Flash Attention-aware softmax skip sparse attention.

Default config (JSON):

{
   "sparse_cfg": {
      "*attention*": {
         "backend": "pytorch",
         "bc": 128,
         "br": 128,
         "collect_stats": true,
         "enable": true,
         "method": "flash_skip_softmax",
         "threshold": {
            "decode": 1e-05,
            "prefill": 0.001
         }
      },
      "default": {
         "enable": false
      }
   },
   "export_format": null
}

field sparse_cfg: dict[str | Callable, dict[str, Any]]

Pattern-based configuration with flash_skip_softmax specific defaults. Includes FA block sizes (br, bc) and correction factor settings.
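As a sketch, an override mirroring the default JSON above can be written as a plain dict; the field names come from the defaults shown, and the values here are illustrative, not recommendations:

```python
# Pattern-based sparse_cfg mirroring FlashSkipSoftmaxConfig defaults.
FLASH_SKIP_SOFTMAX_CFG = {
    "sparse_cfg": {
        "*attention*": {
            "method": "flash_skip_softmax",
            "enable": True,
            "backend": "pytorch",
            "br": 128,  # Flash Attention block row size
            "bc": 128,  # Flash Attention block column size
            "collect_stats": True,
            "threshold": {"prefill": 1e-3, "decode": 1e-5},
        },
        "default": {"enable": False},  # leave unmatched modules dense
    },
    "export_format": None,
}
```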

ModeloptConfig SparseAttentionAttributeConfig

Bases: ModeloptBaseConfig

Sparse attention attribute configuration for pattern-based module config.

Default config (JSON):

{
   "method": "flash_skip_softmax",
   "enable": true,
   "threshold": {
      "prefill": 0.001,
      "decode": 0.0001
   },
   "br": 128,
   "bc": 128,
   "backend": "pytorch",
   "collect_stats": false,
   "is_causal": true
}

field backend: str

Backend to use for sparse attention computation. Only 'pytorch' is supported; it patches F.softmax and requires the model to be loaded with attn_implementation='eager'.

field bc: int

Block column size for block-wise sparsity in Flash Attention.

field br: int

Block row size for block-wise sparsity in Flash Attention.

field collect_stats: bool

Whether to collect sparsity statistics during forward pass for monitoring.

field enable: bool

If True, enables sparse attention. If False, bypasses sparsity.

field is_causal: bool

Whether the model uses causal (autoregressive) attention. If True, sparsity statistics are calculated over the lower triangle only. Defaults to True for decoder-only models like GPT, LLaMA, etc.
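The lower-triangle counting can be illustrated with a small pure-Python sketch (the function name and mask representation are hypothetical; the library computes this internally):

```python
def causal_sparsity(skip_mask):
    """Fraction of skipped entries, counted over the lower triangle only.

    skip_mask is a square list-of-lists of booleans (True = skipped).
    For causal attention, positions above the diagonal are always masked
    out anyway, so they are excluded from the statistic.
    """
    n = len(skip_mask)
    total = n * (n + 1) // 2  # number of lower-triangle entries
    skipped = sum(skip_mask[i][j] for i in range(n) for j in range(i + 1))
    return skipped / total

# 3x3 example: one of the six lower-triangle entries is skipped,
# so the causal sparsity is 1/6 even though the full mask has more Trues.
mask = [[False, True,  True],
        [True,  False, True],
        [False, False, False]]
```

Counting over the full matrix would inflate the sparsity of causal models, since the upper triangle is masked regardless of the threshold.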

field method: str

The sparse attention method to use (e.g., 'flash_skip_softmax').

field threshold: dict[str, float]

Threshold for determining which attention values to skip. Must be a dict with 'prefill' and 'decode' keys.

ModeloptConfig SparseAttentionConfig

Bases: ModeloptBaseConfig

Base configuration for sparse attention optimization.

This base configuration provides the common structure for all sparse attention methods and supports pattern-based layer configuration.

Default config (JSON):

{
   "sparse_cfg": {
      "*attention*": {
         "enable": true,
         "method": "flash_skip_softmax"
      },
      "default": {
         "enable": false
      }
   },
   "export_format": null
}

field export_format: str | None

Export format for sparse attention (e.g., 'onnx', 'tensorrt').

field sparse_cfg: dict[str | Callable, dict[str, Any]]

Pattern-based configuration for sparse attention. Keys are patterns to match module names (or 'calibration' for global calibration settings); values are configuration dicts with parameters like 'threshold', 'enable', etc.
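A sketch of how wildcard keys like "*attention*" could resolve to a per-module config. The lookup below uses shell-style matching via fnmatch and is an assumption for illustration; the library's actual matching rules and precedence may differ:

```python
from fnmatch import fnmatch

def resolve_sparse_cfg(module_name, sparse_cfg):
    """Resolve a module's config by wildcard pattern matching (illustrative).

    String keys are treated as shell-style wildcards matched against the
    module name; the 'default' entry applies when nothing else matches.
    """
    for pattern, cfg in sparse_cfg.items():
        if pattern == "default" or not isinstance(pattern, str):
            continue  # skip the fallback entry and Callable keys
        if fnmatch(module_name, pattern):
            return cfg
    return sparse_cfg.get("default", {})

cfg = {
    "*attention*": {"enable": True, "method": "flash_skip_softmax"},
    "default": {"enable": False},
}
```

With this cfg, a module named "model.layers.0.self_attention" matches the wildcard and gets sparse attention enabled, while an MLP module falls through to the disabled default.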