config

Configuration classes for sparse attention optimization.

ModeloptConfig CalibrationConfig

Bases: ModeloptBaseConfig

Configuration for automatic threshold calibration using RULER dataset.

Calibration fits an exponential model to determine dynamic thresholds that achieve a target sparsity. The model learns parameters a and b per phase:

scale_factor = a * exp(b * target_sparsity)

At inference time, the threshold is computed as:

threshold = scale_factor / sequence_length

Key benefits:

- Target sparsity can be changed at runtime without recalibration
- Threshold automatically adapts to sequence length
- Supports independent prefill and decode phase calibration
- Exponential model provides a better fit (lower RMSE)
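The two formulas above compose into a single lookup at inference time. A minimal sketch, using hypothetical fitted values for a and b (the real values come out of calibration):

```python
import math

def calibrated_threshold(a: float, b: float, target_sparsity: float, seq_len: int) -> float:
    """Compute the dynamic skip threshold from fitted calibration parameters.

    a and b are the per-phase parameters learned during calibration.
    The exponential model maps target sparsity to a scale factor,
    which is then normalized by sequence length.
    """
    scale_factor = a * math.exp(b * target_sparsity)
    return scale_factor / seq_len

# Hypothetical fitted values (a=0.02, b=3.0), same target, two lengths:
t_short = calibrated_threshold(0.02, 3.0, target_sparsity=0.5, seq_len=4096)
t_long = calibrated_threshold(0.02, 3.0, target_sparsity=0.5, seq_len=32768)
# The threshold shrinks as the sequence grows, keeping sparsity on target.
assert t_long < t_short
```

This is why changing the target sparsity at runtime needs no recalibration: only the scale factor is recomputed, not a and b.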

Default config (JSON):

{
   "target_sparse_ratio": {
      "prefill": 0.5,
      "decode": 0.5
   },
   "samples": 24,
   "max_seqlen": 32768,
   "num_length_bins": 4,
   "chunk_size": 2048,
   "num_decode_tokens": 10,
   "threshold_trials": null,
   "cache_dir": null,
   "data_dir": null
}

field cache_dir: str | None

Directory to cache generated calibration samples. Caching avoids regenerating samples on repeated calibration runs.

field chunk_size: int

Chunk size for chunked prefill, used to avoid OOM on long sequences. When the sequence length exceeds chunk_size, prefill runs in chunks that carry state through the KV cache. Set to -1 to disable chunking (full prefill).
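The chunked-prefill control flow can be sketched as follows. This is an illustration only; the model call signature and cache handling here are hypothetical stand-ins, not this library's API:

```python
def chunked_prefill(model, input_ids, chunk_size):
    """Illustrative control flow for chunked prefill (names are hypothetical).

    When the prompt exceeds chunk_size, feed it in chunks and let the
    KV cache carry state between chunks, bounding peak activation memory.
    """
    if chunk_size == -1 or len(input_ids) <= chunk_size:
        return model(input_ids)  # full prefill in one shot
    past = None
    out = None
    for start in range(0, len(input_ids), chunk_size):
        chunk = input_ids[start:start + chunk_size]
        out = model(chunk, past_key_values=past)
        past = out.past_key_values  # reuse the cache for the next chunk
    return out
```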

field data_dir: str | None

Path to the RULER data directory (contains an 'essays' subdir with Paul Graham .txt files). Required for NIAH essay tasks when not using the repo layout. Set from the example script or CLI.

field max_seqlen: int

Maximum sequence length for calibration (length bins auto-generated as powers of 2).

field num_decode_tokens: int

Number of decode tokens to generate for decode phase calibration.

field num_length_bins: int

Number of length bins to generate (hidden parameter, default: 4).

field samples: int

Total number of RULER samples for calibration, distributed across length bins. The default (24) provides one sample per task per length bin (4 bins * 6 RULER tasks). Increase for more robust calibration.

field target_sparse_ratio: dict[str, float]

Target ratio of sparse attention blocks (0.0 to 1.0). Dict with 'prefill' and 'decode' keys for per-phase targets. Set a phase value to 0.0 to skip calibration for that phase.

field threshold_trials: list[float] | None

List of threshold values to test during calibration. If None, uses default: [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 2e-2, 5e-2, 1e-1, 2e-1, 3e-1, 5e-1, 7e-1]. Increasing the number of trials improves calibration accuracy but slows down calibration.
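Conceptually, the trial sweep picks the threshold whose measured sparsity lands closest to the target. A minimal sketch, where measure_sparsity is a hypothetical stand-in for running calibration samples through the model at a given threshold:

```python
# Default trial grid from the field description above.
DEFAULT_TRIALS = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3,
                  1e-2, 2e-2, 5e-2, 1e-1, 2e-1, 3e-1, 5e-1, 7e-1]

def pick_threshold(measure_sparsity, target, trials=DEFAULT_TRIALS):
    """Pick the trial threshold whose measured sparsity is closest to target.

    measure_sparsity(t) stands in for evaluating the sparse-block ratio
    on calibration samples at threshold t; the real calibration loop
    fits the exponential model on top of these measurements.
    """
    return min(trials, key=lambda t: abs(measure_sparsity(t) - target))
```

A denser trial grid narrows the gap between the best trial and the true target, which is the accuracy/runtime trade-off the field description mentions.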

ModeloptConfig FlashSkipSoftmaxConfig

Bases: SparseAttentionConfig

Configuration for Flash Attention-aware softmax skip sparse attention.

Default config (JSON):

{
   "sparse_cfg": {
      "*attention*": {
         "backend": "pytorch",
         "bc": 128,
         "br": 128,
         "collect_stats": true,
         "enable": true,
         "method": "flash_skip_softmax",
         "threshold": {
            "decode": 1e-05,
            "prefill": 0.001
         }
      },
      "default": {
         "enable": false
      }
   },
   "export_format": null
}

field sparse_cfg: dict[str | Callable, dict[str, Any]]

Pattern-based configuration with flash_skip_softmax specific defaults. Includes FA block sizes (br, bc) and correction factor settings.
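As a sketch, an override mirroring the default JSON above can be written as a plain dict; the field names come from the defaults shown, and the values here are illustrative, not recommendations:

```python
# Pattern-based sparse_cfg mirroring FlashSkipSoftmaxConfig defaults.
FLASH_SKIP_SOFTMAX_CFG = {
    "sparse_cfg": {
        "*attention*": {
            "method": "flash_skip_softmax",
            "enable": True,
            "backend": "pytorch",
            "br": 128,  # Flash Attention block row size
            "bc": 128,  # Flash Attention block column size
            "collect_stats": True,
            "threshold": {"prefill": 1e-3, "decode": 1e-5},
        },
        "default": {"enable": False},  # leave unmatched modules dense
    },
    "export_format": None,
}
```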

ModeloptConfig SparseAttentionAttributeConfig

Bases: ModeloptBaseConfig

Sparse attention attribute configuration for pattern-based module config.

Default config (JSON):

{
   "method": "flash_skip_softmax",
   "enable": true,
   "threshold": {
      "prefill": 0.001,
      "decode": 0.0001
   },
   "br": 128,
   "bc": 128,
   "backend": "pytorch",
   "collect_stats": false,
   "is_causal": true
}

field backend: str

Backend to use for sparse attention computation. Only 'pytorch' is supported; it patches F.softmax and requires the model to be loaded with attn_implementation='eager'.

field bc: int

Block column size for block-wise sparsity in Flash Attention.

field br: int

Block row size for block-wise sparsity in Flash Attention.

field collect_stats: bool

Whether to collect sparsity statistics during forward pass for monitoring.

field enable: bool

If True, enables sparse attention. If False, bypasses sparsity.

field is_causal: bool

Whether the model uses causal (autoregressive) attention. If True, sparsity statistics are calculated over the lower triangle only. Defaults to True for decoder-only models like GPT, LLaMA, etc.
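The lower-triangle counting can be illustrated with a small pure-Python sketch (the function name and mask representation are hypothetical; the library computes this internally):

```python
def causal_sparsity(skip_mask):
    """Fraction of skipped entries, counted over the lower triangle only.

    skip_mask is a square list-of-lists of booleans (True = skipped).
    For causal attention, positions above the diagonal are always masked
    out anyway, so they are excluded from the statistic.
    """
    n = len(skip_mask)
    total = n * (n + 1) // 2  # number of lower-triangle entries
    skipped = sum(skip_mask[i][j] for i in range(n) for j in range(i + 1))
    return skipped / total

# 3x3 example: one of the six lower-triangle entries is skipped,
# so the causal sparsity is 1/6 even though the full mask has more Trues.
mask = [[False, True,  True],
        [True,  False, True],
        [False, False, False]]
```

Counting over the full matrix would inflate the sparsity of causal models, since the upper triangle is masked regardless of the threshold.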

field method: str

The sparse attention method to use (e.g., 'flash_skip_softmax').

field threshold: dict[str, float]

Threshold for determining which attention values to skip. Must be a dict with 'prefill' and 'decode' keys.

ModeloptConfig SparseAttentionConfig

Bases: ModeloptBaseConfig

Base configuration for sparse attention optimization.

This base configuration provides the common structure for all sparse attention methods and supports pattern-based layer configuration.

Default config (JSON):

{
   "sparse_cfg": {
      "*attention*": {
         "enable": true,
         "method": "flash_skip_softmax"
      },
      "default": {
         "enable": false
      }
   },
   "export_format": null
}

field export_format: str | None

Export format for sparse attention (e.g., 'onnx', 'tensorrt').

field sparse_cfg: dict[str | Callable, dict[str, Any]]

Pattern-based configuration for sparse attention. Keys are patterns to match module names (or 'calibration' for global calibration settings); values are configuration dicts with parameters like 'threshold', 'enable', etc.
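A sketch of how wildcard keys like "*attention*" could resolve to a per-module config. The lookup below uses shell-style matching via fnmatch and is an assumption for illustration; the library's actual matching rules and precedence may differ:

```python
from fnmatch import fnmatch

def resolve_sparse_cfg(module_name, sparse_cfg):
    """Resolve a module's config by wildcard pattern matching (illustrative).

    String keys are treated as shell-style wildcards matched against the
    module name; the 'default' entry applies when nothing else matches.
    """
    for pattern, cfg in sparse_cfg.items():
        if pattern == "default" or not isinstance(pattern, str):
            continue  # skip the fallback entry and Callable keys
        if fnmatch(module_name, pattern):
            return cfg
    return sparse_cfg.get("default", {})

cfg = {
    "*attention*": {"enable": True, "method": "flash_skip_softmax"},
    "default": {"enable": False},
}
```

With this cfg, a module named "model.layers.0.self_attention" matches the wildcard and gets sparse attention enabled, while an MLP module falls through to the disabled default.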