System Prompt Cache

Overview

The system (instruction) prompt cache reduces prefill latency by reusing the KV cache computed for a previously seen system prompt. When the same system prompt appears in multiple requests, the runtime skips recomputing its KV cache, which reduces time-to-first-token (TTFT).

Key Points:

  • The first request with save_system_prompt_kv_cache: true computes the system prompt's KV cache and saves it

  • Subsequent requests with the same system prompt reuse the saved KV cache automatically

  • The cache is keyed by (system_prompt_text, lora_weights_name); the prompt text must match exactly

  • In-memory only: the cache persists for the lifetime of the runtime but not across restarts
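The keying behavior described in these points can be pictured with a toy model. This is only a sketch of the lookup semantics, not the runtime's implementation: the class name SystemPromptCacheModel is hypothetical, and plain strings stand in for the real KV tensors.

```python
from typing import Dict, Optional, Tuple

# Key shape taken from the text: (system_prompt_text, lora_weights_name).
CacheKey = Tuple[str, Optional[str]]


class SystemPromptCacheModel:
    """Toy in-memory cache keyed by (system prompt, LoRA name). Hypothetical."""

    def __init__(self) -> None:
        # A plain dict: contents vanish when the process exits.
        self._entries: Dict[CacheKey, str] = {}

    def save(self, system_prompt: str, lora_name: Optional[str], kv: str) -> None:
        self._entries[(system_prompt, lora_name)] = kv

    def lookup(self, system_prompt: str, lora_name: Optional[str] = None) -> Optional[str]:
        # Exact string match: any difference in the prompt text is a miss.
        return self._entries.get((system_prompt, lora_name))


cache = SystemPromptCacheModel()
prompt = "You are a helpful Python programming assistant."
cache.save(prompt, None, "kv-for-base-model")  # string placeholder for KV data

print(cache.lookup(prompt))            # hit: identical text, no LoRA
print(cache.lookup(prompt + " "))      # miss: trailing space changes the key
print(cache.lookup(prompt, "french"))  # miss: a different LoRA has its own entry
```

The second and third lookups return None, which mirrors the exact-match and per-LoRA rules stated above.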


Usage

Basic Example

First request saves the cache:

{
    "requests": [
        {
            "messages": [
                {"role": "system", "content": "You are a helpful Python programming assistant."},
                {"role": "user", "content": "How do I read a CSV file?"}
            ],
            "save_system_prompt_kv_cache": true
        }
    ]
}

Subsequent requests automatically reuse it (no flag needed):

{
    "requests": [
        {
            "messages": [
                {"role": "system", "content": "You are a helpful Python programming assistant."},
                {"role": "user", "content": "How do I write JSON?"}
            ]
        }
    ]
}
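When requests are generated by a script rather than written by hand, the two payloads above could be produced by one helper so that the system prompt string is guaranteed to be identical in both. The following is a minimal sketch; build_input is a hypothetical helper name, and only the JSON fields shown in the examples above are used.

```python
import json
from typing import Any, Dict


def build_input(system_prompt: str, user_message: str, save_cache: bool = False) -> Dict[str, Any]:
    """Build a single-request input document in the format shown above."""
    request: Dict[str, Any] = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ]
    }
    if save_cache:
        # Only the first request needs the flag; later requests omit it.
        request["save_system_prompt_kv_cache"] = True
    return {"requests": [request]}


# One shared constant avoids accidental differences between requests.
SYSTEM_PROMPT = "You are a helpful Python programming assistant."

first = build_input(SYSTEM_PROMPT, "How do I read a CSV file?", save_cache=True)
later = build_input(SYSTEM_PROMPT, "How do I write JSON?")

print(json.dumps(first, indent=4))
print(json.dumps(later, indent=4))
```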

LoRA Support

The cache is LoRA-aware: the same system prompt gets a separate cache entry for each LoRA adapter it is used with:

{
    "available_lora_weights": {
        "french": "/path/to/french_adapter.safetensors"
    },
    "requests": [
        {
            "messages": [
                {"role": "system", "content": "Translate the following to French."},
                {"role": "user", "content": "Hello"}
            ],
            "lora_name": "french",
            "save_system_prompt_kv_cache": true
        }
    ]
}

EAGLE Speculative Decoding

Fully supported: the KV caches of both the base model and the draft model are saved and reused automatically. No special configuration is needed.


Limitations

  • Multimodal system prompts are not supported: the system prompt must contain text only

  • Exact string match required: any difference in whitespace or punctuation causes a cache miss

  • In-memory only: the cache is lost when the runtime terminates

  • Chat template aware: when apply_chat_template: true (the default), the formatted prompt is used as the cache key
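Because of the exact-match requirement, it can help to canonicalize system prompts on the client side before building requests. The sketch below shows one possible convention; canonicalize_system_prompt is a hypothetical helper, and nothing here implies that the runtime performs any normalization itself.

```python
def canonicalize_system_prompt(text: str) -> str:
    """Return one canonical spelling of a system prompt (client-side convention)."""
    # Normalize Windows and old-Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop trailing whitespace on each line, then surrounding blank space.
    lines = [line.rstrip() for line in text.split("\n")]
    return "\n".join(lines).strip()


# Two spellings that differ only in whitespace and line endings.
a = canonicalize_system_prompt("You are a helpful assistant. \r\nAnswer briefly.\n")
b = canonicalize_system_prompt("You are a helpful assistant.\nAnswer briefly.")

print(a == b)  # both spellings collapse to the same string
```

Applying the same function everywhere a system prompt is emitted keeps the strings byte-identical, which is what the cache key depends on.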


Best Practices

When to Use:

  • Long system prompts (> 1K tokens) used repeatedly

  • Multi-tenant serving with role-specific system prompts

  • Agent systems with consistent instruction templates

Not Recommended:

  • Short prompts (< 100 tokens): the caching overhead exceeds the benefit

  • Frequently changing prompts

Tips:

  • Standardize system prompts to maximize cache hit rate

  • Pre-warm the cache during initialization by sending requests with save_system_prompt_kv_cache: true
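A pre-warm step could be scripted along the following lines. This is a sketch under two assumptions that the text does not confirm: that one input document may carry several flagged requests in its "requests" list, and that a short placeholder user message is acceptable for warm-up. The helper name build_warmup_input is hypothetical.

```python
import json
from typing import Any, Dict, List


def build_warmup_input(system_prompts: List[str], warmup_user_message: str = "Hello") -> Dict[str, Any]:
    """Build one input document that requests a cache save for each system prompt."""
    requests = []
    for prompt in system_prompts:
        requests.append({
            "messages": [
                {"role": "system", "content": prompt},
                # Placeholder user turn; its content is an arbitrary choice.
                {"role": "user", "content": warmup_user_message},
            ],
            "save_system_prompt_kv_cache": True,
        })
    return {"requests": requests}


warmup = build_warmup_input([
    "You are a helpful Python programming assistant.",
    "Translate the following to French.",
])
print(json.dumps(warmup, indent=4))
```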


Notes

  • No build-time configuration is required; the feature is always available

  • Enable debug logging (llm_inference --debug) to verify cache usage

  • Prefill metrics track reused vs. computed tokens for monitoring cache effectiveness
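One simple way to summarize cache effectiveness from the reused and computed token counts is the fraction of prefill tokens that were reused. The sketch below assumes the two counts have already been read from the prefill metrics; the function and parameter names are hypothetical, and the numbers are illustrative only.

```python
def reuse_ratio(reused_tokens: int, computed_tokens: int) -> float:
    """Fraction of prefill tokens served from the cache rather than recomputed."""
    total = reused_tokens + computed_tokens
    if total == 0:
        # No prefill work recorded; report zero instead of dividing by zero.
        return 0.0
    return reused_tokens / total


# Illustrative numbers: 1200 system-prompt tokens reused, 40 user tokens computed.
print(f"{reuse_ratio(1200, 40):.3f}")  # prints 0.968
```

A ratio near zero on requests that were expected to hit the cache suggests that the system prompt strings do not match exactly.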