# KV Cache Architecture

## Overview
The caching system in AutoDeploy manages KV caches for attention layers, SSM/convolution states for Mamba models, and other stateful resources. The architecture is built around three key concepts:
- **Resource Handlers** - Abstract descriptions of cache resources (shape, dtype, layout)
- **`CachedSequenceInterface`** - The central manager that collects handlers and allocates caches
- **`KVCacheManager` / `MambaHybridCacheManager`** - Low-level memory managers from the executor
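The sketch below shows how these three pieces interact. It is illustrative glue only: in practice these calls are driven by AutoDeploy's transform pipeline (see the file-by-file breakdown later on this page), and the `build_caches` wrapper is hypothetical, added just to make the snippet self-contained.

```python
# Illustrative only: AutoDeploy's transforms drive these calls, not user code.
def build_caches(cm, kv_handler):
    # 1. Register: a handler is just a description (shape, dtype, layout);
    #    no memory is allocated at this point.
    cm.add_resource("kv_cache_0", kv_handler)
    # 2. Allocate: groups compatible handlers, creates a KVCacheManager or
    #    MambaHybridCacheManager, and assigns views / local allocations.
    cm.initialize_resources()
```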
## Flowchart: Cache Collection and Allocation Pipeline
```mermaid
flowchart TD
    subgraph Phase1["PHASE 1: RESOURCE HANDLER COLLECTION"]
        A1[AttentionDescriptor implementations]
        A2["get_cache_initializers(node, config)"]
        A3["cm.add_resource(k_indexed, handler)"]
        A1 --> A2
        A2 -->|"Returns ResourceHandlerDict"| A3
    end
    subgraph Phase2["PHASE 2: INITIALIZE CACHES"]
        B1["_resource_lookup: Dict"]
        B2["Initialize _caches with None values"]
        B3["Order matches _resource_lookup"]
        B1 --> B2
        B2 --> B3
    end
    subgraph Phase3["PHASE 3: COMPATIBILITY CHECKING"]
        C1["_identify_managed_kv_resources()"]
        C2["_identify_managed_state_resources()"]
    end
    subgraph Phase4["PHASE 4: CACHE MANAGER CREATION"]
        D1{"Has state resources?"}
        D2["Create MambaHybridCacheManager"]
        D3["Create KVCacheManager"]
        D4["VIEW ASSIGNMENT"]
        D5["self._caches = tensor_view"]
        D1 -->|YES| D2
        D1 -->|NO| D3
        D2 --> D4
        D3 --> D4
        D4 --> D5
    end
    subgraph Phase5["PHASE 5: UNMANAGED RESOURCE ALLOCATION"]
        E1["handler.allocate(sequence_info)"]
    end
    subgraph Phase6["PHASE 6: OPTIONAL RESIZE"]
        F1["Recreate KVCacheManager with optimal capacity"]
    end
    Phase1 --> Phase2
    Phase2 --> Phase3
    Phase3 --> Phase4
    Phase4 --> Phase5
    Phase5 --> Phase6
```
## Detailed Pipeline Flow
For reference, here’s the detailed text-based flow:
```
PHASE 1: RESOURCE HANDLER COLLECTION (insert_cached_attention transform)
├── AttentionDescriptor implementations (e.g., FlashinferCachedAttention)
├── get_cache_initializers(node, config)
│   ├── Extracts shapes from FakeTensors
│   └── Returns ResourceHandlerDict {"kv_cache": KVPagedResourceHandler, ...}
└── For each attention node (idx=0,1,2...):
    └── cm.add_resource(f"{k}_{idx}", handler) → Stores in CachedSequenceInterface

PHASE 2: INITIALIZE CACHES (initialize_resources)
├── _resource_lookup contains all collected handlers
└── Initialize _caches dict with None values (same order as _resource_lookup)

PHASE 3: COMPATIBILITY CHECKING (dynamically from _resource_lookup)
├── _identify_managed_kv_resources()
│   ├── Iterate _resource_lookup, find first KVPagedResourceHandler → kv_ref
│   ├── All KVPagedResourceHandlers matching kv_ref (head_dim, dtype, layout) → kv_managed
│   └── Non-matching handlers → local allocation later
└── _identify_managed_state_resources()
    ├── Iterate _resource_lookup, find first SSMResourceHandler → ssm_ref
    ├── Iterate _resource_lookup, find first CausalConvResourceHandler → conv_ref
    ├── Check n_groups constraint: conv_dim = head_dim*num_heads + 2*n_groups*d_state
    └── If constraint fails → conv_ref = None (local allocation)

PHASE 4: CACHE MANAGER CREATION (_create_kv_cache_manager)
├── Has state resources? (ssm_managed or conv_managed)
│   ├── YES → Create MambaHybridCacheManager (manages KV + SSM + Conv)
│   └── NO → Create KVCacheManager (manages paged KV only)
└── View Assignment:
    ├── _assign_kv_cache_views() → manager.get_buffers(idx)
    └── _create_and_assign_state_views() → manager.get_ssm_states/get_conv_states

PHASE 5: UNMANAGED RESOURCE ALLOCATION
└── For resources where self._caches[name] is None:
    ├── self._caches[name] = handler.allocate(sequence_info)
    └── Track in _unmanaged_resources list (for proper .to() handling)

PHASE 6: OPTIONAL RESIZE (resize_kv_cache transform)
├── Run forward pass to measure activation memory
├── Shut down existing KVCacheManager
├── Compute: mem_for_paged = (free_mem - non_paged - forward_mem) * free_gpu_fraction
└── Recreate KVCacheManager with optimal max_tokens
```
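To make the Phase 6 formula concrete, here is the same arithmetic with made-up numbers. All constants, including `bytes_per_token`, are purely illustrative; the real transform measures `free_mem` and `forward_mem` at runtime.

```python
# Illustrative PHASE 6 arithmetic; every number here is made up.
free_mem = 40 * 2**30       # free GPU memory before resize (bytes)
non_paged = 2 * 2**30       # locally allocated (unmanaged) resources (bytes)
forward_mem = 6 * 2**30     # activation memory measured by the trial forward pass
free_gpu_fraction = 0.9     # safety fraction given to the paged pool

mem_for_paged = (free_mem - non_paged - forward_mem) * free_gpu_fraction

# Hypothetical per-token KV footprint:
# kv_factor * num_layers * num_kv_heads * head_dim * dtype_size
bytes_per_token = 2 * 32 * 8 * 128 * 2
max_tokens = int(mem_for_paged // bytes_per_token)
print(max_tokens)  # about 236k tokens in this example
```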
## Key Resource Handler Types
| Handler Type | Managed By | Buffer Source | Use Case |
|---|---|---|---|
| `KVPagedResourceHandler` | `KVCacheManager` / `MambaHybridCacheManager` | `manager.get_buffers(idx)` | Paged KV caches for attention |
| `SSMResourceHandler` | `MambaHybridCacheManager` | `manager.get_ssm_states` | Mamba SSM state |
| `CausalConvResourceHandler` | `MambaHybridCacheManager` | `manager.get_conv_states` | Mamba causal conv state |
| Other `ResourceHandler` subclasses | Local allocation | `handler.allocate(sequence_info)` | Generic per-sequence state |
| Other `ResourceHandler` subclasses | Local allocation | `handler.allocate(sequence_info)` | Unpaged per-token resources |
## Key Files and Their Responsibilities
### `tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py`
- `ResourceHandler` (abstract base): Interface for allocating resources
- `KVPagedResourceHandler`: Describes paged KV cache with `num_kv_heads`, `head_dim`, `dtype`, `kv_layout`
- `SSMResourceHandler`: Describes Mamba SSM state with `num_heads`, `head_dim`, `d_state`
- `CausalConvResourceHandler`: Describes causal conv state with `conv_dim`, `d_conv`
- `AttentionDescriptor.get_cache_initializers()`: Returns `ResourceHandlerDict` mapping names to handlers
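Structurally, the handlers can be pictured as dataclass-like records plus an `allocate()` fallback. The sketch below is a rough approximation: the field names match the list above, but the default values and the exact base-class interface are assumptions, not the real definitions.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

import torch


class ResourceHandler(ABC):
    """Abstract description of a cache resource; no memory is held yet."""

    @abstractmethod
    def allocate(self, sequence_info) -> torch.Tensor:
        """Local (unmanaged) allocation fallback."""


@dataclass
class KVPagedResourceHandler(ResourceHandler):
    num_kv_heads: int
    head_dim: int
    dtype: torch.dtype
    kv_factor: int = 2       # assumed default: one K plane + one V plane
    kv_layout: str = "HND"   # assumed default layout

    def allocate(self, sequence_info) -> torch.Tensor:
        # Only reached when this handler is NOT adopted by the shared cache
        # manager; the real implementation allocates an unpaged buffer here.
        raise NotImplementedError
```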
### `tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py`
- `InsertCachedAttention`: Iterates over attention nodes, calls `get_cache_initializers()`, and registers handlers via `cm.add_resource()`
- `InitializeCache`: Triggers `cm.initialize_resources()` to allocate all caches
- `ResizeKVCache`: Runs forward pass, measures memory, and calls `cm.resize_kv_cache_manager()`
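The registration step of `InsertCachedAttention` boils down to the loop below. The `collect_attention_resources` wrapper is hypothetical (added so the snippet stands alone), and the graph rewrite that swaps in the cached attention op is omitted.

```python
def collect_attention_resources(cm, attn_descriptor, attention_nodes, cache_config):
    """Condensed sketch of PHASE 1: collect handlers for each attention node."""
    for idx, attn_node in enumerate(attention_nodes):
        # The descriptor derives cache shapes from FakeTensor metadata.
        handlers = attn_descriptor.get_cache_initializers(attn_node, cache_config)
        for k, handler in handlers.items():
            # Index the name per node so every layer gets its own cache slot:
            # "kv_cache_0", "kv_cache_1", ...
            cm.add_resource(f"{k}_{idx}", handler)
```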
### `tensorrt_llm/_torch/auto_deploy/shim/interface.py`
- `CachedSequenceInterface`: Central class managing all caches
- `_resource_lookup`: Dict of all registered resource handlers
- `_unmanaged_resources`: List tracking locally-allocated (non-managed) resource names
- `add_resource()`: Stores handlers in `_resource_lookup`
- `initialize_resources()`: Initializes caches, creates cache managers, assigns views
- `_identify_managed_kv_resources()`: Finds compatible KV handlers from `_resource_lookup`
- `_identify_managed_state_resources()`: Finds compatible SSM/Conv handlers with constraint checking
- `_create_kv_cache_manager()`: Creates `KVCacheManager` or `MambaHybridCacheManager`
- `_allocate_unmanaged_resources()`: Allocates resources not managed by cache managers, tracks in `_unmanaged_resources`
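Putting the private helpers together, `initialize_resources()` can be read as the skeleton below. The method names come from the list above, but the argument passing between them is simplified and assumed.

```python
# Simplified skeleton of CachedSequenceInterface.initialize_resources();
# helper signatures are assumptions, only the method names are real.
def initialize_resources(self):
    # PHASE 2: one slot per registered handler, in registration order.
    self._caches = {name: None for name in self._resource_lookup}

    # PHASE 3: partition handlers into manager-backed vs. locally allocated.
    kv_managed = self._identify_managed_kv_resources()
    state_managed = self._identify_managed_state_resources()

    # PHASE 4: one shared manager backs all compatible resources via views.
    self._create_kv_cache_manager(kv_managed, state_managed)

    # PHASE 5: any cache still None is allocated by its own handler and
    # tracked in _unmanaged_resources.
    self._allocate_unmanaged_resources()
```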
## Example Flow: FlashInfer Attention
```python
# In flashinfer_attention.py
class FlashinferCachedAttention(AttentionDescriptor):
    @classmethod
    def get_cache_initializers(cls, source_attn_node, cache_config):
        # Cache shapes are derived from the FakeTensor metadata of the K argument.
        k_fake = source_attn_node.args[1].meta["val"]
        return {
            "kv_cache": KVPagedResourceHandler(
                num_kv_heads=k_fake.shape[2],
                head_dim=k_fake.shape[3],
                dtype=cls.resolve_cache_dtype(cache_config.dtype, k_fake.dtype),
                kv_factor=2,
                kv_layout="HND",
            )
        }
```
This handler gets:

1. Collected by `InsertCachedAttention` → `cm.add_resource("kv_cache_0", handler)`
2. Stored in `_resource_lookup` during `initialize_resources()`
3. Identified as manageable by `_identify_managed_kv_resources()` if compatible with other KV handlers
4. View assigned via `self._caches["kv_cache_0"] = manager.get_buffers(0, kv_layout="HND")`
## Compatibility Rules

### KV Cache Compatibility (for `KVCacheManager`)
Handlers are compatible if they match on:
- `head_dim`
- `dtype`
- `kv_factor`
- `kv_layout`

**Note:** `num_kv_heads` can differ (supports GQA/MQA with varying head counts per layer).
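Expressed as a predicate, the rule could look like the sketch below, assuming the field names from the handler list above (the real check lives in `_identify_managed_kv_resources()`):

```python
def kv_compatible(a: KVPagedResourceHandler, b: KVPagedResourceHandler) -> bool:
    # num_kv_heads is deliberately NOT compared: layers with different
    # head counts (GQA/MQA) can still share one KVCacheManager.
    return (
        a.head_dim == b.head_dim
        and a.dtype == b.dtype
        and a.kv_factor == b.kv_factor
        and a.kv_layout == b.kv_layout
    )
```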
### State Resource Compatibility (for `MambaHybridCacheManager`)
**SSM Resources:** Compatible if `state_shape` and `dtype` match.

**Conv Resources:** Compatible if `state_shape` and `dtype` match, AND the `n_groups` constraint holds:

```
conv_dim = head_dim * num_heads + 2 * n_groups * d_state
```

If this constraint cannot be satisfied with an integer `n_groups >= 0`, conv resources fall back to local allocation.
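A minimal sketch of that check is shown below; the `solve_n_groups` function name is hypothetical, while the constraint itself is the one stated above.

```python
from typing import Optional


def solve_n_groups(conv_dim: int, head_dim: int, num_heads: int, d_state: int) -> Optional[int]:
    """Return the integer n_groups >= 0 satisfying the constraint, or None."""
    remainder = conv_dim - head_dim * num_heads
    if remainder < 0 or remainder % (2 * d_state) != 0:
        return None  # constraint fails -> conv_ref = None (local allocation)
    return remainder // (2 * d_state)
```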