Hybrid Cache Manager#
-
class HybridCacheManager#
Top-level cache manager for hybrid Attention + Mamba architectures.
Routes cache access by absolute decoder-layer index to the appropriate sub-manager (KVCacheManager for attention layers, MambaCacheManager for recurrent layers). Owns the shared device KV cache lengths tensor and provides unified compaction, prompt cache capture/restore, and batch management APIs that span both sub-managers.
Public Types
Public Functions
-
HybridCacheManager() noexcept = default#
Default constructor (no allocation).
-
HybridCacheManager(Config const &config, cudaStream_t stream)#
Construct and initialise sub-managers and routing tables.
Builds the absolute-to-local index mapping for each sub-manager, constructs KVCacheManager and MambaCacheManager, and allocates the shared device KV-cache-lengths tensor (zero-initialised).
- Parameters:
config – Cache configuration with per-layer type routing
stream – CUDA stream for allocation and memset
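The absolute-to-local routing the constructor builds can be sketched on the host as follows. This is an illustrative reconstruction, not the manager's actual implementation: each absolute decoder-layer index receives a local index within whichever sub-manager owns that layer type.

```cpp
#include <cstdint>
#include <vector>

enum class LayerType { Attention, Mamba };

// Hypothetical routing tables: absToLocal[i] is the layer's index inside
// its owning sub-manager (KVCacheManager for attention, MambaCacheManager
// for recurrent layers).
struct RoutingTables {
    std::vector<int32_t> absToLocal;
    int32_t numAttention = 0; // size of KVCacheManager's local index space
    int32_t numMamba = 0;     // size of MambaCacheManager's local index space
};

RoutingTables buildRouting(std::vector<LayerType> const& layerTypes) {
    RoutingTables t;
    t.absToLocal.reserve(layerTypes.size());
    for (LayerType lt : layerTypes) {
        if (lt == LayerType::Attention)
            t.absToLocal.push_back(t.numAttention++);
        else
            t.absToLocal.push_back(t.numMamba++);
    }
    return t;
}
```

With this mapping, an accessor such as getCombinedKVCache can translate an absolute layer index into the sub-manager's local index in O(1).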
-
~HybridCacheManager() noexcept#
Destructor.
-
HybridCacheManager(HybridCacheManager const&) = delete#
Deleted copy constructor.
-
HybridCacheManager &operator=(HybridCacheManager const&) = delete#
Deleted copy assignment.
-
HybridCacheManager(HybridCacheManager&&) noexcept#
Move constructor.
-
HybridCacheManager &operator=(HybridCacheManager&&) noexcept#
Move assignment operator.
- Returns:
Reference to this.
-
rt::Tensor &getCombinedKVCache(int32_t absLayerIdx)#
Get the combined KV cache for a given absolute layer index (must be an attention layer).
- Parameters:
absLayerIdx – Absolute decoder-layer index.
- Returns:
Reference to the per-layer tensor of shape [maxBatch, 2, numKVHeads, maxSeqLen, headDim].
-
std::pair<rt::Tensor, rt::Tensor> getSeparateKVCache(int32_t absLayerIdx)#
Get the separate K and V caches for a given absolute layer index (must be an attention layer).
- Parameters:
absLayerIdx – Absolute decoder-layer index.
- Returns:
Pair of non-owned view tensors (K, V).
-
rt::Tensor &getRecurrentState(int32_t absLayerIdx)#
Get the recurrent state for a given absolute layer index (must be a Mamba layer).
- Parameters:
absLayerIdx – Absolute decoder-layer index.
- Returns:
Reference to the per-layer recurrent state tensor.
-
rt::Tensor &getConvState(int32_t absLayerIdx)#
Get the conv state for a given absolute layer index (must be a Mamba layer).
- Parameters:
absLayerIdx – Absolute decoder-layer index.
- Returns:
Reference to the per-layer conv state tensor.
-
KVCacheManager &getKVCacheManager() noexcept#
Direct access to the KV cache sub-manager.
- Returns:
Reference to the KVCacheManager.
-
MambaCacheManager &getMambaCacheManager() noexcept#
Direct access to the Mamba state sub-manager.
- Returns:
Reference to the MambaCacheManager.
-
std::vector<KVHeadDimGroupView> getKVHeadDimGroups() const#
Read-only views of the pre-computed KV head-dim groups.
Uniform models return a single group; hybrid Gemma4-style models return one group per distinct head dim. The underlying KVLayerInfo arrays are owned by this manager and remain valid for its lifetime.
- Returns:
Vector of group views (one per distinct head dim).
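How such head-dim groups can be pre-computed is sketched below. The types are illustrative stand-ins (not the real KVLayerInfo): attention layers sharing a head dimension land in one group, and each group records the maximum KV-head count needed for kernel dispatch.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Illustrative per-layer descriptor and group summary (hypothetical types).
struct LayerDesc { int32_t headDim; int32_t numKVHeads; };
struct Group { int32_t headDim = 0; int32_t numLayers = 0; int32_t maxKVHeads = 0; };

// Group attention layers by distinct head dimension; a uniform model
// produces exactly one group, a hybrid one group per distinct head dim.
std::vector<Group> groupByHeadDim(std::vector<LayerDesc> const& layers) {
    std::map<int32_t, Group> byDim;
    for (auto const& l : layers) {
        Group& g = byDim[l.headDim];
        g.headDim = l.headDim;
        g.numLayers++;
        g.maxKVHeads = std::max(g.maxKVHeads, l.numKVHeads);
    }
    std::vector<Group> out;
    for (auto const& kv : byDim) out.push_back(kv.second);
    return out;
}
```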
-
rt::Tensor &getKVCacheLengths() noexcept#
Get the shared device KV cache lengths tensor.
- Returns:
Reference to the device tensor of shape [activeBatchSize].
-
void resetForNewSequences(rt::Tensor const &reuseKVCacheLengths, cudaStream_t stream)#
Reset state for new sequences. Validates batch size, copies reuse lengths from host to device, and updates the “all empty” flag.
- Parameters:
reuseKVCacheLengths – Host INT32 tensor with reuse lengths, shape [batchSize].
stream – CUDA stream.
-
void commitSequenceLength(rt::Tensor const &newContextLengths, cudaStream_t stream)#
Commit sequence lengths after prefill (element-wise increment from a GPU tensor).
- Parameters:
newContextLengths – GPU INT32 tensor of context lengths, shape [activeBatchSize].
stream – CUDA stream.
-
void commitSequenceLength(int32_t increment, cudaStream_t stream)#
Commit sequence lengths after decode (scalar increment for all active sequences).
- Parameters:
increment – Scalar increment (typically 1).
stream – CUDA stream.
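The semantics of the two commitSequenceLength overloads can be mirrored on the host as below. This is a toy sketch only (the real lengths tensor is device-resident and updated on a CUDA stream): prefill commits per-sequence context lengths element-wise, then each decode step commits a scalar increment to every active sequence.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Element-wise commit after prefill: lengths[i] += contextLengths[i].
void commitPrefill(std::vector<int32_t>& lengths,
                   std::vector<int32_t> const& contextLengths) {
    for (std::size_t i = 0; i < lengths.size(); ++i)
        lengths[i] += contextLengths[i];
}

// Scalar commit after a decode step: every active sequence grows by the
// same increment (typically 1 token per step).
void commitDecode(std::vector<int32_t>& lengths, int32_t increment) {
    for (auto& len : lengths)
        len += increment;
}
```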
-
int32_t getActiveBatchSize() const noexcept#
Get the number of active sequences.
- Returns:
Active batch size.
-
void setActiveBatchSize(int32_t newActiveBatchSize)#
Set active batch size (used after batch eviction).
- Parameters:
newActiveBatchSize – New active batch size.
- Throws:
std::runtime_error – if out of range [0, maxBatchSize].
-
bool getKVCacheAllEmpty() const noexcept#
Check if KV cache for all sequences is empty.
- Returns:
True if no prefill has been committed yet.
-
void compactBatch(rt::Tensor const &batchMapping, int32_t oldBatch, int32_t newBatch, cudaStream_t stream)#
Compact both KV caches and Mamba states after batch eviction.
- Parameters:
batchMapping – GPU tensor [oldBatch], mapping[i] = newBatchIdx or -1 (evicted).
oldBatch – Batch size before eviction.
newBatch – Batch size after eviction.
stream – CUDA stream.
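The batchMapping layout described above can be constructed as in this host-side sketch (the real mapping is a GPU tensor): surviving slots are renumbered densely in order, and evicted slots map to -1.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a compaction mapping from an eviction mask: mapping[i] is the
// slot's new dense index, or -1 if the slot was evicted. newBatch receives
// the batch size after eviction.
std::vector<int32_t> makeBatchMapping(std::vector<bool> const& evicted,
                                      int32_t& newBatch) {
    std::vector<int32_t> mapping(evicted.size());
    newBatch = 0;
    for (std::size_t i = 0; i < evicted.size(); ++i)
        mapping[i] = evicted[i] ? -1 : newBatch++;
    return mapping;
}
```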
-
std::vector<rt::Tensor> captureKVCache(int32_t batchIdx, int32_t sequenceLength, cudaStream_t stream)#
Capture KV cache for a single batch slot across all attention layers.
- Parameters:
batchIdx – Batch slot to capture.
sequenceLength – Number of tokens to capture from the cache.
stream – CUDA stream.
- Returns:
Vector of captured tensors (one per attention layer).
-
void restoreKVCache(std::vector<rt::Tensor> const &saved, int32_t batchIdx, cudaStream_t stream)#
Restore KV cache for a single batch slot across all attention layers.
- Parameters:
saved – Previously captured KV cache tensors.
batchIdx – Target batch slot.
stream – CUDA stream.
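The capture/restore round-trip used for prompt caching can be illustrated with a toy host-side model of one batch slot (the real tensors are device-resident; tokenStride here is a hypothetical flattened per-token size): capture copies the first sequenceLength tokens out of a slot, and restore writes them back later, possibly into a different slot.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using Slot = std::vector<float>; // one batch slot's flattened cache

// Copy the first `sequenceLength` tokens' worth of cache out of a slot.
std::vector<float> capture(Slot const& slot, int32_t sequenceLength,
                           int32_t tokenStride) {
    return std::vector<float>(slot.begin(),
                              slot.begin() + sequenceLength * tokenStride);
}

// Write previously captured data back into the front of a target slot.
void restore(std::vector<float> const& saved, Slot& slot) {
    std::copy(saved.begin(), saved.end(), slot.begin());
}
```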
-
std::vector<rt::Tensor> captureRecurrentStates(int32_t batchIdx, cudaStream_t stream)#
Capture recurrent states for a single batch slot (delegates to MambaCacheManager).
- Parameters:
batchIdx – Batch slot to capture.
stream – CUDA stream.
- Returns:
Vector of captured tensors (one per recurrent layer).
-
std::vector<rt::Tensor> captureConvStates(int32_t batchIdx, cudaStream_t stream)#
Capture conv states for a single batch slot (delegates to MambaCacheManager).
- Parameters:
batchIdx – Batch slot to capture.
stream – CUDA stream.
- Returns:
Vector of captured tensors (one per recurrent layer).
-
struct KVHeadDimGroupView#
Minimal read-only view of one pre-computed KV head-dim group.
Exposes just what callers (e.g. EAGLE base-verify) need to launch a batched per-layer kernel: a device-resident KVLayerInfo array plus the dispatch parameters. Internal bookkeeping (hostInfos, deviceScratchInfos, etc.) stays private.
Public Members
-
kernel::KVLayerInfo const *deviceLayerInfos#
Device pointer to layer-info array (size == numLayers)
-
int32_t numLayers#
Number of KV layers in this group.
-
int32_t headDim#
Head dimension shared by all layers in this group.
-
int32_t maxKVHeads#
Maximum numKVHeads across layers in this group.