Hybrid Cache Manager#

class HybridCacheManager#

Top-level cache manager for hybrid Attention + Mamba architectures.

Routes cache access by absolute decoder-layer index to the appropriate sub-manager (KVCacheManager for attention layers, MambaCacheManager for recurrent layers). Owns the shared device KV cache lengths tensor and provides unified compaction, prompt cache capture / restore, and batch management APIs that span both sub-managers.

Public Types

enum class LayerType#

Type of a single decoder layer.

Values:

enumerator kAttention#
enumerator kMamba#

Public Functions

HybridCacheManager() noexcept = default#

Default constructor (no allocation).

HybridCacheManager(Config const &config, cudaStream_t stream)#

Construct and initialise sub-managers and routing tables.

Builds absolute-to-local index mapping for each sub-manager, constructs KVCacheManager and MambaCacheManager, and allocates the shared device KV-cache-lengths tensor (zero-initialised).

Parameters:
  • config – Cache configuration with per-layer type routing

  • stream – CUDA stream for allocation and memset
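
A minimal construction sketch. The Config field names used below (layerTypes, maxBatchSize, maxSeqLen) and the nesting of Config inside HybridCacheManager are assumptions for illustration only; consult the Config documentation for the actual members.

```cpp
// Sketch only: Config is assumed to be nested in HybridCacheManager, and the
// fields layerTypes / maxBatchSize / maxSeqLen are assumed names.
HybridCacheManager::Config config;
config.layerTypes = {HybridCacheManager::LayerType::kAttention,
                     HybridCacheManager::LayerType::kMamba,
                     HybridCacheManager::LayerType::kAttention};
config.maxBatchSize = 8;     // assumed field
config.maxSeqLen    = 4096;  // assumed field

cudaStream_t stream;
cudaStreamCreate(&stream);

// Builds the absolute-to-local routing tables, constructs both sub-managers,
// and zero-initialises the shared KV-cache-lengths tensor on `stream`.
HybridCacheManager cacheManager(config, stream);
```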

~HybridCacheManager() noexcept#

Destructor.

HybridCacheManager(HybridCacheManager const&) = delete#

Deleted copy constructor.

HybridCacheManager &operator=(HybridCacheManager const&) = delete#

Deleted copy assignment.

HybridCacheManager(HybridCacheManager&&) noexcept#

Move constructor.

HybridCacheManager &operator=(HybridCacheManager&&) noexcept#

Move assignment operator.

Returns:

Reference to this

rt::Tensor &getCombinedKVCache(int32_t absLayerIdx)#

Get the combined KV cache for a given absolute layer index (must be an attention layer).

Parameters:

absLayerIdx – Absolute decoder-layer index.

Returns:

Reference to the per-layer tensor [maxBatch, 2, numKVHeads, maxSeqLen, headDim].

std::pair<rt::Tensor, rt::Tensor> getSeparateKVCache(
int32_t absLayerIdx
)#

Get the separate K and V caches for a given absolute layer index (must be an attention layer).

Parameters:

absLayerIdx – Absolute decoder-layer index.

Returns:

Pair of non-owned view tensors (K, V).
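
A short sketch of per-layer KV access, assuming layer 0 is an attention layer in this model:

```cpp
// absLayerIdx must refer to an attention layer; layer 0 is assumed to be one.
int32_t const absLayerIdx = 0;

// Single fused tensor: [maxBatch, 2, numKVHeads, maxSeqLen, headDim].
rt::Tensor &combined = cacheManager.getCombinedKVCache(absLayerIdx);

// Or two non-owning view tensors over the same storage, split into K and V.
auto [keyView, valueView] = cacheManager.getSeparateKVCache(absLayerIdx);
```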

rt::Tensor &getRecurrentState(int32_t absLayerIdx)#

Get the recurrent state for a given absolute layer index (must be a Mamba layer).

Parameters:

absLayerIdx – Absolute decoder-layer index.

Returns:

Reference to the per-layer recurrent state tensor.

rt::Tensor &getConvState(int32_t absLayerIdx)#

Get the conv state for a given absolute layer index (must be a Mamba layer).

Parameters:

absLayerIdx – Absolute decoder-layer index.

Returns:

Reference to the per-layer conv state tensor.
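
A sketch of routing per-layer state access by layer type. The layerTypes argument is assumed to mirror the per-layer routing passed to the constructor via Config; the real lookup may live on the config or on the manager itself.

```cpp
void bindLayerCaches(HybridCacheManager &cacheManager,
                     std::vector<HybridCacheManager::LayerType> const &layerTypes)
{
    for (int32_t absLayerIdx = 0;
         absLayerIdx < static_cast<int32_t>(layerTypes.size()); ++absLayerIdx)
    {
        if (layerTypes[absLayerIdx] == HybridCacheManager::LayerType::kAttention)
        {
            rt::Tensor &kv = cacheManager.getCombinedKVCache(absLayerIdx);
            // ... bind kv to this layer's attention kernel ...
        }
        else // LayerType::kMamba
        {
            rt::Tensor &ssmState  = cacheManager.getRecurrentState(absLayerIdx);
            rt::Tensor &convState = cacheManager.getConvState(absLayerIdx);
            // ... bind both states to this layer's Mamba kernel ...
        }
    }
}
```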

KVCacheManager &getKVCacheManager() noexcept#

Direct access to the KV cache sub-manager.

Returns:

Reference to the KVCacheManager.

MambaCacheManager &getMambaCacheManager() noexcept#

Direct access to the Mamba state sub-manager.

Returns:

Reference to the MambaCacheManager.

std::vector<KVHeadDimGroupView> getKVHeadDimGroups() const#

Read-only views of the pre-computed KV head-dim groups.

Uniform models return a single group; hybrid Gemma4-style models return one group per distinct head dim. The underlying KVLayerInfo arrays are owned by this manager and remain valid for its lifetime.

Returns:

Vector of group views (one per distinct head dim).
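
A sketch of consuming the group views, e.g. one batched launch per distinct head dim. launchBatchedKVKernel is a hypothetical wrapper standing in for the caller's kernel dispatch; it is not part of this API.

```cpp
for (auto const &group : cacheManager.getKVHeadDimGroups())
{
    // group.deviceLayerInfos points to a device-resident kernel::KVLayerInfo
    // array with group.numLayers entries; it remains valid for the lifetime
    // of the manager.
    launchBatchedKVKernel(group.deviceLayerInfos,  // hypothetical dispatch helper
                          group.numLayers,
                          group.headDim,
                          group.maxKVHeads,
                          stream);
}
```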

rt::Tensor &getKVCacheLengths() noexcept#

Get the shared device KV cache lengths tensor.

Returns:

Reference to the device tensor of shape [activeBatchSize].

void resetForNewSequences(
rt::Tensor const &reuseKVCacheLengths,
cudaStream_t stream
)#

Reset state for new sequences. Validates batch size, copies reuse lengths from host to device, and updates the “all empty” flag.

Parameters:
  • reuseKVCacheLengths – Host INT32 tensor with reuse lengths, shape [batchSize].

  • stream – CUDA stream.

void commitSequenceLength(
rt::Tensor const &newContextLengths,
cudaStream_t stream
)#

Commit sequence lengths after prefill (element-wise increment from a GPU tensor).

Parameters:
  • newContextLengths – GPU INT32 tensor of context lengths, shape [activeBatchSize].

  • stream – CUDA stream.

void commitSequenceLength(int32_t increment, cudaStream_t stream)#

Commit sequence lengths after decode (scalar increment for all active sequences).

Parameters:
  • increment – Scalar increment (typically 1).

  • stream – CUDA stream.
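
A sketch of the per-iteration sequence-length lifecycle, assuming the caller has already prepared reuseLengths (host INT32 tensor, shape [batchSize]) and contextLengths (device INT32 tensor, shape [activeBatchSize]):

```cpp
// 1. New sequences arrive: copy per-slot reuse lengths to the device tensor
//    and refresh the "all empty" flag.
cacheManager.resetForNewSequences(reuseLengths, stream);

// 2. After the prefill pass, advance each slot by its actual context length.
cacheManager.commitSequenceLength(contextLengths, stream);

// 3. After every decode step, advance all active slots by one token.
cacheManager.commitSequenceLength(/*increment=*/1, stream);

// The shared device lengths tensor ([activeBatchSize]) can be bound directly
// to attention kernels that need per-slot KV lengths.
rt::Tensor &kvLengths = cacheManager.getKVCacheLengths();
```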

int32_t getActiveBatchSize() const noexcept#

Get the number of active sequences.

Returns:

Active batch size.

void setActiveBatchSize(int32_t newActiveBatchSize)#

Set active batch size (used after batch eviction).

Parameters:

newActiveBatchSize – New active batch size.

Throws:

std::runtime_error – if out of range [0, maxBatchSize].

bool getKVCacheAllEmpty() const noexcept#

Check if KV cache for all sequences is empty.

Returns:

True if no prefill has been committed yet.

void compactBatch(
rt::Tensor const &batchMapping,
int32_t oldBatch,
int32_t newBatch,
cudaStream_t stream
)#

Compact both KV caches and Mamba states after batch eviction.

Parameters:
  • batchMapping – GPU tensor [oldBatch], mapping[i] = newBatchIdx or -1 (evicted).

  • oldBatch – Batch size before eviction.

  • newBatch – Batch size after eviction.

  • stream – CUDA stream.
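
A sketch of shrinking the batch after eviction. batchMapping must already be a GPU INT32 tensor of shape [oldBatch] holding, per old slot, the new slot index or -1 for evicted slots; how it is built (and numEvicted) is up to the caller's scheduler.

```cpp
int32_t const oldBatch = cacheManager.getActiveBatchSize();
int32_t const newBatch = oldBatch - numEvicted;  // numEvicted computed by the scheduler

// Compacts both the KV caches and the Mamba states according to batchMapping.
cacheManager.compactBatch(batchMapping, oldBatch, newBatch, stream);

// Shrink the active batch so subsequent commits and kernels only see live slots.
cacheManager.setActiveBatchSize(newBatch);
```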

std::vector<rt::Tensor> captureKVCache(
int32_t batchIdx,
int32_t sequenceLength,
cudaStream_t stream
)#

Capture KV cache for a single batch slot across all attention layers.

Parameters:
  • batchIdx – Batch slot to capture.

  • sequenceLength – Number of tokens to capture from the cache.

  • stream – CUDA stream.

Returns:

Vector of captured tensors (one per attention layer).

void restoreKVCache(
std::vector<rt::Tensor> const &saved,
int32_t batchIdx,
cudaStream_t stream
)#

Restore KV cache for a single batch slot across all attention layers.

Parameters:
  • saved – Previously captured KV cache tensors.

  • batchIdx – Target batch slot.

  • stream – CUDA stream.
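
A sketch of a prompt-cache round trip for one batch slot, assuming batchIdx, targetBatchIdx, and promptLength (the number of tokens resident in the source slot) are provided by the scheduler:

```cpp
// Snapshot the attention-layer KV cache for slot `batchIdx` (one tensor per layer).
std::vector<rt::Tensor> savedKV =
    cacheManager.captureKVCache(batchIdx, promptLength, stream);

// ... later, when the same prompt is scheduled into a (possibly different) slot ...
cacheManager.restoreKVCache(savedKV, targetBatchIdx, stream);
```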

std::vector<rt::Tensor> captureRecurrentStates(
int32_t batchIdx,
cudaStream_t stream
)#

Capture recurrent states for a single batch slot (delegates to MambaCacheManager).

Parameters:
  • batchIdx – Batch slot to capture.

  • stream – CUDA stream.

Returns:

Vector of captured tensors (one per recurrent layer).

std::vector<rt::Tensor> captureConvStates(
int32_t batchIdx,
cudaStream_t stream
)#

Capture conv states for a single batch slot (delegates to MambaCacheManager).

Parameters:
  • batchIdx – Batch slot to capture.

  • stream – CUDA stream.

Returns:

Vector of captured tensors (one per recurrent layer).
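
For hybrid models a complete prompt-cache snapshot must also include the recurrent layers. A sketch, bundling the three captures for one slot into a hypothetical PromptSnapshot container that is not part of this API:

```cpp
// Hypothetical container for illustration only.
struct PromptSnapshot
{
    std::vector<rt::Tensor> kv;         // one per attention layer
    std::vector<rt::Tensor> recurrent;  // one per recurrent layer
    std::vector<rt::Tensor> conv;       // one per recurrent layer
};

PromptSnapshot snapshot{
    cacheManager.captureKVCache(batchIdx, promptLength, stream),
    cacheManager.captureRecurrentStates(batchIdx, stream),
    cacheManager.captureConvStates(batchIdx, stream),
};
```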

struct KVHeadDimGroupView#

Minimal read-only view of one pre-computed KV head-dim group.

Exposes just what callers (e.g. EAGLE base-verify) need to launch a batched per-layer kernel: a device-resident KVLayerInfo array plus the dispatch parameters. Internal bookkeeping (hostInfos, deviceScratchInfos, etc.) stays private.

Public Members

kernel::KVLayerInfo const *deviceLayerInfos#

Device pointer to layer-info array (size == numLayers).

int32_t numLayers#

Number of KV layers in this group.

int32_t headDim#

Head dimension shared by all layers in this group.

int32_t maxKVHeads#

Maximum numKVHeads across layers in this group.
