Hybrid Cache Manager#

class HybridCacheManager#

Top-level cache manager for hybrid Attention + Mamba architectures.

Routes cache access by absolute decoder-layer index to the appropriate sub-manager (KVCacheManager for attention layers, MambaCacheManager for recurrent layers). Owns the shared device KV cache lengths tensor and provides unified compaction, prompt cache capture / restore, and batch management APIs that span both sub-managers.

Public Types

enum class LayerType#

Type of a single decoder layer.

Values:

enumerator kAttention#
enumerator kMamba#

Public Functions

HybridCacheManager() noexcept = default#

Default constructor (no allocation).

HybridCacheManager(Config const &config, cudaStream_t stream)#

Construct and initialise sub-managers and routing tables.

Builds absolute-to-local index mapping for each sub-manager, constructs KVCacheManager and MambaCacheManager, and allocates the shared device KV-cache-lengths tensor (zero-initialised).

Parameters:
  • config – Cache configuration with per-layer type routing

  • stream – CUDA stream for allocation and memset
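
A minimal construction sketch. The Config field names used below (layerTypes, maxBatchSize, maxSeqLen) and the nesting of Config inside HybridCacheManager are assumptions for illustration only; consult the Config documentation for the actual members.

```cpp
// Sketch only: Config is assumed to be nested in HybridCacheManager, and the
// fields layerTypes / maxBatchSize / maxSeqLen are assumed names.
HybridCacheManager::Config config;
config.layerTypes = {HybridCacheManager::LayerType::kAttention,
                     HybridCacheManager::LayerType::kMamba,
                     HybridCacheManager::LayerType::kAttention};
config.maxBatchSize = 8;     // assumed field
config.maxSeqLen    = 4096;  // assumed field

cudaStream_t stream;
cudaStreamCreate(&stream);

// Builds the absolute-to-local routing tables, constructs both sub-managers,
// and zero-initialises the shared KV-cache-lengths tensor on `stream`.
HybridCacheManager cacheManager(config, stream);
```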

~HybridCacheManager() noexcept#

Destructor.

HybridCacheManager(HybridCacheManager const&) = delete#

Deleted copy constructor.

HybridCacheManager &operator=(HybridCacheManager const&) = delete#

Deleted copy assignment.

HybridCacheManager(HybridCacheManager&&) noexcept#

Move constructor.

HybridCacheManager &operator=(HybridCacheManager&&) noexcept#

Move assignment operator.

Returns:

Reference to this

rt::Tensor &getCombinedKVCache(int32_t absLayerIdx)#

Get the combined KV cache for a given absolute layer index (must be an attention layer).

Parameters:

absLayerIdx – Absolute decoder-layer index.

Returns:

Reference to the per-layer tensor [maxBatch, 2, numKVHeads, maxSeqLen, headDim].

std::pair<rt::Tensor, rt::Tensor> getSeparateKVCache(
int32_t absLayerIdx
)#

Get the separate K and V caches for a given absolute layer index (must be an attention layer).

Parameters:

absLayerIdx – Absolute decoder-layer index.

Returns:

Pair of non-owned view tensors (K, V).
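
A short sketch of per-layer KV access, assuming layer 0 is an attention layer in this model:

```cpp
// absLayerIdx must refer to an attention layer; layer 0 is assumed to be one.
int32_t const absLayerIdx = 0;

// Single fused tensor: [maxBatch, 2, numKVHeads, maxSeqLen, headDim].
rt::Tensor &combined = cacheManager.getCombinedKVCache(absLayerIdx);

// Or two non-owning view tensors over the same storage, split into K and V.
auto [keyView, valueView] = cacheManager.getSeparateKVCache(absLayerIdx);
```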

rt::Tensor &getRecurrentState(int32_t absLayerIdx)#

Get the recurrent state for a given absolute layer index (must be a Mamba layer).

Parameters:

absLayerIdx – Absolute decoder-layer index.

Returns:

Reference to the per-layer recurrent state tensor.

rt::Tensor &getConvState(int32_t absLayerIdx)#

Get the conv state for a given absolute layer index (must be a Mamba layer).

Parameters:

absLayerIdx – Absolute decoder-layer index.

Returns:

Reference to the per-layer conv state tensor.
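
A sketch of routing per-layer state access by layer type. The layerTypes argument is assumed to mirror the per-layer routing passed to the constructor via Config; the real lookup may live on the config or on the manager itself.

```cpp
void bindLayerCaches(HybridCacheManager &cacheManager,
                     std::vector<HybridCacheManager::LayerType> const &layerTypes)
{
    for (int32_t absLayerIdx = 0;
         absLayerIdx < static_cast<int32_t>(layerTypes.size()); ++absLayerIdx)
    {
        if (layerTypes[absLayerIdx] == HybridCacheManager::LayerType::kAttention)
        {
            rt::Tensor &kv = cacheManager.getCombinedKVCache(absLayerIdx);
            // ... bind kv to this layer's attention kernel ...
        }
        else // LayerType::kMamba
        {
            rt::Tensor &ssmState  = cacheManager.getRecurrentState(absLayerIdx);
            rt::Tensor &convState = cacheManager.getConvState(absLayerIdx);
            // ... bind both states to this layer's Mamba kernel ...
        }
    }
}
```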

KVCacheManager &getKVCacheManager() noexcept#

Direct access to the KV cache sub-manager.

Returns:

Reference to the KVCacheManager.

MambaCacheManager &getMambaCacheManager() noexcept#

Direct access to the Mamba state sub-manager.

Returns:

Reference to the MambaCacheManager.

std::vector<KVHeadDimGroupView> getKVHeadDimGroups() const#

Read-only views of the pre-computed KV head-dim groups.

Uniform models return a single group; hybrid Gemma4-style models return one group per distinct head dim. The underlying KVLayerInfo arrays are owned by this manager and remain valid for its lifetime.

Returns:

Vector of group views (one per distinct head dim).
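
A sketch of consuming the group views, e.g. one batched launch per distinct head dim. launchBatchedKVKernel is a hypothetical wrapper standing in for the caller's kernel dispatch; it is not part of this API.

```cpp
for (auto const &group : cacheManager.getKVHeadDimGroups())
{
    // group.deviceLayerInfos points to a device-resident kernel::KVLayerInfo
    // array with group.numLayers entries; it remains valid for the lifetime
    // of the manager.
    launchBatchedKVKernel(group.deviceLayerInfos,  // hypothetical dispatch helper
                          group.numLayers,
                          group.headDim,
                          group.maxKVHeads,
                          stream);
}
```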

rt::Tensor &getKVCacheLengths() noexcept#

Get the shared device KV cache lengths tensor.

Returns:

Reference to the device tensor of shape [activeBatchSize].

void resetForNewSequences(
rt::Tensor const &reuseKVCacheLengths,
cudaStream_t stream
)#

Reset state for new sequences. Validates batch size, copies reuse lengths from host to device, and updates the “all empty” flag.

Parameters:
  • reuseKVCacheLengths – Host INT32 tensor with reuse lengths, shape [batchSize].

  • stream – CUDA stream.

void commitSequenceLength(
rt::Tensor const &newContextLengths,
cudaStream_t stream
)#

Commit sequence lengths after prefill (element-wise increment from a GPU tensor).

Parameters:
  • newContextLengths – GPU INT32 tensor of context lengths, shape [activeBatchSize].

  • stream – CUDA stream.

void commitSequenceLength(int32_t increment, cudaStream_t stream)#

Commit sequence lengths after decode (scalar increment for all active sequences).

Parameters:
  • increment – Scalar increment (typically 1).

  • stream – CUDA stream.
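
A sketch of the per-iteration sequence-length lifecycle, assuming the caller has already prepared reuseLengths (host INT32 tensor, shape [batchSize]) and contextLengths (device INT32 tensor, shape [activeBatchSize]):

```cpp
// 1. New sequences arrive: copy per-slot reuse lengths to the device tensor
//    and refresh the "all empty" flag.
cacheManager.resetForNewSequences(reuseLengths, stream);

// 2. After the prefill pass, advance each slot by its actual context length.
cacheManager.commitSequenceLength(contextLengths, stream);

// 3. After every decode step, advance all active slots by one token.
cacheManager.commitSequenceLength(/*increment=*/1, stream);

// The shared device lengths tensor ([activeBatchSize]) can be bound directly
// to attention kernels that need per-slot KV lengths.
rt::Tensor &kvLengths = cacheManager.getKVCacheLengths();
```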

int32_t getActiveBatchSize() const noexcept#

Get the number of active sequences.

Returns:

Active batch size.

void setActiveBatchSize(int32_t newActiveBatchSize)#

Set active batch size (used after batch eviction).

Parameters:

newActiveBatchSize – New active batch size.

Throws:

std::runtime_error – if out of range [0, maxBatchSize].

bool getKVCacheAllEmpty() const noexcept#

Check if KV cache for all sequences is empty.

Returns:

True if no prefill has been committed yet.

void compactBatch(
rt::Tensor const &batchMapping,
int32_t oldBatch,
int32_t newBatch,
cudaStream_t stream
)#

Compact both KV caches and Mamba states after batch eviction.

Parameters:
  • batchMapping – GPU tensor [oldBatch], mapping[i] = newBatchIdx or -1 (evicted).

  • oldBatch – Batch size before eviction.

  • newBatch – Batch size after eviction.

  • stream – CUDA stream.
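
A sketch of shrinking the batch after eviction. batchMapping must already be a GPU INT32 tensor of shape [oldBatch] holding, per old slot, the new slot index or -1 for evicted slots; how it is built (and numEvicted) is up to the caller's scheduler.

```cpp
int32_t const oldBatch = cacheManager.getActiveBatchSize();
int32_t const newBatch = oldBatch - numEvicted;  // numEvicted computed by the scheduler

// Compacts both the KV caches and the Mamba states according to batchMapping.
cacheManager.compactBatch(batchMapping, oldBatch, newBatch, stream);

// Shrink the active batch so subsequent commits and kernels only see live slots.
cacheManager.setActiveBatchSize(newBatch);
```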

std::vector<rt::Tensor> captureKVCache(
int32_t batchIdx,
int32_t sequenceLength,
cudaStream_t stream
)#

Capture KV cache for a single batch slot across all attention layers.

Parameters:
  • batchIdx – Batch slot to capture.

  • sequenceLength – Number of tokens to capture from the cache.

  • stream – CUDA stream.

Returns:

Vector of captured tensors (one per attention layer).

void restoreKVCache(
std::vector<rt::Tensor> const &saved,
int32_t batchIdx,
cudaStream_t stream
)#

Restore KV cache for a single batch slot across all attention layers.

Parameters:
  • saved – Previously captured KV cache tensors.

  • batchIdx – Target batch slot.

  • stream – CUDA stream.
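
A sketch of a prompt-cache round trip for one batch slot, assuming batchIdx, targetBatchIdx, and promptLength (the number of tokens resident in the source slot) are provided by the scheduler:

```cpp
// Snapshot the attention-layer KV cache for slot `batchIdx` (one tensor per layer).
std::vector<rt::Tensor> savedKV =
    cacheManager.captureKVCache(batchIdx, promptLength, stream);

// ... later, when the same prompt is scheduled into a (possibly different) slot ...
cacheManager.restoreKVCache(savedKV, targetBatchIdx, stream);
```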

std::vector<rt::Tensor> captureRecurrentStates(
int32_t batchIdx,
cudaStream_t stream
)#

Capture recurrent states for a single batch slot (delegates to MambaCacheManager).

Parameters:
  • batchIdx – Batch slot to capture.

  • stream – CUDA stream.

Returns:

Vector of captured tensors (one per recurrent layer).

std::vector<rt::Tensor> captureConvStates(
int32_t batchIdx,
cudaStream_t stream
)#

Capture conv states for a single batch slot (delegates to MambaCacheManager).

Parameters:
  • batchIdx – Batch slot to capture.

  • stream – CUDA stream.

Returns:

Vector of captured tensors (one per recurrent layer).
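
For hybrid models a complete prompt-cache snapshot must also include the recurrent layers. A sketch, bundling the three captures for one slot into a hypothetical PromptSnapshot container that is not part of this API:

```cpp
// Hypothetical container for illustration only.
struct PromptSnapshot
{
    std::vector<rt::Tensor> kv;         // one per attention layer
    std::vector<rt::Tensor> recurrent;  // one per recurrent layer
    std::vector<rt::Tensor> conv;       // one per recurrent layer
};

PromptSnapshot snapshot{
    cacheManager.captureKVCache(batchIdx, promptLength, stream),
    cacheManager.captureRecurrentStates(batchIdx, stream),
    cacheManager.captureConvStates(batchIdx, stream),
};
```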

struct KVHeadDimGroupView#

Minimal read-only view of one pre-computed KV head-dim group.

Exposes just what callers (e.g. EAGLE base-verify) need to launch a batched per-layer kernel: a device-resident KVLayerInfo array plus the dispatch parameters. Internal bookkeeping (hostInfos, deviceScratchInfos, etc.) stays private.

Public Members

kernel::KVLayerInfo const *deviceLayerInfos#

Device pointer to layer-info array (size == numLayers).

int32_t numLayers#

Number of KV layers in this group.

int32_t headDim#

Head dimension shared by all layers in this group.

int32_t maxKVHeads#

Maximum numKVHeads across layers in this group.
