KV Cache Manager#

class KVCacheManager#

Per-layer KV cache manager that supports heterogeneous head configurations across layers. Each attention layer gets its own independently sized tensor with shape [maxBatchSize, 2, numKVHeads_i, maxSequenceLength, headDim_i]. This replaces the monolithic LinearKVCache allocation when layers differ in numKVHeads or headDim.
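A minimal usage sketch follows. The Config fields shown (maxBatchSize, maxSequenceLength, layerConfigs) and the nesting of Config inside KVCacheManager are assumptions for illustration and may not match the actual header; only the member functions documented below are taken from this class.

    #include <cuda_runtime.h>

    // Sketch only: Config's field names and nesting are assumed, not guaranteed by this header.
    KVCacheManager::Config config;
    config.maxBatchSize = 8;
    config.maxSequenceLength = 4096;
    config.layerConfigs = {
        {/*numKVHeads=*/8, /*headDim=*/128},  // layer with 8 KV heads of dim 128
        {/*numKVHeads=*/2, /*headDim=*/64},   // layer with a different head setup
    };

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Allocates one device tensor per attention layer; throws std::runtime_error
    // if the config is invalid or the data type is unsupported.
    KVCacheManager cacheManager(config, stream);

    for (int32_t i = 0; i < cacheManager.numLayers(); ++i)
    {
        // Shape: [maxBatchSize, 2, numKVHeads_i, maxSequenceLength, headDim_i]
        rt::Tensor &kv = cacheManager.getCombinedKVCache(i);
        // ... pass kv to the attention implementation for layer i ...
    }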

Public Functions

KVCacheManager() noexcept = default#

Default constructor.

KVCacheManager(Config const &config, cudaStream_t stream)#

Construct and initialize per-layer KV cache.

Allocates one device tensor per attention layer; once allocated, the memory is never reallocated. Also determines whether all layers share the same numKVHeads and headDim (uniform mode).

Parameters:
  • config – Cache configuration with per-layer configs

  • stream – CUDA stream for allocation

Throws:

std::runtime_error – if config is invalid or data type is unsupported

~KVCacheManager() noexcept#

Destructor.

KVCacheManager(KVCacheManager const&) = delete#

Deleted copy constructor to avoid copying large cache allocations.

KVCacheManager &operator=(KVCacheManager const&) = delete#

Deleted copy assignment operator to avoid copying large cache allocations.

KVCacheManager(KVCacheManager&&) noexcept#

Move constructor.

KVCacheManager &operator=(KVCacheManager&&) noexcept#

Move assignment operator.

Returns:

Reference to this

rt::Tensor &getCombinedKVCache(int32_t attnLayerIdx) noexcept#

Get the combined KV cache tensor for the given attention layer.

Parameters:

attnLayerIdx – The index of the attention layer.

Returns:

A reference to the tensor with shape [maxBatchSize, 2, numKVHeads_i, maxSequenceLength, headDim_i].

std::pair<rt::Tensor, rt::Tensor> getSeparateKVCache(
int32_t attnLayerIdx
) noexcept#

Get the separate K and V caches for the given attention layer. Returns a pair of non-owning view tensors: the first is the K cache and the second is the V cache.

Parameters:

attnLayerIdx – The index of the attention layer.

Returns:

A pair of non-owning view tensors, each with shape [maxBatchSize, numKVHeads_i, maxSequenceLength, headDim_i].
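When a kernel expects K and V as separate tensors, the per-layer views can be obtained as sketched below (continuing the cacheManager example above); no data is copied.

    // Non-owning views over the same per-layer allocation.
    auto [kCache, vCache] = cacheManager.getSeparateKVCache(attnLayerIdx);
    // kCache and vCache each have shape
    // [maxBatchSize, numKVHeads_i, maxSequenceLength, headDim_i].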

KVLayerConfig const &getLayerConfig(
int32_t attnLayerIdx
) const noexcept#

Get the layer configuration for the given attention layer.

Parameters:

attnLayerIdx – The index of the attention layer.

Returns:

The KVLayerConfig for this layer.

int32_t numLayers() const noexcept#

Get the number of attention layers.

Returns:

Number of attention layers

bool isUniform() const noexcept#

Check if all layers have the same numKVHeads and headDim.

Returns:

True if all layers are uniform
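A sketch of how the uniform flag might drive dispatch; launchUniform and launchPerLayer are hypothetical placeholders, not part of this API:

    if (cacheManager.isUniform())
    {
        // Every layer shares numKVHeads and headDim, so layer 0's config describes all layers.
        KVLayerConfig const &cfg = cacheManager.getLayerConfig(0);
        launchUniform(cfg.numKVHeads, cfg.headDim);          // hypothetical helper
    }
    else
    {
        for (int32_t i = 0; i < cacheManager.numLayers(); ++i)
        {
            KVLayerConfig const &cfg = cacheManager.getLayerConfig(i);
            launchPerLayer(i, cfg.numKVHeads, cfg.headDim);  // hypothetical helper
        }
    }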

Config const &getConfig() const noexcept#

Get cache configuration.

Returns:

Cache configuration

struct KVLayerConfig#

Per-layer KV head configuration for heterogeneous models.

Public Members

int32_t numKVHeads = {}#

Number of key-value heads for this layer.

int32_t headDim = {}#

Head dimension for this layer.
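
KVLayerConfig is a plain aggregate whose members default to zero, so per-layer configurations can be built with brace initialization; the values below are illustrative only:

    std::vector<KVLayerConfig> layerConfigs;
    layerConfigs.push_back({/*numKVHeads=*/8, /*headDim=*/128});
    layerConfigs.push_back({/*numKVHeads=*/8, /*headDim=*/128});
    layerConfigs.push_back({/*numKVHeads=*/1, /*headDim=*/576});  // one layer with a different shape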