KV Cache Manager#

class KVCacheManager#

Per-layer KV cache manager that supports heterogeneous head configurations across layers. Each attention layer gets its own independently sized tensor with shape [maxBatchSize, 2, numKVHeads_i, maxSequenceLength, headDim_i]. This replaces the monolithic LinearKVCache allocation when layers differ in numKVHeads or headDim.
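A minimal usage sketch follows. The Config fields shown (maxBatchSize, maxSequenceLength, layerConfigs) and the nesting of Config inside KVCacheManager are assumptions for illustration and may not match the actual header; only the member functions documented below are taken from this class.

    #include <cuda_runtime.h>

    // Sketch only: Config's field names and nesting are assumed, not guaranteed by this header.
    KVCacheManager::Config config;
    config.maxBatchSize = 8;
    config.maxSequenceLength = 4096;
    config.layerConfigs = {
        {/*numKVHeads=*/8, /*headDim=*/128},  // layer with 8 KV heads of dim 128
        {/*numKVHeads=*/2, /*headDim=*/64},   // layer with a different head setup
    };

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Allocates one device tensor per attention layer; throws std::runtime_error
    // if the config is invalid or the data type is unsupported.
    KVCacheManager cacheManager(config, stream);

    for (int32_t i = 0; i < cacheManager.numLayers(); ++i)
    {
        // Shape: [maxBatchSize, 2, numKVHeads_i, maxSequenceLength, headDim_i]
        rt::Tensor &kv = cacheManager.getCombinedKVCache(i);
        // ... pass kv to the attention implementation for layer i ...
    }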

Public Functions

KVCacheManager() noexcept = default#

Default constructor.

KVCacheManager(Config const &config, cudaStream_t stream)#

Construct and initialize per-layer KV cache.

Allocates one device tensor per attention layer; once allocated, the memory is never reallocated. Also determines whether all layers share the same numKVHeads and headDim (uniform mode).

Parameters:
  • config – Cache configuration with per-layer configs

  • stream – CUDA stream for allocation

Throws:

std::runtime_error – if config is invalid or data type is unsupported

~KVCacheManager() noexcept#

Destructor.

KVCacheManager(KVCacheManager const&) = delete#

Deleted copy constructor to avoid copying large cache allocations.

KVCacheManager &operator=(KVCacheManager const&) = delete#

Deleted copy assignment operator to avoid copying large cache allocations.

KVCacheManager(KVCacheManager&&) noexcept#

Move constructor.

KVCacheManager &operator=(KVCacheManager&&) noexcept#

Move assignment operator.

Returns:

Reference to this

rt::Tensor &getCombinedKVCache(int32_t attnLayerIdx) noexcept#

Get the combined KV cache tensor for the given attention layer.

Parameters:

attnLayerIdx – The index of the attention layer.

Returns:

A reference to the tensor with shape [maxBatchSize, 2, numKVHeads_i, maxSequenceLength, headDim_i].

std::pair<rt::Tensor, rt::Tensor> getSeparateKVCache(
int32_t attnLayerIdx
) noexcept#

Get the separate K and V caches for the given attention layer. Returns a pair of non-owning view tensors: the first is the K cache and the second is the V cache.

Parameters:

attnLayerIdx – The index of the attention layer.

Returns:

A pair of non-owning view tensors, each with shape [maxBatchSize, numKVHeads_i, maxSequenceLength, headDim_i].
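When a kernel expects K and V as separate tensors, the per-layer views can be obtained as sketched below (continuing the cacheManager example above); no data is copied.

    // Non-owning views over the same per-layer allocation.
    auto [kCache, vCache] = cacheManager.getSeparateKVCache(attnLayerIdx);
    // kCache and vCache each have shape
    // [maxBatchSize, numKVHeads_i, maxSequenceLength, headDim_i].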

KVLayerConfig const &getLayerConfig(
int32_t attnLayerIdx
) const noexcept#

Get the layer configuration for the given attention layer.

Parameters:

attnLayerIdx – The index of the attention layer.

Returns:

The KVLayerConfig for this layer.

int32_t numLayers() const noexcept#

Get the number of attention layers.

Returns:

Number of attention layers

bool isUniform() const noexcept#

Check if all layers have the same numKVHeads and headDim.

Returns:

True if all layers are uniform
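A sketch of how the uniform flag might drive dispatch; launchUniform and launchPerLayer are hypothetical placeholders, not part of this API:

    if (cacheManager.isUniform())
    {
        // Every layer shares numKVHeads and headDim, so layer 0's config describes all layers.
        KVLayerConfig const &cfg = cacheManager.getLayerConfig(0);
        launchUniform(cfg.numKVHeads, cfg.headDim);          // hypothetical helper
    }
    else
    {
        for (int32_t i = 0; i < cacheManager.numLayers(); ++i)
        {
            KVLayerConfig const &cfg = cacheManager.getLayerConfig(i);
            launchPerLayer(i, cfg.numKVHeads, cfg.headDim);  // hypothetical helper
        }
    }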

Config const &getConfig() const noexcept#

Get cache configuration.

Returns:

Cache configuration

struct KVLayerConfig#

Per-layer KV head configuration for heterogeneous models.

Public Members

int32_t numKVHeads = {}#

Number of key-value heads for this layer.

int32_t headDim = {}#

Head dimension for this layer.
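
KVLayerConfig is a plain aggregate whose members default to zero, so per-layer configurations can be built with brace initialization; the values below are illustrative only:

    std::vector<KVLayerConfig> layerConfigs;
    layerConfigs.push_back({/*numKVHeads=*/8, /*headDim=*/128});
    layerConfigs.push_back({/*numKVHeads=*/8, /*headDim=*/128});
    layerConfigs.push_back({/*numKVHeads=*/1, /*headDim=*/576});  // one layer with a different shape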