KV Cache Manager#
-
class KVCacheManager#
Per-layer KV cache manager that supports heterogeneous head configurations across layers. Each attention layer gets its own independently-sized tensor with shape [maxBatchSize, 2, numKVHeads_i, maxSequenceLength, headDim_i]. This replaces the monolithic LinearKVCache allocation when layers have different numKVHeads or headDim.
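For example (illustrative values only), a model mixing a full-attention layer with a compressed-KV layer gets independently sized tensors per layer:

layer 0: numKVHeads_0 = 8, headDim_0 = 128  ->  [maxBatchSize, 2, 8, maxSequenceLength, 128]
layer 1: numKVHeads_1 = 2, headDim_1 = 64   ->  [maxBatchSize, 2, 2, maxSequenceLength, 64]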
Public Functions
-
KVCacheManager() noexcept = default#
Default constructor.
-
KVCacheManager(Config const &config, cudaStream_t stream)#
Construct and initialize per-layer KV cache.
Allocates one device tensor per attention layer. Once allocated, memory won’t be reallocated. Determines whether all layers share the same numKVHeads and headDim (uniform mode).
- Parameters:
config – Cache configuration with per-layer configs
stream – CUDA stream for allocation
- Throws:
std::runtime_error – if config is invalid or data type is unsupported
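A minimal construction sketch follows. The Config and KVLayerConfig member names used here (maxBatchSize, maxSequenceLength, layerConfigs, numKVHeads, headDim), and the assumption that Config is a nested type, are illustrative only and are not specified on this page.

// Sketch only: Config / KVLayerConfig member names are assumed, not documented here.
cudaStream_t stream;
cudaStreamCreate(&stream);

KVCacheManager::Config config;                        // assumed nested type
config.maxBatchSize = 4;                              // hypothetical field
config.maxSequenceLength = 2048;                      // hypothetical field
config.layerConfigs = {                               // hypothetical field: one entry per attention layer
    KVLayerConfig{/*numKVHeads=*/8, /*headDim=*/128},
    KVLayerConfig{/*numKVHeads=*/2, /*headDim=*/64},
};

// Allocates one device tensor per attention layer; throws std::runtime_error
// if the config is invalid or the data type is unsupported.
KVCacheManager cacheManager(config, stream);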
-
~KVCacheManager() noexcept#
Destructor.
-
KVCacheManager(KVCacheManager const&) = delete#
Deleted copy constructor to avoid large data copy.
-
KVCacheManager &operator=(KVCacheManager const&) = delete#
Deleted copy assignment to avoid large data copy.
-
KVCacheManager(KVCacheManager&&) noexcept#
Move constructor.
-
KVCacheManager &operator=(KVCacheManager&&) noexcept#
Move assignment operator.
- Returns:
Reference to this
-
rt::Tensor &getCombinedKVCache(int32_t attnLayerIdx) noexcept#
Get the combined KVCache for the given attention layer.
- Parameters:
attnLayerIdx – The index of the attention layer.
- Returns:
A reference to the tensor with shape [maxBatchSize, 2, numKVHeads_i, maxSequenceLength, headDim_i].
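Usage sketch, assuming a manager built as in the constructor example above:

// Combined cache for attention layer 0, shaped
// [maxBatchSize, 2, numKVHeads_0, maxSequenceLength, headDim_0].
rt::Tensor &kv0 = cacheManager.getCombinedKVCache(/*attnLayerIdx=*/0);
// The reference remains valid for the manager's lifetime: memory is not reallocated after construction.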
-
std::pair<rt::Tensor, rt::Tensor> getSeparateKVCache(int32_t attnLayerIdx)#
Get the separate K and V caches for the given attention layer. Returns a pair of non-owning view tensors; the first is the K cache and the second is the V cache.
- Parameters:
attnLayerIdx – The index of the attention layer.
- Returns:
A pair of tensors with shapes [maxBatchSize, numKVHeads_i, maxSequenceLength, headDim_i].
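For kernels that take separate K and V inputs, the returned pair can be unpacked directly (sketch):

// Non-owning views into the same per-layer allocation; first = K cache, second = V cache.
auto [kCache0, vCache0] = cacheManager.getSeparateKVCache(/*attnLayerIdx=*/0);
// Each view has shape [maxBatchSize, numKVHeads_0, maxSequenceLength, headDim_0].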
-
KVLayerConfig const &getLayerConfig(int32_t attnLayerIdx)#
Get the layer configuration for the given attention layer.
- Parameters:
attnLayerIdx – The index of the attention layer.
- Returns:
The KVLayerConfig for this layer.
-
int32_t numLayers() const noexcept#
Get the number of attention layers.
- Returns:
Number of attention layers
-
bool isUniform() const noexcept#
Check if all layers have the same numKVHeads and headDim.
- Returns:
True if all layers are uniform
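Together with numLayers() and getLayerConfig(), this supports a simple dispatch pattern (sketch; what KVLayerConfig exposes is not documented on this page):

if (cacheManager.isUniform())
{
    // All layers share numKVHeads and headDim; one kernel configuration covers every layer.
}
else
{
    for (int32_t i = 0; i < cacheManager.numLayers(); ++i)
    {
        KVLayerConfig const &layerCfg = cacheManager.getLayerConfig(i);
        // Configure the per-layer attention kernel from layerCfg.
    }
}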
-
Config const &getConfig() const noexcept#
Get cache configuration.
- Returns:
Cache configuration
-
struct KVLayerConfig#
Per-layer KV head configuration for heterogeneous models.