Linear KV Cache#
-
class LinearKVCache#
Static Linear KVCache that holds the KVCache for all decoder layers up to maxSequenceLength. The KVCache implements the following design:
- Allocates memory for the maximum supported batch size.
- Memory layout: [numAttentionLayers, maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim]
- Synchronous execution of batch requests: all sequences in the batch run prefill or decode at the same time.
Public Functions
-
LinearKVCache() noexcept = default#
Default constructor.
-
LinearKVCache(CacheConfig const &config, cudaStream_t stream)#
Construct and initialize KV cache.
Allocates device memory for KV cache. Once allocated, memory won’t be reallocated.
- Parameters:
config – Cache configuration
stream – CUDA stream for allocation
- Throws:
std::runtime_error – if CUDA operations fail or data type is unsupported
-
~LinearKVCache() noexcept#
Destructor.
-
LinearKVCache(LinearKVCache const&) = delete#
Deleted copy constructor to avoid copying the large cache buffers.
-
LinearKVCache &operator=(LinearKVCache const&) = delete#
Deleted copy assignment to avoid copying the large cache buffers.
- Returns:
Reference to this
-
LinearKVCache(LinearKVCache&&) noexcept#
Move constructor.
-
LinearKVCache &operator=(LinearKVCache&&) noexcept#
Move assignment operator.
- Returns:
Reference to this
-
rt::Tensor getCombinedKVCacheForDecoderLayer(int32_t decoderLayerIdx)#
Get the combined KVCache for the given decoder layer, for EdgeLLM Attention TRT plugin implementation.
- Parameters:
decoderLayerIdx – The index of the decoder layer.
- Returns:
A non-owned tensor object with shape [batch_size, 2, num_kv_heads, max_sequence_length, head_dim] that points to the combined KVCache memory with shape information.
-
std::pair<rt::Tensor, rt::Tensor> getSeparateKVCacheForDecoderLayer(int32_t decoderLayerIdx)#
Get the separate K and V caches for the given decoder layer, for TRT native KVCacheUpdate/Attention operations. Returns a pair of tensors: the first is the K cache and the second is the V cache.
- Parameters:
decoderLayerIdx – The index of the decoder layer.
- Returns:
A pair of tensors, the first is the K cache and the second is the V cache, with shapes [batch_size, num_kv_heads, max_sequence_length, head_dim].
-
rt::Tensor getSSMStateForLayer(int32_t mambaLayerIdx) noexcept#
Get SSM state tensor for a Mamba layer (non-owned view). Shape: [maxBatchSize, mambaNumHeads, mambaHeadDim, ssmStateSize]
-
rt::Tensor getConvStateForLayer(int32_t mambaLayerIdx) noexcept#
Get conv state tensor for a Mamba layer (non-owned view). Shape: [maxBatchSize, convDim, convKernel]
-
void clearMambaStates(cudaStream_t stream)#
Zero all SSM and conv state buffers (all layers, all batch slots). Called after warmup inference and before CUDA graph capture to ensure a clean starting state.
-
std::vector<rt::Tensor> captureSSMStates(int32_t batchIdx, cudaStream_t stream)#
Copy one batch slot’s SSM states into freshly-allocated tensors (one per Mamba layer). Used to snapshot states when saving a system prompt cache entry.
-
std::vector<rt::Tensor> captureConvStates(int32_t batchIdx, cudaStream_t stream)#
Copy one batch slot’s conv states into freshly-allocated tensors (one per Mamba layer). Used to snapshot states when saving a system prompt cache entry.
-
void resetForNewSequences( )#
Asynchronously reset the KVCache buffer state when setting up a new batch of input contexts.
- Parameters:
hostReuseKVCacheLengths – The lengths of the KVCache to be reused from precomputed KVCache content.
stream – The CUDA stream used to perform GPU memory operations.
- Throws:
std::runtime_error – if tensor shape, location or data type are invalid, or if a CUDA operation fails
-
void commitSequenceLength( )#
Asynchronously commit the KVCache buffer for a prefill request, recording the stored KVCache lengths.
- Parameters:
newContextLengths – [GPU, Int32]: The context lengths to commit for the KVCache.
stream – The CUDA stream used to perform GPU memory operations.
- Throws:
std::runtime_error – if tensor shape, location or data type are invalid
-
void commitSequenceLength(int32_t increment, cudaStream_t stream)#
Commit the KVCache buffer for a decode request, incrementing the KVCache lengths of active sequences by the given amount.
- Parameters:
increment – The amount to increment sequence lengths (typically 1 for a decode step)
stream – The CUDA stream used to perform GPU memory operations.
- Throws:
std::runtime_error – if KV cache lengths tensor has wrong location or data type
-
rt::Tensor &getKVCacheLengths() noexcept#
Get KV cache lengths for active sequences.
- Returns:
Reference to KV cache lengths tensor
-
CacheConfig getConfig() const noexcept#
Get KV cache configuration.
- Returns:
Cache configuration
-
int32_t getActiveBatchSize() const noexcept#
Get active batch size.
- Returns:
Number of active sequences
-
bool getKVCacheAllEmpty() const noexcept#
Get flag indicating whether the KVCaches for all sequences are empty.
- Returns:
True if the KVCaches for all sequences are empty, false otherwise.
-
void setActiveBatchSize(int32_t newActiveBatchSize)#
Set active batch size (for batch eviction).
- Parameters:
newActiveBatchSize – New active batch size after eviction
- Throws:
std::runtime_error – If newActiveBatchSize is out of valid range [0, maxBatchSize]