Linear KV Cache#

class LinearKVCache#

Static linear KVCache that holds the KV cache for all decoder layers up to maxSequenceLength. The KVCache implements the following design:

  1. Allocates memory for max supported batch size.

  2. Memory Layout: [numAttentionLayers, maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim]

  3. Synchronous execution of batch requests: all sequences in the batch run prefill or decode at the same time.
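The memory layout above can be made concrete with a small index-arithmetic sketch. The dimension values below are illustrative placeholders, not values from this library:

```cpp
#include <cstdint>

// Illustrative dimensions (assumed for this sketch, not from the source).
constexpr int64_t kNumLayers = 4, kMaxBatch = 2, kNumKVHeads = 8,
                  kMaxSeqLen = 128, kHeadDim = 64;

// Flat element offset into the single buffer laid out as
// [numAttentionLayers, maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim],
// where kv = 0 selects keys and kv = 1 selects values.
constexpr int64_t elementOffset(int64_t layer, int64_t batch, int64_t kv,
                                int64_t head, int64_t pos, int64_t dim) {
    return ((((layer * kMaxBatch + batch) * 2 + kv) * kNumKVHeads + head)
                * kMaxSeqLen + pos) * kHeadDim + dim;
}
```

With this row-major layout, adjacent head-dim elements are contiguous, and one full layer spans maxBatchSize * 2 * numKVHeads * maxSequenceLength * headDim elements.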

Public Functions

LinearKVCache() noexcept = default#

Default constructor.

LinearKVCache(CacheConfig const &config, cudaStream_t stream)#

Construct and initialize KV cache.

Allocates device memory for KV cache. Once allocated, memory won’t be reallocated.

Parameters:
  • config – Cache configuration

  • stream – CUDA stream for allocation

Throws:

std::runtime_error – if CUDA operations fail or data type is unsupported

~LinearKVCache() noexcept#

Destructor.

LinearKVCache(LinearKVCache const&) = delete#

Deleted copy constructor to avoid large data copy.

LinearKVCache &operator=(LinearKVCache const&) = delete#

Deleted copy assignment to avoid large data copy.

Returns:

Reference to this

LinearKVCache(LinearKVCache&&) noexcept#

Move constructor.

LinearKVCache &operator=(LinearKVCache&&) noexcept#

Move assignment operator.

Returns:

Reference to this

rt::Tensor getCombinedKVCacheForDecoderLayer(
int32_t decoderLayerIdx
) noexcept#

Get the combined KVCache for the given decoder layer, for the EdgeLLM Attention TRT plugin implementation.

Parameters:

decoderLayerIdx – The index of the decoder layer.

Returns:

A non-owned tensor with shape [batch_size, 2, num_kv_heads, max_sequence_length, head_dim] that points to the combined KVCache memory for the layer.

std::pair<rt::Tensor, rt::Tensor> getSeparateKVCacheForDecoderLayer(
int32_t decoderLayerIdx
) noexcept#

Get the separate K and V caches for the given decoder layer, for TRT native KVCacheUpdate/Attention operations. Returns a pair of tensors: the first is the K cache and the second is the V cache.

Parameters:

decoderLayerIdx – The index of the decoder layer.

Returns:

A pair of tensors: the first is the K cache and the second the V cache, each with shape [batch_size, num_kv_heads, max_sequence_length, head_dim].
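As a sketch of how a layer's combined view relates to its K and V halves, the index arithmetic for the [batch_size, 2, num_kv_heads, max_sequence_length, head_dim] view looks like the following (the dimension values are made up for illustration):

```cpp
#include <cstdint>

// Illustrative per-layer dimensions (assumed, not from the source).
constexpr int64_t kBatch = 2, kKVHeads = 4, kSeqLen = 16, kHeadDim = 8;

// Element offset within one layer's combined view
// [batch_size, 2, num_kv_heads, max_sequence_length, head_dim],
// where kv = 0 addresses K entries and kv = 1 addresses V entries.
constexpr int64_t combinedOffset(int64_t b, int64_t kv, int64_t h,
                                 int64_t s, int64_t d) {
    return (((b * 2 + kv) * kKVHeads + h) * kSeqLen + s) * kHeadDim + d;
}
```

For a given batch slot, the V entries follow that slot's K entries by num_kv_heads * max_sequence_length * head_dim elements, which is why the same memory can be exposed either combined or as separate K/V tensors.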

rt::Tensor getKVCacheBuffer() noexcept#

Get the full KVCache buffer as a non-owned tensor.

rt::Tensor getSSMStateForLayer(int32_t mambaLayerIdx) noexcept#

Get SSM state tensor for a Mamba layer (non-owned view). Shape: [maxBatchSize, mambaNumHeads, mambaHeadDim, ssmStateSize]

rt::Tensor getConvStateForLayer(int32_t mambaLayerIdx) noexcept#

Get conv state tensor for a Mamba layer (non-owned view). Shape: [maxBatchSize, convDim, convKernel]

void clearMambaStates(cudaStream_t stream)#

Zero all SSM and conv state buffers (all layers, all batch slots). Called after warmup inference and before CUDA graph capture to ensure a clean starting state.

std::vector<rt::Tensor> captureSSMStates(
int32_t batchIdx,
cudaStream_t stream
)#

Copy one batch slot’s SSM states into freshly-allocated tensors (one per Mamba layer). Used to snapshot states when saving a system prompt cache entry.

std::vector<rt::Tensor> captureConvStates(
int32_t batchIdx,
cudaStream_t stream
)#

Copy one batch slot’s conv states into freshly-allocated tensors (one per Mamba layer). Used to snapshot states when saving a system prompt cache entry.
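A host-side sketch of the snapshot pattern these two functions describe: for each Mamba layer, copy one batch slot's state into freshly allocated storage. This is illustrative only; the real functions copy device memory on the given CUDA stream:

```cpp
#include <cstdint>
#include <vector>

// Illustrative state representation (the real tensors live on the GPU).
using State = std::vector<float>;

// Snapshot one batch slot's per-layer states into fresh copies,
// mirroring the captureSSMStates/captureConvStates pattern.
std::vector<State> captureStates(
    std::vector<std::vector<State>> const& perLayerSlots, int32_t batchIdx) {
    std::vector<State> snapshot;
    snapshot.reserve(perLayerSlots.size());
    for (auto const& slots : perLayerSlots)
        snapshot.push_back(slots.at(batchIdx));  // deep copy of one slot
    return snapshot;
}
```

The returned vector has one entry per Mamba layer, each independent of the live buffers, which is what makes it suitable for saving a system prompt cache entry.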

void resetForNewSequences(
rt::Tensor const &hostReuseKVCacheLengths,
cudaStream_t stream
)#

Asynchronously reset the KVCache buffer state for a new input context setup.

Parameters:
  • hostReuseKVCacheLengths – The lengths of the KVCache to be reused from precomputed KVCache content.

  • stream – The CUDA stream used to perform GPU memory operations.

Throws:

std::runtime_error – if tensor shape, location or data type are invalid, or if a CUDA operation fails

void commitSequenceLength(
rt::Tensor const &newContextLengths,
cudaStream_t stream
)#

Asynchronously commit the KVCache buffer for a prefill request and record the stored KVCache lengths.

Parameters:
  • newContextLengths – [GPU, Int32]: The context length to commit for the KVCache.

  • stream – The CUDA stream used to perform GPU memory operations.

Throws:

std::runtime_error – if tensor shape, location or data type are invalid

void commitSequenceLength(int32_t increment, cudaStream_t stream)#

Commit the KVCache buffer for a decode request, incrementing the KVCache lengths of active sequences by the given amount.

Parameters:
  • increment – The amount to increment sequence lengths (typically 1 for decode step)

  • stream – The CUDA stream used to perform GPU memory operations.

Throws:

std::runtime_error – if KV cache lengths tensor has wrong location or data type
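The reset/commit lifecycle can be modeled on the host as a sketch of the documented semantics. This is not the implementation (which operates on GPU tensors via a CUDA stream), and the assumption that the prefill overload records the new context lengths outright is mine:

```cpp
#include <cstdint>
#include <vector>

// Host-side model (illustrative only) of how the stored KVCache lengths evolve:
// resetForNewSequences seeds them with reused lengths, the prefill overload of
// commitSequenceLength records new context lengths (assumed semantics), and the
// decode overload increments every active sequence.
struct LengthModel {
    std::vector<int32_t> lengths;

    void resetForNewSequences(std::vector<int32_t> const& reuseLengths) {
        lengths = reuseLengths;  // keep only the reused KVCache prefix lengths
    }
    void commitSequenceLength(std::vector<int32_t> const& newContextLengths) {
        lengths = newContextLengths;  // prefill: record the context lengths
    }
    void commitSequenceLength(int32_t increment) {
        for (auto& len : lengths) len += increment;  // decode: typically +1
    }
};
```

In this model, resetting with reuse lengths {0, 16}, committing a prefill of {10, 24}, then one decode step of 1 leaves the lengths at {11, 25}.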

rt::Tensor &getKVCacheLengths() noexcept#

Get KV cache lengths for active sequences.

Returns:

Reference to KV cache lengths tensor

CacheConfig getConfig() const noexcept#

Get KV cache configuration.

Returns:

Cache configuration

int32_t getActiveBatchSize() const noexcept#

Get active batch size.

Returns:

Number of active sequences

bool getKVCacheAllEmpty() const noexcept#

Get a flag indicating whether the KVCache is empty for all sequences.

Returns:

Flag indicating whether the KVCache is empty for all sequences.

void setActiveBatchSize(int32_t newActiveBatchSize)#

Set the active batch size (for batch eviction).

Parameters:

newActiveBatchSize – New active batch size after eviction

Throws:

std::runtime_error – If newActiveBatchSize is out of valid range [0, maxBatchSize]