Linear KV Cache#

class LinearKVCache#

Static Linear KVCache that holds the KVCache for all decoder layers up to maxSequenceLength. The KVCache implements the following design:

  1. Allocates memory for the maximum supported batch size.

  2. Memory Layout: [numDecoderLayers, maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim] (an indexing sketch follows this list).

  3. Synchronous execution of batch requests: all sequences in the batch run prefill or decode at the same time.
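
The following is a minimal sketch of how a flat element offset can be derived from this layout; the function and variable names are illustrative only and are not part of the API.

    // Illustrative row-major offset into a buffer laid out as
    // [numDecoderLayers, maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim].
    // kv == 0 selects the key plane, kv == 1 the value plane.
    size_t elementOffset(int layer, int batch, int kv, int head, int pos, int dim,
                         int maxBatchSize, int numKVHeads,
                         int maxSequenceLength, int headDim)
    {
        size_t offset = layer;
        offset = offset * maxBatchSize + batch;
        offset = offset * 2 + kv;
        offset = offset * numKVHeads + head;
        offset = offset * maxSequenceLength + pos;
        offset = offset * headDim + dim;
        return offset;
    }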

Public Types

using KVCacheType = half#

KV cache data type (half precision)

Public Functions

LinearKVCache() = default#

Default constructor.

LinearKVCache(CacheConfig const &config, cudaStream_t stream)#

Construct and initialize KV cache.

Allocates device memory for the KV cache. Once allocated, the memory is not reallocated. A construction sketch follows the parameter list.

Parameters:
  • config – Cache configuration

  • stream – CUDA stream for allocation
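
A minimal construction sketch, assuming CacheConfig can be value-initialized and that a valid CUDA stream is available; the configuration members are not shown here because their names are not documented in this section.

    // Sketch: construct the cache once up front (assumption-based example).
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    CacheConfig config{};
    // Populate config with the model dimensions here (maximum batch size,
    // number of decoder layers, KV heads, maximum sequence length, head dim).
    // The actual CacheConfig member names are not assumed in this sketch.

    LinearKVCache kvCache(config, stream);   // allocates device memory once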

~LinearKVCache()#

Destructor.

LinearKVCache(LinearKVCache const&) = delete#

Deleted copy constructor to avoid large data copy.

LinearKVCache &operator=(LinearKVCache const&) = delete#

Deleted copy assignment to avoid large data copy.

LinearKVCache(LinearKVCache&&) noexcept#

Move constructor.

LinearKVCache &operator=(LinearKVCache&&) noexcept#

Move assignment operator.

Returns:

Reference to this

rt::Tensor getKVCacheForDecoderLayer(int32_t decoderLayerIdx)#

Get the KVCache for the given decoder layer.

Parameters:

decoderLayerIdx – The index of the decoder layer.

Returns:

A non-owning tensor object that points to the KVCache memory with shape information.
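
As a sketch, the per-layer view is typically fetched inside the decoder loop; numDecoderLayers is assumed here to be available from the cache configuration.

    // Sketch: obtain the non-owning per-layer KV view for each decoder layer.
    for (int32_t layer = 0; layer < numDecoderLayers; ++layer)
    {
        rt::Tensor layerKV = kvCache.getKVCacheForDecoderLayer(layer);
        // Hand layerKV to the attention computation for this layer (not shown).
    }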

rt::Tensor getKVCacheBuffer()#

Get the full KVCache buffer as a non-owning tensor.

void resetForNewSequences(
rt::Tensor const &hostReuseKVCacheLengths,
cudaStream_t stream
)#

Asynchronously reset the KVCache buffer state for a new set of input contexts.

Parameters:
  • hostReuseKVCacheLengths – The lengths of the KVCache to be reused from precomputed KVCache content.

  • stream – The CUDA stream on which GPU memory operations are performed.

void commitSequenceLength(
rt::Tensor const &newContextLengths,
cudaStream_t stream
)#

Asynchronously commit the KVCache buffer for a prefill request, recording the stored KVCache lengths.

Parameters:
  • newContextLengths – [GPU, Int32]: The context lengths to commit to the KVCache.

  • stream – The CUDA stream on which GPU memory operations are performed.

void commitSequenceLength(int32_t increment, cudaStream_t stream)#

Commit the KVCache buffer for a decode request, incrementing the KVCache lengths of all active sequences by increment (typically 1). A usage sketch follows the parameter list.

Parameters:
  • increment – The amount to increment sequence lengths (typically 1 for decode step)

  • stream – The CUDA stream on which GPU memory operations are performed.
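
A minimal sketch of the call sequence implied by the descriptions above: reset before a new batch, commit once after prefill, then commit with increment 1 after each decode step. runPrefill and runDecodeStep are hypothetical stand-ins for the model forward passes, and hostReuseKVCacheLengths, newContextLengths, and maxNewTokens are assumed to be prepared by the caller.

    // Sketch of one batch lifecycle (assumption-based, not verbatim library usage).
    kvCache.resetForNewSequences(hostReuseKVCacheLengths, stream);

    runPrefill(/* model inputs */);                        // hypothetical
    kvCache.commitSequenceLength(newContextLengths, stream);

    for (int step = 0; step < maxNewTokens; ++step)
    {
        runDecodeStep(/* model inputs */);                 // hypothetical
        kvCache.commitSequenceLength(/*increment=*/1, stream);
    }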

rt::Tensor &getKVCacheLengths()#

Get KV cache lengths for active sequences.

Returns:

Reference to KV cache lengths tensor

CacheConfig getConfig() const#

Get KV cache configuration.

Returns:

Cache configuration

int32_t getActiveBatchSize() const#

Get active batch size.

Returns:

Number of active sequences

bool getKVCacheAllEmpty() const#

Get a flag indicating whether the KVCache is empty for all sequences.

Returns:

Flag indicating whether the KVCache is empty for all sequences.

void setActiveBatchSize(int32_t newActiveBatchSize)#

Set the active batch size (used for batch eviction); see the sketch below.

Parameters:

newActiveBatchSize – New active batch size after eviction
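
A brief sketch of shrinking the batch after sequences finish; countFinishedSequences is a hypothetical helper, and the eviction policy itself lives outside this class.

    // Sketch: reduce the active batch size after eviction (assumption-based).
    int32_t active = kvCache.getActiveBatchSize();
    int32_t finished = countFinishedSequences();   // hypothetical helper
    kvCache.setActiveBatchSize(active - finished);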

Public Static Attributes

static nvinfer1::DataType KVCacheTypeTRT = {nvinfer1::DataType::kHALF}#

TensorRT data type corresponding to KVCacheType (half precision).