KV Cache Utils Kernels

trt_edgellm::kernel::incrementLengthTensor is overloaded:

  • void incrementLengthTensor(rt::Tensor &lengthTensor, int32_t increment, cudaStream_t stream)

  • void incrementLengthTensor(rt::Tensor &lengthTensor, rt::Tensor const &newIncrementTensor, cudaStream_t stream)

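The two incrementLengthTensor overloads above carry no description in this reference. Judging only from their names and parameters, a plausible reading is that the first adds one scalar increment to every entry of a per-sequence length tensor, and the second adds a per-entry increment taken from newIncrementTensor. The CPU sketch below assumes exactly those semantics; incrementLengthsReference is a hypothetical name, std::vector stands in for rt::Tensor, and the real functions run on the given CUDA stream rather than on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for the scalar overload (assumed semantics):
// every entry of the length tensor grows by the same amount.
void incrementLengthsReference(std::vector<std::int32_t>& lengths, std::int32_t increment)
{
    for (std::int32_t& len : lengths)
    {
        len += increment;
    }
}

// Hypothetical CPU reference for the tensor overload (assumed semantics):
// entry i grows by newIncrement[i], so each sequence can advance by a different amount.
void incrementLengthsReference(std::vector<std::int32_t>& lengths, std::vector<std::int32_t> const& newIncrement)
{
    assert(lengths.size() == newIncrement.size());
    for (std::size_t i = 0; i < lengths.size(); ++i)
    {
        lengths[i] += newIncrement[i];
    }
}
```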
void trt_edgellm::kernel::instantiateKVCacheFromTensor(
rt::Tensor &dstKVCacheBuffer,
rt::Tensor const &srcKVCacheTensor,
int32_t batchIdx,
cudaStream_t stream
)

Instantiate the KVCache from a pre-computed KVCache tensor.

Copies a pre-computed KVCache tensor for a single sequence into batch slot batchIdx of the KVCache buffer. This supports KVCache reuse across multiple inference requests, which speeds up the prefill step.

Parameters:
  • dstKVCacheBuffer[inout] The KVCache buffer to be instantiated. Layout: [numDecoderLayers, maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim]

  • srcKVCacheTensor[in] The pre-computed KVCache tensor. Layout: [numDecoderLayers, 2, numKVHeads, sequenceLength, headDim]

  • batchIdx[in] The batch index of the KVCache to be instantiated

  • stream[in] The CUDA stream to be used
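The two layouts above fix where each element lands: the source tensor has no batch axis, so the element at (layer, K/V, head, position, dim) is written to the same coordinates inside batch slot batchIdx of the buffer. The CPU reference below is a minimal sketch of that index mapping, not the library's CUDA kernel. It assumes float elements held in flat std::vector storage in place of rt::Tensor, passes sequenceLength explicitly, assumes the size-2 axis holds K then V, and uses hypothetical names (KVShape, bufferIndex, tensorIndex, instantiateReference).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shape descriptor used only by this sketch.
struct KVShape
{
    std::size_t numLayers;
    std::size_t maxBatchSize;
    std::size_t numKVHeads;
    std::size_t maxSeqLen;
    std::size_t headDim;
};

// Flat index into the buffer layout
// [numDecoderLayers, maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim].
std::size_t bufferIndex(KVShape const& s, std::size_t layer, std::size_t batch, std::size_t kv,
    std::size_t head, std::size_t pos, std::size_t d)
{
    std::size_t idx = layer;
    idx = idx * s.maxBatchSize + batch;
    idx = idx * 2 + kv; // size-2 axis, assumed to be 0 = K, 1 = V
    idx = idx * s.numKVHeads + head;
    idx = idx * s.maxSeqLen + pos;
    return idx * s.headDim + d;
}

// Flat index into the tensor layout
// [numDecoderLayers, 2, numKVHeads, sequenceLength, headDim].
std::size_t tensorIndex(KVShape const& s, std::size_t seqLen, std::size_t layer, std::size_t kv,
    std::size_t head, std::size_t pos, std::size_t d)
{
    std::size_t idx = layer;
    idx = idx * 2 + kv;
    idx = idx * s.numKVHeads + head;
    idx = idx * seqLen + pos;
    return idx * s.headDim + d;
}

// CPU reference for instantiateKVCacheFromTensor: each source element is written
// to the same coordinates inside batch slot batchIdx. Other slots, and positions
// >= seqLen of this slot, are left unmodified.
void instantiateReference(std::vector<float>& dstBuffer, std::vector<float> const& srcTensor,
    KVShape const& s, std::size_t seqLen, std::size_t batchIdx)
{
    assert(batchIdx < s.maxBatchSize && seqLen <= s.maxSeqLen);
    assert(srcTensor.size() == s.numLayers * 2 * s.numKVHeads * seqLen * s.headDim);
    assert(dstBuffer.size() == s.numLayers * s.maxBatchSize * 2 * s.numKVHeads * s.maxSeqLen * s.headDim);
    for (std::size_t layer = 0; layer < s.numLayers; ++layer)
        for (std::size_t kv = 0; kv < 2; ++kv)
            for (std::size_t head = 0; head < s.numKVHeads; ++head)
                for (std::size_t pos = 0; pos < seqLen; ++pos)
                    for (std::size_t d = 0; d < s.headDim; ++d)
                        dstBuffer[bufferIndex(s, layer, batchIdx, kv, head, pos, d)]
                            = srcTensor[tensorIndex(s, seqLen, layer, kv, head, pos, d)];
}
```

Because the destination is marked [inout], the sketch leaves every other batch slot untouched, which is what allows one request's cache to be restored without disturbing the others.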

void trt_edgellm::kernel::saveKVCacheIntoTensor(
rt::Tensor &dstKVCacheTensor,
rt::Tensor const &srcKVCacheBuffer,
int32_t batchIdx,
cudaStream_t stream
)

Save the KVCache into a tensor.

Copies the KVCache of batch slot batchIdx from the KVCache buffer into a standalone tensor, so that later inference requests can reuse it to speed up the prefill step. The sequenceLength dimension of dstKVCacheTensor must be set to the length of the cached sequence being saved from srcKVCacheBuffer.

Parameters:
  • dstKVCacheTensor[out] The tensor that receives the saved KVCache. Layout: [numDecoderLayers, 2, numKVHeads, sequenceLength, headDim]

  • srcKVCacheBuffer[in] The KVCache buffer to save from. Layout: [numDecoderLayers, maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim]

  • batchIdx[in] The batch index of the KVCache to be saved

  • stream[in] The CUDA stream to be used
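For a fixed decoder layer, K/V index, and head, the first sequenceLength × headDim elements of a batch slot are contiguous in the buffer layout, and the matching block is contiguous in the tensor layout too. A save can therefore be expressed as numDecoderLayers × 2 × numKVHeads contiguous chunk copies, with a source stride of maxSequenceLength × headDim between heads. The CPU sketch below illustrates this under stated assumptions: float elements in flat std::vector storage stand in for rt::Tensor, sequenceLength is inferred from the destination size (mirroring the requirement that dstKVCacheTensor carry the sequence length), and KVDims and saveReference are hypothetical names rather than library API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical shape descriptor for this CPU sketch (not part of the library).
struct KVDims
{
    std::size_t numLayers;
    std::size_t maxBatchSize;
    std::size_t numKVHeads;
    std::size_t maxSeqLen;
    std::size_t headDim;
};

// CPU reference for saveKVCacheIntoTensor using chunked copies.
// seqLen is derived from the size of the destination tensor.
void saveReference(std::vector<float>& dstTensor, std::vector<float> const& srcBuffer,
    KVDims const& s, std::size_t batchIdx)
{
    std::size_t const numKVHeadRows = 2 * s.numKVHeads; // (K/V, head) pairs per layer
    assert(batchIdx < s.maxBatchSize);
    assert(dstTensor.size() % (s.numLayers * numKVHeadRows * s.headDim) == 0);
    std::size_t const seqLen = dstTensor.size() / (s.numLayers * numKVHeadRows * s.headDim);
    assert(seqLen <= s.maxSeqLen);

    std::size_t const chunk = seqLen * s.headDim;          // elements copied per (layer, K/V, head)
    std::size_t const srcPitch = s.maxSeqLen * s.headDim;  // stride between heads in the buffer
    for (std::size_t layer = 0; layer < s.numLayers; ++layer)
    {
        for (std::size_t kvHead = 0; kvHead < numKVHeadRows; ++kvHead)
        {
            // (K/V, head) are adjacent axes in both layouts, so they are walked as one index.
            std::size_t const srcBase = ((layer * s.maxBatchSize + batchIdx) * numKVHeadRows + kvHead) * srcPitch;
            std::size_t const dstBase = (layer * numKVHeadRows + kvHead) * chunk;
            std::copy_n(srcBuffer.begin() + static_cast<std::ptrdiff_t>(srcBase), chunk,
                dstTensor.begin() + static_cast<std::ptrdiff_t>(dstBase));
        }
    }
}
```

On a GPU, the same strided pattern could be issued as one cudaMemcpy2DAsync per decoder layer on the given stream (height 2 × numKVHeads, source pitch maxSequenceLength × headDim elements); whether the library's kernel works this way is not stated in this reference.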