KV Cache Utils Kernels

void trt_edgellm::kernel::incrementLengthTensor(
rt::Tensor &lengthTensor,
int32_t increment,
cudaStream_t stream
)

void trt_edgellm::kernel::incrementLengthTensor(
rt::Tensor &lengthTensor,
rt::Tensor const &newIncrementTensor,
cudaStream_t stream
)

Two overloads are available: the increment is passed either as a scalar (increment) or as a tensor (newIncrementTensor).

void trt_edgellm::kernel::instantiateKVCacheLayerFromTensor(
rt::Tensor &dstKVCacheLayer,
rt::Tensor const &srcKVCacheTensor,
int32_t batchIdx,
cudaStream_t stream
)

Single-layer variant: instantiate KV cache for one layer from a saved tensor.

Parameters:
  • dstKVCacheLayer [inout] – Shape [maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim]

  • srcKVCacheTensor [in] – Shape [2, numKVHeads, sequenceLength, headDim]

  • batchIdx [in] – Target batch index in the destination buffer

  • stream [in] – CUDA stream
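
The two shapes above imply an index mapping between the compact saved tensor and one batch slot of the cache layer. The following is a minimal host-side sketch of that mapping, assuming a dense row-major layout, float elements, and that the copied tokens start at position 0 of the slot. The names cacheOffset and restoreLayerReference are hypothetical and are not part of the library; the real kernel runs on the GPU and may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference, not the library kernel.
// Flat element offset into a cache layer laid out as
// [maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim] (assumed row-major).
inline std::size_t cacheOffset(int32_t batchIdx, int32_t kv, int32_t head, int32_t token,
                               int32_t numKVHeads, int32_t maxSequenceLength, int32_t headDim)
{
    std::size_t idx = static_cast<std::size_t>(batchIdx);
    idx = idx * 2 + kv;
    idx = idx * numKVHeads + head;
    idx = idx * maxSequenceLength + token;
    return idx * headDim;
}

// Copies a saved tensor [2, numKVHeads, sequenceLength, headDim] into slot batchIdx
// of the cache layer. Tokens beyond sequenceLength in the slot are left untouched.
void restoreLayerReference(std::vector<float>& dstCacheLayer, std::vector<float> const& srcSaved,
                           int32_t batchIdx, int32_t numKVHeads, int32_t sequenceLength,
                           int32_t maxSequenceLength, int32_t headDim)
{
    for (int32_t kv = 0; kv < 2; ++kv)
    {
        for (int32_t head = 0; head < numKVHeads; ++head)
        {
            for (int32_t token = 0; token < sequenceLength; ++token)
            {
                std::size_t const srcRow = (static_cast<std::size_t>(kv) * numKVHeads + head) * sequenceLength + token;
                std::size_t const src = srcRow * headDim;
                std::size_t const dst
                    = cacheOffset(batchIdx, kv, head, token, numKVHeads, maxSequenceLength, headDim);
                for (int32_t d = 0; d < headDim; ++d)
                {
                    dstCacheLayer[dst + d] = srcSaved[src + d];
                }
            }
        }
    }
}
```

A reference like this can be useful for checking a GPU result element by element in a unit test.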

void trt_edgellm::kernel::saveKVCacheLayerIntoTensor(
rt::Tensor &dstKVCacheTensor,
rt::Tensor const &srcKVCacheLayer,
int32_t batchIdx,
cudaStream_t stream
)

Single-layer variant: save KV cache for one layer into a tensor.

Parameters:
  • dstKVCacheTensor [out] – Shape [2, numKVHeads, sequenceLength, headDim]

  • srcKVCacheLayer [in] – Shape [maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim]

  • batchIdx [in] – Source batch index in the buffer

  • stream [in] – CUDA stream
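
In the save direction, the first sequenceLength tokens of each (K/V, head) pair form one contiguous run in a row-major cache slot, so the gather can be written as one block copy per pair. The sketch below illustrates this on the host, assuming a dense row-major layout, float elements, and tokens starting at position 0. saveLayerReference is a hypothetical name and does not describe how the library kernel is implemented.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference, not the library kernel.
// Gathers slot batchIdx of a cache layer [maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim]
// into a compact tensor [2, numKVHeads, sequenceLength, headDim].
std::vector<float> saveLayerReference(std::vector<float> const& srcCacheLayer, int32_t batchIdx,
                                      int32_t numKVHeads, int32_t sequenceLength,
                                      int32_t maxSequenceLength, int32_t headDim)
{
    // Elements copied per (kv, head) pair, and the stride between pairs in the cache.
    std::size_t const rowElems = static_cast<std::size_t>(sequenceLength) * headDim;
    std::size_t const rowStride = static_cast<std::size_t>(maxSequenceLength) * headDim;
    // Offset of the first element that belongs to slot batchIdx.
    std::size_t const slotBase = static_cast<std::size_t>(batchIdx) * 2 * numKVHeads * rowStride;

    std::vector<float> saved(2 * static_cast<std::size_t>(numKVHeads) * rowElems);
    for (int32_t kvHead = 0; kvHead < 2 * numKVHeads; ++kvHead)
    {
        // The first sequenceLength tokens of each pair are contiguous in the cache.
        std::copy_n(srcCacheLayer.data() + slotBase + kvHead * rowStride, rowElems,
                    saved.data() + kvHead * rowElems);
    }
    return saved;
}
```

The saved tensor holds sequenceLength tokens per head rather than maxSequenceLength, which is why it can be much smaller than one cache slot.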

void trt_edgellm::kernel::saveKVCacheBatched(
KVLayerInfo const *srcLayerInfos,
KVLayerInfo const *dstLayerInfos,
int32_t numLayers,
int32_t headDim,
int32_t maxKVHeads,
int32_t maxBatchSize,
int32_t batchIdx,
int32_t sequenceLength,
cudaStream_t stream
)

Batched save: copy multiple layers’ KV cache into per-layer tensors in a single launch. All layers must share the same headDim. dstLayerInfos[i].data points to a [2, numKVHeads_i, sequenceLength, headDim] tensor, where numKVHeads_i is the number of KV heads of layer i.

Parameters:
  • srcLayerInfos – GPU array of numLayers entries describing the source cache buffers

  • dstLayerInfos – GPU array of numLayers entries describing the destination saved tensors

  • numLayers – Number of layers in this batch

  • headDim – Head dimension (same for all layers)

  • maxKVHeads – Maximum numKVHeads across all layers (for grid sizing)

  • maxBatchSize – Max batch size of the source cache

  • batchIdx – Batch index to save from

  • sequenceLength – Number of tokens to copy

  • stream – CUDA stream
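
The role of maxKVHeads can be illustrated with a host-side reference in which the (layer, head) loops stand in for a launch grid, and heads beyond a layer's own count are skipped. This is only a sketch under stated assumptions: LayerInfoSketch is a hypothetical stand-in for KVLayerInfo (only the data member is named in the description above, so numKVHeads and maxSequenceLength are assumed fields), elements are float, and the layout is dense row-major. The actual kernel may organise its work differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for KVLayerInfo. Only `data` is named in the text above;
// the other two fields are assumptions made for this sketch.
struct LayerInfoSketch
{
    float* data;               // cache buffer or saved tensor
    int32_t numKVHeads;        // assumed field
    int32_t maxSequenceLength; // assumed field, meaningful for cache buffers only
};

// CPU reference of a batched save. Returns false if batchIdx is out of range.
bool saveBatchedReference(LayerInfoSketch const* srcLayerInfos, LayerInfoSketch const* dstLayerInfos,
                          int32_t numLayers, int32_t headDim, int32_t maxKVHeads, int32_t maxBatchSize,
                          int32_t batchIdx, int32_t sequenceLength)
{
    if (batchIdx < 0 || batchIdx >= maxBatchSize)
    {
        return false;
    }
    std::size_t const rowElems = static_cast<std::size_t>(sequenceLength) * headDim;
    for (int32_t layer = 0; layer < numLayers; ++layer)
    {
        LayerInfoSketch const& src = srcLayerInfos[layer];
        LayerInfoSketch const& dst = dstLayerInfos[layer];
        std::size_t const rowStride = static_cast<std::size_t>(src.maxSequenceLength) * headDim;
        for (int32_t kv = 0; kv < 2; ++kv)
        {
            for (int32_t head = 0; head < maxKVHeads; ++head)
            {
                if (head >= src.numKVHeads)
                {
                    continue; // this layer has fewer heads than the shared grid covers
                }
                std::size_t const srcRow = (static_cast<std::size_t>(batchIdx) * 2 + kv) * src.numKVHeads + head;
                std::size_t const dstRow = static_cast<std::size_t>(kv) * src.numKVHeads + head;
                std::copy_n(src.data + srcRow * rowStride, rowElems, dst.data + dstRow * rowElems);
            }
        }
    }
    return true;
}
```

Sizing the iteration space by the largest head count lets layers with different numKVHeads share one pass, at the cost of some skipped iterations for the smaller layers.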

void trt_edgellm::kernel::instantiateKVCacheBatched(
KVLayerInfo const *dstLayerInfos,
KVLayerInfo const *srcLayerInfos,
int32_t numLayers,
int32_t headDim,
int32_t maxKVHeads,
int32_t maxBatchSize,
int32_t batchIdx,
int32_t sequenceLength,
cudaStream_t stream
)

Batched restore: load multiple layers’ KV cache from per-layer tensors in a single launch. All layers must share the same headDim. srcLayerInfos[i].data points to a [2, numKVHeads_i, sequenceLength, headDim] tensor, where numKVHeads_i is the number of KV heads of layer i.

Parameters:
  • dstLayerInfos – GPU array of numLayers entries describing the destination cache buffers

  • srcLayerInfos – GPU array of numLayers entries describing the source saved tensors

  • numLayers – Number of layers in this batch

  • headDim – Head dimension (same for all layers)

  • maxKVHeads – Maximum numKVHeads across all layers (for grid sizing)

  • maxBatchSize – Max batch size of the destination cache

  • batchIdx – Batch index to restore into

  • sequenceLength – Number of tokens to copy

  • stream – CUDA stream
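
Before calling either batched function, the caller needs a headDim common to all layers and the maximum numKVHeads across them. The sketch below shows one way such host-side preparation might look, including the element count of each per-layer saved tensor. LayerShapeSketch, BatchPlanSketch, and planBatch are hypothetical names introduced for illustration and are not part of the library.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-layer description used only in this sketch.
struct LayerShapeSketch
{
    int32_t numKVHeads;
    int32_t headDim;
};

// Values a caller would pass on to a batched save or restore.
struct BatchPlanSketch
{
    int32_t headDim = 0;
    int32_t maxKVHeads = 0;
    // Element count of each saved tensor: 2 * numKVHeads_i * sequenceLength * headDim.
    std::vector<std::size_t> savedElemCounts;
};

// Fills `plan` and returns true if all layers can share one batched launch.
// Returns false for an empty layer list or when the layers do not share one headDim.
bool planBatch(std::vector<LayerShapeSketch> const& layers, int32_t sequenceLength, BatchPlanSketch& plan)
{
    if (layers.empty())
    {
        return false;
    }
    plan.headDim = layers.front().headDim;
    plan.maxKVHeads = 0;
    plan.savedElemCounts.clear();
    for (LayerShapeSketch const& layer : layers)
    {
        if (layer.headDim != plan.headDim)
        {
            return false; // mixed head dimensions would need separate launches
        }
        plan.maxKVHeads = std::max(plan.maxKVHeads, layer.numKVHeads);
        plan.savedElemCounts.push_back(
            static_cast<std::size_t>(2) * layer.numKVHeads * sequenceLength * plan.headDim);
    }
    return true;
}
```

Layers that fail the headDim check could be grouped by head dimension and handled with one batched call per group.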