KV Cache Utils Kernels
- void trt_edgellm::kernel::incrementLengthTensor(rt::Tensor &lengthTensor, int32_t increment, cudaStream_t stream)
- void trt_edgellm::kernel::incrementLengthTensor(rt::Tensor &lengthTensor, rt::Tensor const &newIncrementTensor, cudaStream_t stream)
- void trt_edgellm::kernel::instantiateKVCacheLayerFromTensor(rt::Tensor &dstKVCacheLayer, rt::Tensor const &srcKVCacheTensor, int32_t batchIdx, cudaStream_t stream)
Single-layer variant: instantiate KV cache for one layer from a saved tensor.
- Parameters:
dstKVCacheLayer – [inout] [maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim]
srcKVCacheTensor – [in] [2, numKVHeads, sequenceLength, headDim]
batchIdx – [in] Target batch index in the destination buffer
stream – [in] CUDA stream
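The shapes above imply a strided copy: the saved tensor holds only `sequenceLength` tokens per head, while each batch slot in the cache reserves `maxSequenceLength`. The following is a minimal host-side sketch of that index mapping, not the library's CUDA kernel. The name `restoreLayerReference`, the `float` element type, and the row-major layout are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference; not part of the TensorRT Edge-LLM API.
// dst: [maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim] (row-major assumed)
// src: [2, numKVHeads, sequenceLength, headDim]                  (row-major assumed)
void restoreLayerReference(std::vector<float>& dst, std::vector<float> const& src,
                           int32_t batchIdx, int32_t maxBatchSize, int32_t numKVHeads,
                           int32_t maxSequenceLength, int32_t sequenceLength, int32_t headDim)
{
    assert(batchIdx >= 0 && batchIdx < maxBatchSize);
    assert(sequenceLength <= maxSequenceLength);
    for (int32_t kv = 0; kv < 2; ++kv)              // 0 = keys, 1 = values
        for (int32_t h = 0; h < numKVHeads; ++h)
            for (int32_t t = 0; t < sequenceLength; ++t)
                for (int32_t d = 0; d < headDim; ++d)
                {
                    // Source is densely packed with sequenceLength tokens per head.
                    size_t srcIdx
                        = ((static_cast<size_t>(kv) * numKVHeads + h) * sequenceLength + t) * headDim + d;
                    // Destination strides by maxSequenceLength inside slot batchIdx.
                    size_t dstIdx
                        = (((static_cast<size_t>(batchIdx) * 2 + kv) * numKVHeads + h) * maxSequenceLength + t)
                            * headDim
                        + d;
                    dst[dstIdx] = src[srcIdx];
                }
}
```

In this sketch, tokens from `sequenceLength` up to `maxSequenceLength` in the destination slot are left untouched, as are all other batch slots.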
- void trt_edgellm::kernel::saveKVCacheLayerIntoTensor(rt::Tensor &dstKVCacheTensor, rt::Tensor const &srcKVCacheLayer, int32_t batchIdx, cudaStream_t stream)
Single-layer variant: save KV cache for one layer into a tensor.
- Parameters:
dstKVCacheTensor – [out] [2, numKVHeads, sequenceLength, headDim]
srcKVCacheLayer – [in] [maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim]
batchIdx – [in] Source batch index in the buffer
stream – [in] CUDA stream
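Saving is the inverse gather: the first `sequenceLength` tokens of every head in slot `batchIdx` are packed into a dense tensor. Below is a rough host-side sketch under the same assumptions (row-major layout, `float` elements). `cacheOffset` and `saveLayerReference` are hypothetical helper names, not library functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: flat offset of (kv, head, token, dim) inside batch slot
// batchIdx of a [maxBatchSize, 2, numKVHeads, maxSequenceLength, headDim] buffer.
size_t cacheOffset(int32_t batchIdx, int32_t kv, int32_t head, int32_t token, int32_t dim,
                   int32_t numKVHeads, int32_t maxSequenceLength, int32_t headDim)
{
    return (((static_cast<size_t>(batchIdx) * 2 + kv) * numKVHeads + head) * maxSequenceLength + token)
            * headDim
        + dim;
}

// Hypothetical host-side reference; returns a dense
// [2, numKVHeads, sequenceLength, headDim] copy of one batch slot.
std::vector<float> saveLayerReference(std::vector<float> const& srcCache, int32_t batchIdx,
                                      int32_t numKVHeads, int32_t maxSequenceLength,
                                      int32_t sequenceLength, int32_t headDim)
{
    assert(batchIdx >= 0);
    assert(sequenceLength <= maxSequenceLength);
    std::vector<float> dst(static_cast<size_t>(2) * numKVHeads * sequenceLength * headDim);
    size_t out = 0; // the loop order below matches the dense row-major output layout
    for (int32_t kv = 0; kv < 2; ++kv)
        for (int32_t h = 0; h < numKVHeads; ++h)
            for (int32_t t = 0; t < sequenceLength; ++t)
                for (int32_t d = 0; d < headDim; ++d)
                    dst[out++] = srcCache[cacheOffset(batchIdx, kv, h, t, d, numKVHeads, maxSequenceLength, headDim)];
    return dst;
}
```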
- void trt_edgellm::kernel::saveKVCacheBatched(KVLayerInfo const *srcLayerInfos, KVLayerInfo const *dstLayerInfos, int32_t numLayers, int32_t headDim, int32_t maxKVHeads, int32_t maxBatchSize, int32_t batchIdx, int32_t sequenceLength, cudaStream_t stream)
Batched save: copy multiple layers’ KV cache into per-layer tensors in a single launch. All layers must share the same headDim. dstLayerInfos[i].data points to a [2, numKVHeads_i, seqLen, headDim] tensor.
- Parameters:
srcLayerInfos – [numLayers] GPU array — source cache buffers
dstLayerInfos – [numLayers] GPU array — destination saved tensors
numLayers – Number of layers in this batch
headDim – Head dimension (same for all layers)
maxKVHeads – Maximum numKVHeads across all layers (for grid sizing)
maxBatchSize – Max batch size of the source cache
batchIdx – Batch index to save from
sequenceLength – Number of tokens to copy
stream – CUDA stream
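One way to picture the role of `maxKVHeads`: the launch covers `maxKVHeads` heads for every layer, and layers with fewer heads skip the surplus. The sketch below emulates that on the host. `LayerInfoSketch` is a hypothetical stand-in for `KVLayerInfo`, whose real fields are not shown on this page. Only a data pointer and a per-layer head count are implied by the description above; the `maxSequenceLength` field, the `float` element type, and the row-major layout are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for KVLayerInfo (the real struct may differ).
struct LayerInfoSketch
{
    float* data;               // base pointer of this layer's buffer
    int32_t numKVHeads;        // numKVHeads_i for this layer
    int32_t maxSequenceLength; // assumed token capacity of a cache buffer (unused for saved tensors)
};

// Hypothetical host-side reference of a batched save; not the library's kernel.
void saveBatchedReference(LayerInfoSketch const* src, LayerInfoSketch const* dst, int32_t numLayers,
                          int32_t headDim, int32_t maxKVHeads, int32_t maxBatchSize, int32_t batchIdx,
                          int32_t sequenceLength)
{
    assert(batchIdx >= 0 && batchIdx < maxBatchSize);
    for (int32_t layer = 0; layer < numLayers; ++layer)
        for (int32_t h = 0; h < maxKVHeads; ++h) // iterate as if every layer had maxKVHeads heads
        {
            int32_t const heads = src[layer].numKVHeads;
            if (h >= heads)
                continue; // this layer has fewer heads; nothing to copy
            int32_t const maxSeq = src[layer].maxSequenceLength;
            for (int32_t kv = 0; kv < 2; ++kv)
                for (int32_t t = 0; t < sequenceLength; ++t)
                    for (int32_t d = 0; d < headDim; ++d)
                    {
                        size_t s = (((static_cast<size_t>(batchIdx) * 2 + kv) * heads + h) * maxSeq + t) * headDim + d;
                        size_t o = ((static_cast<size_t>(kv) * heads + h) * sequenceLength + t) * headDim + d;
                        dst[layer].data[o] = src[layer].data[s];
                    }
        }
}
```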
- void trt_edgellm::kernel::instantiateKVCacheBatched(KVLayerInfo const *dstLayerInfos, KVLayerInfo const *srcLayerInfos, int32_t numLayers, int32_t headDim, int32_t maxKVHeads, int32_t maxBatchSize, int32_t batchIdx, int32_t sequenceLength, cudaStream_t stream)
Batched restore: load multiple layers’ KV cache from per-layer tensors in a single launch. All layers must share the same headDim. srcLayerInfos[i].data points to a [2, numKVHeads_i, seqLen, headDim] tensor.
- Parameters:
dstLayerInfos – [numLayers] GPU array — destination cache buffers
srcLayerInfos – [numLayers] GPU array — source saved tensors
numLayers – Number of layers in this batch
headDim – Head dimension (same for all layers)
maxKVHeads – Maximum numKVHeads across all layers (for grid sizing)
maxBatchSize – Max batch size of the destination cache
batchIdx – Batch index to restore into
sequenceLength – Number of tokens to copy
stream – CUDA stream
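A single launch over all layers can be modelled as one flat index per element, decomposed into (layer, K/V, head, token, dim). The host-side sketch below illustrates that decomposition for the restore direction; it is one plausible scheme, not a description of the actual kernel. `LayerInfoSketch` is a hypothetical stand-in for `KVLayerInfo`, and its `maxSequenceLength` field, the `float` element type, and the row-major layout are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for KVLayerInfo (the real struct may differ).
struct LayerInfoSketch
{
    float* data;               // base pointer of this layer's buffer
    int32_t numKVHeads;        // numKVHeads_i for this layer
    int32_t maxSequenceLength; // assumed token capacity of a cache buffer (unused for saved tensors)
};

// Hypothetical host-side reference of a batched restore; not the library's kernel.
void restoreBatchedReference(LayerInfoSketch const* dst, LayerInfoSketch const* src, int32_t numLayers,
                             int32_t headDim, int32_t maxKVHeads, int32_t maxBatchSize, int32_t batchIdx,
                             int32_t sequenceLength)
{
    assert(batchIdx >= 0 && batchIdx < maxBatchSize);
    // One "virtual thread" per (layer, kv, head, token, dim), sized with maxKVHeads.
    size_t const perLayer = static_cast<size_t>(2) * maxKVHeads * sequenceLength * headDim;
    for (size_t idx = 0; idx < perLayer * numLayers; ++idx)
    {
        size_t rem = idx;
        int32_t const d = static_cast<int32_t>(rem % headDim);
        rem /= headDim;
        int32_t const t = static_cast<int32_t>(rem % sequenceLength);
        rem /= sequenceLength;
        int32_t const h = static_cast<int32_t>(rem % maxKVHeads);
        rem /= maxKVHeads;
        int32_t const kv = static_cast<int32_t>(rem % 2);
        int32_t const layer = static_cast<int32_t>(rem / 2);

        int32_t const heads = dst[layer].numKVHeads;
        if (h >= heads)
            continue; // surplus index for a layer with fewer KV heads
        int32_t const maxSeq = dst[layer].maxSequenceLength;
        size_t s = ((static_cast<size_t>(kv) * heads + h) * sequenceLength + t) * headDim + d;
        size_t o = (((static_cast<size_t>(batchIdx) * 2 + kv) * heads + h) * maxSeq + t) * headDim + d;
        dst[layer].data[o] = src[layer].data[s];
    }
}
```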