Batch Evict Kernels
struct KVLayerInfo
Per-layer KV cache metadata for batched kernel operations.
void trt_edgellm::kernel::compactKVCacheSingleLayer(
    rt::Tensor &kvCacheLayer,
    rt::Tensor const &batchMapping,
    rt::Tensor const &kvCacheLengths,
    rt::Tensor &dstKVCacheLengths,
    int32_t oldActiveBatch,
    int32_t newActiveBatch,
    bool updateLengths,
    cudaStream_t stream)
Compact a single layer’s KV cache by removing evicted batches.
Single-layer variant of compactKVCache for per-layer heterogeneous KV cache.
- Parameters:
kvCacheLayer – [maxBatch, 2, numKVHeads, maxSeq, headDim] single-layer buffer (in/out)
batchMapping – [oldActiveBatch] GPU tensor, mapping[i] = newBatchIdx or -1 (evict)
kvCacheLengths – [maxBatch] GPU tensor of sequence lengths (const input)
dstKVCacheLengths – [maxBatch] GPU tensor for compacted lengths (output, may alias kvCacheLengths)
oldActiveBatch – Number of batches before eviction
newActiveBatch – Number of batches after eviction
updateLengths – If true, update dstKVCacheLengths (only first layer should do this)
stream – CUDA stream
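The batchMapping convention above (mapping[i] = newBatchIdx for survivors, -1 for evicted slots) can be illustrated with a host-side sketch. The helpers below, `buildBatchMapping` and `compactLengths`, are hypothetical illustration code, not part of the library; the real kernel performs the lengths update on the GPU when updateLengths is true.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: build a batchMapping array for a set of evicted
// batch indices. Surviving batch i receives its compacted index; evicted
// batches receive -1.
std::vector<int32_t> buildBatchMapping(int32_t oldActiveBatch,
                                       std::vector<int32_t> const& evicted)
{
    std::vector<int32_t> mapping(oldActiveBatch);
    int32_t next = 0;
    for (int32_t i = 0; i < oldActiveBatch; ++i)
    {
        bool isEvicted
            = std::find(evicted.begin(), evicted.end(), i) != evicted.end();
        mapping[i] = isEvicted ? -1 : next++;
    }
    return mapping;
}

// Mirror of the dstKVCacheLengths update: dst[mapping[i]] = src[i]
// for every surviving batch.
std::vector<int32_t> compactLengths(std::vector<int32_t> const& lengths,
                                    std::vector<int32_t> const& mapping,
                                    int32_t newActiveBatch)
{
    std::vector<int32_t> dst(newActiveBatch);
    for (size_t i = 0; i < mapping.size(); ++i)
    {
        if (mapping[i] >= 0)
        {
            dst[mapping[i]] = lengths[i];
        }
    }
    return dst;
}
```

For example, evicting batch 1 out of four active batches yields the mapping {0, -1, 1, 2}, and lengths {10, 20, 30, 40} compact to {10, 30, 40}.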
void trt_edgellm::kernel::compactTensorBatch(
    rt::Tensor const &src,
    rt::Tensor const &batchMapping,
    rt::Tensor &dst,
    int32_t oldActiveBatch,
    int32_t newActiveBatch,
    cudaStream_t stream)
Generic tensor compaction along batch dimension.
This kernel compacts a tensor by removing evicted batches.
Note
Assumes batch dimension is the first dimension (dim 0)
Note
For in-place operation, pass the same tensor as both src and dst
- Parameters:
src – Source tensor (const input)
batchMapping – [oldActiveBatch] GPU tensor (const input), mapping[i] = newBatchIdx or -1
dst – Destination tensor (output, can be same as src for in-place operation)
oldActiveBatch – Number of batches before eviction
newActiveBatch – Number of batches after eviction
stream – CUDA stream
- Throws:
std::runtime_error – if tensors are not located on the GPU, or tensor shapes are invalid
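The compaction this kernel performs along dim 0 can be sketched on the host as a row copy driven by batchMapping (a hypothetical `compactRows` helper for illustration only; the real kernel operates on GPU tensors):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Host-side sketch of batch-dimension compaction: row i of src is copied
// to row mapping[i] of dst whenever mapping[i] != -1. rowElems is the
// number of elements in one batch row (product of all non-batch dims).
void compactRows(float const* src, float* dst, int32_t const* mapping,
                 int32_t oldActiveBatch, size_t rowElems)
{
    for (int32_t i = 0; i < oldActiveBatch; ++i)
    {
        int32_t j = mapping[i];
        if (j >= 0)
        {
            std::copy(src + i * rowElems, src + (i + 1) * rowElems,
                      dst + static_cast<size_t>(j) * rowElems);
        }
    }
}
```

Because survivors keep their relative order, mapping[i] <= i for every surviving batch, so processing rows in increasing i never overwrites a row that has not been copied yet; this is why src and dst may alias for in-place operation.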
void trt_edgellm::kernel::compactKVCacheBatched(
    KVLayerInfo const *layerInfos,
    rt::Tensor const &batchMapping,
    rt::Tensor const &kvCacheLengths,
    int32_t numLayers,
    int32_t headDim,
    nvinfer1::DataType kvCacheType,
    int32_t maxKVHeads,
    int32_t maxBatchSize,
    int32_t oldActiveBatch,
    int32_t newActiveBatch,
    cudaStream_t stream)
Batched compaction across multiple layers in a single kernel launch.
All layers in the batch must share the same headDim (template-selected). Layers may have different numKVHeads and maxSeqLen.
- Parameters:
layerInfos – [numLayers] GPU array of KVLayerInfo
batchMapping – [oldActiveBatch] GPU tensor
kvCacheLengths – [maxBatch] GPU tensor of sequence lengths
numLayers – Number of layers in this batch
headDim – Head dimension (same for all layers in batch)
kvCacheType – KV cache storage dtype (kHALF or kFP8); controls element size for stride calculation
maxKVHeads – Maximum numKVHeads across all layers (for grid sizing)
maxBatchSize – Max batch size
oldActiveBatch – Batches before eviction
newActiveBatch – Batches after eviction
stream – CUDA stream