Batch Evict Kernels#

struct KVLayerInfo#

Per-layer KV cache metadata for batched kernel operations.

Public Members

void *data#

Pointer to this layer's KV buffer [maxBatch, 2, numKVHeads, maxSeqLen, headDim].

int32_t numKVHeads#

Number of KV heads for this layer.

int32_t maxSeqLen#

Max sequence length for this layer.

void trt_edgellm::kernel::compactKVCacheSingleLayer(
    rt::Tensor &kvCacheLayer,
    rt::Tensor const &batchMapping,
    rt::Tensor const &kvCacheLengths,
    rt::Tensor &dstKVCacheLengths,
    int32_t oldActiveBatch,
    int32_t newActiveBatch,
    bool updateLengths,
    cudaStream_t stream
)#

Compact a single layer’s KV cache by removing evicted batches.

Single-layer variant of compactKVCache for heterogeneous per-layer KV caches, where layers may differ in numKVHeads and maxSeqLen.

Parameters:
  • kvCacheLayer – [maxBatch, 2, numKVHeads, maxSeq, headDim] single-layer buffer (in/out)

  • batchMapping – [oldActiveBatch] GPU tensor, mapping[i] = newBatchIdx or -1 (evict)

  • kvCacheLengths – [maxBatch] GPU tensor of sequence lengths (const input)

  • dstKVCacheLengths – [maxBatch] GPU tensor for compacted lengths (output, may alias kvCacheLengths)

  • oldActiveBatch – Number of batches before eviction

  • newActiveBatch – Number of batches after eviction

  • updateLengths – If true, update dstKVCacheLengths (only the first layer should do this)

  • stream – CUDA stream

void trt_edgellm::kernel::compactTensorBatch(
    rt::Tensor const &src,
    rt::Tensor const &batchMapping,
    rt::Tensor &dst,
    int32_t oldActiveBatch,
    int32_t newActiveBatch,
    cudaStream_t stream
)#

Generic tensor compaction along batch dimension.

Rows whose batchMapping entry is -1 are dropped; surviving rows are moved to their new batch indices.

Note

Assumes the batch dimension is the first dimension (dim 0).

Note

For in-place operation, pass the same tensor as both src and dst.

Parameters:
  • src – Source tensor (const input)

  • batchMapping – [oldActiveBatch] GPU tensor (const input), mapping[i] = newBatchIdx or -1

  • dst – Destination tensor (output, can be same as src for in-place operation)

  • oldActiveBatch – Number of batches before eviction

  • newActiveBatch – Number of batches after eviction

  • stream – CUDA stream

Throws:

std::runtime_error – if tensors are not located on the GPU, or tensor shapes are invalid

void trt_edgellm::kernel::compactKVCacheBatched(
    KVLayerInfo const *layerInfos,
    rt::Tensor const &batchMapping,
    rt::Tensor const &kvCacheLengths,
    int32_t numLayers,
    int32_t headDim,
    nvinfer1::DataType kvCacheType,
    int32_t maxKVHeads,
    int32_t maxBatchSize,
    int32_t oldActiveBatch,
    int32_t newActiveBatch,
    cudaStream_t stream
)#

Batched compaction across multiple layers in a single kernel launch.

All layers in the batch must share the same headDim (template-selected). Layers may have different numKVHeads and maxSeqLen.

Parameters:
  • layerInfos – [numLayers] GPU array of KVLayerInfo

  • batchMapping – [oldActiveBatch] GPU tensor

  • kvCacheLengths – [maxBatch] GPU tensor of sequence lengths

  • numLayers – Number of layers in this batch

  • headDim – Head dimension (same for all layers in batch)

  • kvCacheType – KV cache storage dtype (kHALF or kFP8); controls element size for stride calculation

  • maxKVHeads – Maximum numKVHeads across all layers (for grid sizing)

  • maxBatchSize – Max batch size

  • oldActiveBatch – Batches before eviction

  • newActiveBatch – Batches after eviction

  • stream – CUDA stream