Batch Evict Kernels#

struct KVLayerInfo#

Per-layer KV cache metadata for batched kernel operations.

Public Members

void *data#

Pointer to this layer's KV buffer [maxBatch, 2, numKVHeads, maxSeqLen, headDim].

int32_t numKVHeads#

Number of KV heads for this layer.

int32_t maxSeqLen#

Max sequence length for this layer.

void trt_edgellm::kernel::compactKVCacheSingleLayer(
    rt::Tensor &kvCacheLayer,
    rt::Tensor const &batchMapping,
    rt::Tensor const &kvCacheLengths,
    rt::Tensor &dstKVCacheLengths,
    int32_t oldActiveBatch,
    int32_t newActiveBatch,
    bool updateLengths,
    cudaStream_t stream
)#

Compact a single layer’s KV cache by removing evicted batches.

Single-layer variant of compactKVCache for heterogeneous per-layer KV caches, where layers may differ in numKVHeads and maxSeqLen.

Parameters:
  • kvCacheLayer – [maxBatch, 2, numKVHeads, maxSeq, headDim] single-layer buffer (in/out)

  • batchMapping – [oldActiveBatch] GPU tensor, mapping[i] = newBatchIdx or -1 (evict)

  • kvCacheLengths – [maxBatch] GPU tensor of sequence lengths (const input)

  • dstKVCacheLengths – [maxBatch] GPU tensor for compacted lengths (output, may alias kvCacheLengths)

  • oldActiveBatch – Number of batches before eviction

  • newActiveBatch – Number of batches after eviction

  • updateLengths – If true, update dstKVCacheLengths (only the first layer should do this)

  • stream – CUDA stream

void trt_edgellm::kernel::compactTensorBatch(
    rt::Tensor const &src,
    rt::Tensor const &batchMapping,
    rt::Tensor &dst,
    int32_t oldActiveBatch,
    int32_t newActiveBatch,
    cudaStream_t stream
)#

Generic tensor compaction along batch dimension.

Rows whose batchMapping entry is -1 are dropped; surviving rows are moved to their new batch indices.

Note

Assumes the batch dimension is the first dimension (dim 0).

Note

For in-place operation, pass the same tensor as both src and dst.

Parameters:
  • src – Source tensor (const input)

  • batchMapping – [oldActiveBatch] GPU tensor (const input), mapping[i] = newBatchIdx or -1

  • dst – Destination tensor (output, can be same as src for in-place operation)

  • oldActiveBatch – Number of batches before eviction

  • newActiveBatch – Number of batches after eviction

  • stream – CUDA stream

Throws:

std::runtime_error – if tensors are not located on the GPU, or tensor shapes are invalid

void trt_edgellm::kernel::compactKVCacheBatched(
    KVLayerInfo const *layerInfos,
    rt::Tensor const &batchMapping,
    rt::Tensor const &kvCacheLengths,
    int32_t numLayers,
    int32_t headDim,
    nvinfer1::DataType kvCacheType,
    int32_t maxKVHeads,
    int32_t maxBatchSize,
    int32_t oldActiveBatch,
    int32_t newActiveBatch,
    cudaStream_t stream
)#

Batched compaction across multiple layers in a single kernel launch.

All layers in the batch must share the same headDim (template-selected). Layers may have different numKVHeads and maxSeqLen.

Parameters:
  • layerInfos – [numLayers] GPU array of KVLayerInfo

  • batchMapping – [oldActiveBatch] GPU tensor

  • kvCacheLengths – [maxBatch] GPU tensor of sequence lengths

  • numLayers – Number of layers in this batch

  • headDim – Head dimension (same for all layers in batch)

  • kvCacheType – KV cache storage dtype (kHALF or kFP8); controls element size for stride calculation

  • maxKVHeads – Maximum numKVHeads across all layers (for grid sizing)

  • maxBatchSize – Max batch size

  • oldActiveBatch – Batches before eviction

  • newActiveBatch – Batches after eviction

  • stream – CUDA stream