Util Kernels#

void trt_edgellm::kernel::calCuQCuKVSeqLensAndKVEndIdxs(
    rt::Tensor const &inputSeqLen,
    rt::Tensor const &kvCacheStartIndices,
    rt::Tensor &cuQSeqLens,
    rt::Tensor &cuKVSeqLens,
    rt::Tensor &kvCacheEndIdxs,
    rt::OptionalOutputTensor paddedCuKVSeqLens,
    int32_t const runtimeSeqLen,
    cudaStream_t stream
)#

Host-side wrapper that launches a lightweight CUDA kernel to compute exclusive prefix sums of sequence lengths and the KV cache end indices.

Note

kvCacheStartIndices is optional. If it is not provided, the KV cache start indices are assumed to be 0.

Parameters:
  • inputSeqLen[in] int32_t tensor with shape [B]. Actual token length of each request.

  • kvCacheStartIndices[in] int32_t tensor with shape [B]. Start index of KV cache for each request. (optional, pass in empty tensor to indicate zero start indices)

  • cuQSeqLens[out] int32_t tensor with shape [B+1]. Exclusive prefix-sum of inputSeqLen.

  • cuKVSeqLens[out] int32_t tensor with shape [B+1]. Exclusive prefix-sum of (kvCacheStartIndices[i] + inputSeqLen[i]). If kvCacheStartIndices is empty, this will be exclusive prefix-sum of inputSeqLen.

  • kvCacheEndIdxs[out] int32_t tensor with shape [B]. Each element equals kvCacheStartIndices[i] + runtimeSeqLen (the padded length runtimeSeqLen is used instead of the actual inputSeqLen[i] to simplify subsequent kernel launches).

  • paddedCuKVSeqLens[out] (optional) int32_t tensor with shape [B+1]. Exclusive prefix-sum of kvCacheEndIdxs (i.e., kvCacheStartIndices[i] + runtimeSeqLen per batch). Pass std::nullopt to skip. Background: the CuTe DSL FMHA kernel uses bottom_right_align with offset = s_k - s_q. Q is padded to runtimeSeqLen for all batches, so the padded KV lengths (s_k = kvCacheEndIdxs[i] per batch) must be used to keep the offset non-negative. Using the actual s_k (< runtimeSeqLen for shorter batches) would produce a negative offset that masks out valid KV positions, breaking attention.

  • runtimeSeqLen[in] Runtime sequence length (equal to the maximum of inputSeqLen).

  • stream[in] CUDA stream used to launch the kernel.

Throws:

std::runtime_error – if tensor shapes are invalid.

void trt_edgellm::kernel::cvtKVLayoutBHSDToSplitKV(
    rt::Tensor const &src,
    rt::Tensor &kDst,
    rt::Tensor &vDst,
    rt::Tensor const &kvScaleQuantOrig,
    cudaStream_t stream
)#

Converts KV cache layout from [B, 2, H, S, D] into separate K and V tensors of shape [B, S, H, D].

Splits the interleaved KV source into two independent FP16 output tensors, applying FP8 dequantization when the source is FP8. Used in the chunked-prefill path so that the SEPARATE_Q_K_V FMHA kernels receive separate K and V pointers.

Parameters:
  • src[in] Source tensor with shape [B, 2, H, S, D].

  • kDst[out] Destination K tensor with shape [B, S, H, D] (FP16).

  • vDst[out] Destination V tensor with shape [B, S, H, D] (FP16).

  • kvScaleQuantOrig[in] Optional packed dequant scale tensor for FP8 KV cache (shape [2], float). Layout: [kScaleQuantOrig, vScaleQuantOrig]. Pass an empty tensor for FP16 src.

  • stream[in] CUDA stream to launch the kernel on.

Throws:

std::runtime_error – if tensor shapes or data types are invalid.
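The index mapping behind this split can be sketched on the CPU. The helper below is a hypothetical reference using float buffers for the FP16 path only (FP8 dequantization via kvScaleQuantOrig and the CUDA launch are omitted); it illustrates the [B, 2, H, S, D] to [B, S, H, D] transpose documented above, not the kernel's actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: split interleaved KV [B, 2, H, S, D]
// (K at index 0 of dim 1, V at index 1) into K and V of shape [B, S, H, D].
void splitKVLayoutRef(std::vector<float> const& src, // [B, 2, H, S, D]
                      std::vector<float>& kDst,      // [B, S, H, D]
                      std::vector<float>& vDst,      // [B, S, H, D]
                      int B, int H, int S, int D)
{
    kDst.assign(static_cast<size_t>(B) * S * H * D, 0.f);
    vDst.assign(static_cast<size_t>(B) * S * H * D, 0.f);
    for (int b = 0; b < B; ++b)
        for (int h = 0; h < H; ++h)
            for (int s = 0; s < S; ++s)
                for (int d = 0; d < D; ++d) {
                    // Row-major offsets into [B, 2, H, S, D] for K (kv=0) and V (kv=1).
                    size_t const srcK = (((static_cast<size_t>(b) * 2 + 0) * H + h) * S + s) * D + d;
                    size_t const srcV = (((static_cast<size_t>(b) * 2 + 1) * H + h) * S + s) * D + d;
                    // Row-major offset into [B, S, H, D]: the H and S axes swap.
                    size_t const dst = ((static_cast<size_t>(b) * S + s) * H + h) * D + d;
                    kDst[dst] = src[srcK];
                    vDst[dst] = src[srcV];
                }
}
```

The destination layout [B, S, H, D] is what the SEPARATE_Q_K_V FMHA path expects: sequence-major per batch, with heads contiguous within each token.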