Util Kernels
- void trt_edgellm::kernel::calCuQCuKVSeqLensAndKVEndIdxs(
- rt::Tensor const &inputSeqLen,
- rt::Tensor const &kvCacheStartIndices,
- rt::Tensor &cuQSeqLens,
- rt::Tensor &cuKVSeqLens,
- rt::Tensor &kvCacheEndIdxs,
- rt::OptionalOutputTensor paddedCuKVSeqLens,
- int32_t const runtimeSeqLen,
- cudaStream_t stream)
Host-side wrapper that launches a lightweight CUDA kernel to compute prefix-sum of sequence lengths and KV cache end indices.
Note
kvCacheStartIndices is optional. If it is not provided, the KV cache start indices are assumed to be 0.
- Parameters:
inputSeqLen – [in] int32_t tensor with shape [B]. Actual token length of each request.
kvCacheStartIndices – [in] int32_t tensor with shape [B]. Start index of KV cache for each request. (optional, pass in empty tensor to indicate zero start indices)
cuQSeqLens – [out] int32_t tensor with shape [B+1]. Exclusive prefix-sum of inputSeqLen.
cuKVSeqLens – [out] int32_t tensor with shape [B+1]. Exclusive prefix-sum of (kvCacheStartIndices[i] + inputSeqLen[i]). If kvCacheStartIndices is empty, this will be exclusive prefix-sum of inputSeqLen.
kvCacheEndIdxs – [out] int32_t tensor with shape [B]. Each element equals kvCacheStartIndices[i] + runtimeSeqLen (padded to runtimeSeqLen rather than the actual input length to simplify later kernel launches).
paddedCuKVSeqLens – [out] (optional) int32_t tensor with shape [B+1]. Exclusive prefix-sum of kvCacheEndIdxs (i.e. kvCacheStartIndices[i] + runtimeSeqLen per batch). Pass std::nullopt to skip. Background: the CuTe DSL FMHA kernel uses bottom_right_align with offset = s_k - s_q. Q is padded to runtimeSeqLen for all batches, so the padded KV lengths (s_k = kvCacheEndIdxs[i] per batch) must be used to keep the offset non-negative. Using the actual s_k (which is less than runtimeSeqLen for shorter batches) would produce a negative offset that masks out valid KV positions, breaking attention.
runtimeSeqLen – [in] Runtime sequence length (equal to the maximum of inputSeqLen).
stream – [in] CUDA stream used to launch the kernel.
- Throws:
std::runtime_error – if tensor shapes are invalid.
- void trt_edgellm::kernel::cvtKVLayoutBHSDToSplitKV(
- rt::Tensor const &src,
- rt::Tensor &kDst,
- rt::Tensor &vDst,
- rt::Tensor const &kvScaleQuantOrig,
- cudaStream_t stream)
Converts KV cache layout from [B, 2, H, S, D] into separate K and V tensors of shape [B, S, H, D].
Splits the interleaved KV source into two independent FP16 output tensors, applying FP8 dequantization when the source is FP8. Used in the chunked-prefill path so that the SEPARATE_Q_K_V FMHA kernels receive separate K and V pointers.
- Parameters:
src – [in] Source tensor with shape [B, 2, H, S, D].
kDst – [out] Destination K tensor with shape [B, S, H, D] (FP16).
vDst – [out] Destination V tensor with shape [B, S, H, D] (FP16).
kvScaleQuantOrig – [in] Optional packed dequant scale tensor for FP8 KV cache (shape [2], float). Layout: [kScaleQuantOrig, vScaleQuantOrig]. Pass an empty tensor for FP16 src.
stream – [in] CUDA stream to launch the kernel on.
- Throws:
std::runtime_error – if tensor shapes or data types are invalid.