Util Kernels#

void trt_edgellm::kernel::calCuQCuKVSeqLensAndKVEndIdxs(
    rt::Tensor const &inputSeqLen,
    rt::Tensor const &kvCacheStartIndices,
    rt::Tensor &cuQSeqLens,
    rt::Tensor &cuKVSeqLens,
    rt::Tensor &kvCacheEndIdxs,
    int32_t const runtimeSeqLen,
    cudaStream_t stream
)#

Host-side wrapper that launches a lightweight CUDA kernel to compute the exclusive prefix-sums of the sequence lengths and the KV cache end indices.

Note

kvCacheStartIndices is optional. If it is not provided, the start indices are assumed to be 0.

Parameters:
  • inputSeqLen[in] int32_t tensor with shape [B]. Actual token length of each request.

  • kvCacheStartIndices[in] int32_t tensor with shape [B]. Start index of the KV cache for each request. (Optional; pass an empty tensor to indicate zero start indices.)

  • cuQSeqLens[out] int32_t tensor with shape [B+1]. Exclusive prefix-sum of inputSeqLen.

  • cuKVSeqLens[out] int32_t tensor with shape [B+1]. Exclusive prefix-sum of (kvCacheStartIndices[i] + inputSeqLen[i]). If kvCacheStartIndices is empty, this is the exclusive prefix-sum of inputSeqLen.

  • kvCacheEndIdxs[out] int32_t tensor with shape [B]. Each element equals kvCacheStartIndices[i] + runtimeSeqLen; padding every request to runtimeSeqLen simplifies the subsequent kernel launch.

  • runtimeSeqLen[in] Runtime sequence length, equal to the maximum element of inputSeqLen.

  • stream[in] CUDA stream used to launch the kernel.
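
The following is a minimal host-side sketch of the arithmetic this kernel performs, useful as a reference when validating outputs. It is illustrative only: std::vector stands in for rt::Tensor, and computeSeqLensReference is a hypothetical name, not part of the library.

```cpp
#include <cstdint>
#include <vector>

struct SeqLenResults
{
    std::vector<int32_t> cuQSeqLens;     // [B+1] exclusive prefix-sum of inputSeqLen
    std::vector<int32_t> cuKVSeqLens;    // [B+1] exclusive prefix-sum of KV lengths
    std::vector<int32_t> kvCacheEndIdxs; // [B]   start index + runtimeSeqLen
};

// Hypothetical CPU reference for calCuQCuKVSeqLensAndKVEndIdxs.
SeqLenResults computeSeqLensReference(
    std::vector<int32_t> const &inputSeqLen,
    std::vector<int32_t> const &kvCacheStartIndices, // empty => zero start indices
    int32_t runtimeSeqLen)
{
    size_t const batch = inputSeqLen.size();
    SeqLenResults out;
    out.cuQSeqLens.assign(batch + 1, 0);
    out.cuKVSeqLens.assign(batch + 1, 0);
    out.kvCacheEndIdxs.assign(batch, 0);

    for (size_t i = 0; i < batch; ++i)
    {
        int32_t const start = kvCacheStartIndices.empty() ? 0 : kvCacheStartIndices[i];
        out.cuQSeqLens[i + 1] = out.cuQSeqLens[i] + inputSeqLen[i];
        out.cuKVSeqLens[i + 1] = out.cuKVSeqLens[i] + start + inputSeqLen[i];
        // Padded to runtimeSeqLen (the max of inputSeqLen) rather than the
        // per-request length, which simplifies the subsequent kernel launch.
        out.kvCacheEndIdxs[i] = start + runtimeSeqLen;
    }
    return out;
}
```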

void trt_edgellm::kernel::cvtKVLayoutBHSDToBSHD(
    rt::Tensor const &src,
    rt::Tensor &dst,
    cudaStream_t stream
)#

Converts the KV cache from BHSD layout to BSHD layout for attention computation.

That is, an input tensor of shape [B, 2, H, S, D] is converted to [B, S, 2, H, D].

Template Parameters:

T – Element type (e.g., float, half, or bfloat16).

Parameters:
  • src[in] Source tensor with shape [B, 2, H, S, D].

  • dst[out] Destination tensor with shape [B, S, 2, H, D].

  • stream[in] CUDA stream used to launch the kernel.
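
For illustration, here is a minimal CUDA sketch of the index remapping behind this conversion, assuming a naive one-thread-per-element kernel. The kernel name and launch scheme are hypothetical; the actual implementation may be vectorized or tiled differently.

```cpp
#include <cuda_runtime.h>

// Hypothetical elementwise [B, 2, H, S, D] -> [B, S, 2, H, D] copy kernel.
template <typename T>
__global__ void cvtBHSDToBSHDKernel(T const *src, T *dst, int B, int H, int S, int D)
{
    long long const idx = blockIdx.x * static_cast<long long>(blockDim.x) + threadIdx.x;
    long long const total = static_cast<long long>(B) * 2 * H * S * D;
    if (idx >= total)
        return;

    // Decompose the linear index under the source [B, 2, H, S, D] layout.
    int const d = static_cast<int>(idx % D);
    long long rest = idx / D;
    int const s = static_cast<int>(rest % S);
    rest /= S;
    int const h = static_cast<int>(rest % H);
    rest /= H;
    int const kv = static_cast<int>(rest % 2); // 0 = key, 1 = value
    int const b = static_cast<int>(rest / 2);

    // Recompose under the destination [B, S, 2, H, D] layout.
    long long const dstIdx = (((static_cast<long long>(b) * S + s) * 2 + kv) * H + h) * D + d;
    dst[dstIdx] = src[idx];
}
```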