Util Kernels#
- void trt_edgellm::kernel::calCuQCuKVSeqLensAndKVEndIdxs(
- rt::Tensor const &inputSeqLen,
- rt::Tensor const &kvCacheStartIndices,
- rt::Tensor &cuQSeqLens,
- rt::Tensor &cuKVSeqLens,
- rt::Tensor &kvCacheEndIdxs,
- int32_t const runtimeSeqLen,
- cudaStream_t stream)#
Host-side wrapper that launches a lightweight CUDA kernel to compute the exclusive prefix sums of the query and KV sequence lengths, along with the KV cache end indices.
Note
kvCacheStartIndices is optional. If it is not provided, the start indices are assumed to be 0.
- Parameters:
inputSeqLen – [in] int32_t tensor with shape [B]. Actual token length of each request.
kvCacheStartIndices – [in] int32_t tensor with shape [B]. Start index of KV cache for each request. (Optional; pass an empty tensor to indicate zero start indices.)
cuQSeqLens – [out] int32_t tensor with shape [B+1]. Exclusive prefix-sum of inputSeqLen.
cuKVSeqLens – [out] int32_t tensor with shape [B+1]. Exclusive prefix-sum of (kvCacheStartIndices[i] + inputSeqLen[i]). If kvCacheStartIndices is empty, this will be exclusive prefix-sum of inputSeqLen.
kvCacheEndIdxs – [out] int32_t tensor with shape [B]. Each element equals kvCacheStartIndices[i] + runtimeSeqLen; padding every request to runtimeSeqLen simplifies subsequent kernel launches.
runtimeSeqLen – [in] Runtime sequence length (equal to the maximum of inputSeqLen).
stream – [in] CUDA stream used to launch the kernel.
- void trt_edgellm::kernel::cvtKVLayoutBHSDToBSHD()#
Converts the KV cache from BHSD layout ([B, 2, H, S, D]) to BSHD layout ([B, S, 2, H, D]) for attention computation.
- Template Parameters:
T – Element type (e.g. float, half, bfloat16, etc.).
- Parameters:
src – [in] Source tensor with shape [B, 2, H, S, D].
dst – [out] Destination tensor with shape [B, S, 2, H, D].
stream – [in] CUDA stream used to launch the kernel.
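The index mapping of this permutation can be sketched with a host-side C++ reference (the helper name and flat-vector interface are illustrative; the real kernel is templated on element type T and performs the transpose on the GPU; float is used here for brevity):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for the BHSD -> BSHD permutation (illustrative only).
// src is a flat buffer with logical shape [B, 2, H, S, D]; the returned
// buffer has logical shape [B, S, 2, H, D].
std::vector<float> cvtBHSDToBSHDReference(std::vector<float> const& src,
                                          size_t B, size_t H, size_t S, size_t D)
{
    std::vector<float> dst(src.size());
    for (size_t b = 0; b < B; ++b)
        for (size_t kv = 0; kv < 2; ++kv)          // 0 = key plane, 1 = value plane
            for (size_t h = 0; h < H; ++h)
                for (size_t s = 0; s < S; ++s)
                    for (size_t d = 0; d < D; ++d)
                    {
                        // Linear offset into the [B, 2, H, S, D] source layout.
                        size_t const srcIdx = (((b * 2 + kv) * H + h) * S + s) * D + d;
                        // Linear offset into the [B, S, 2, H, D] destination layout.
                        size_t const dstIdx = (((b * S + s) * 2 + kv) * H + h) * D + d;
                        dst[dstIdx] = src[srcIdx];
                    }
    return dst;
}
```

Only the index arithmetic changes between the two layouts; every element is copied exactly once, so the conversion is a pure gather/scatter with no arithmetic on the values.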