Util Kernels#

void trt_edgellm::kernel::calCuQCuKVSeqLensAndKVEndIdxs(
    rt::Tensor const &inputSeqLen,
    rt::Tensor const &kvCacheStartIndices,
    rt::Tensor &cuQSeqLens,
    rt::Tensor &cuKVSeqLens,
    rt::Tensor &kvCacheEndIdxs,
    int32_t const runtimeSeqLen,
    cudaStream_t stream
)#

Host-side wrapper that launches a lightweight CUDA kernel to compute the exclusive prefix-sums of the sequence lengths and the KV cache end indices.

Note

kvCacheStartIndices is optional. If it is not provided, the start indices are assumed to be 0.

Parameters:
  • inputSeqLen[in] int32_t tensor with shape [B]. Actual token length of each request.

  • kvCacheStartIndices[in] int32_t tensor with shape [B]. Start index of the KV cache for each request. (Optional; pass an empty tensor to indicate zero start indices.)

  • cuQSeqLens[out] int32_t tensor with shape [B+1]. Exclusive prefix-sum of inputSeqLen.

  • cuKVSeqLens[out] int32_t tensor with shape [B+1]. Exclusive prefix-sum of (kvCacheStartIndices[i] + inputSeqLen[i]). If kvCacheStartIndices is empty, this is the exclusive prefix-sum of inputSeqLen.

  • kvCacheEndIdxs[out] int32_t tensor with shape [B]. Each element equals kvCacheStartIndices[i] + runtimeSeqLen; padding every request to runtimeSeqLen simplifies the subsequent kernel launch.

  • runtimeSeqLen[in] Runtime sequence length, equal to the maximum element of inputSeqLen.

  • stream[in] CUDA stream used to launch the kernel.
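
The following is a minimal host-side sketch of the arithmetic this kernel performs, useful as a reference when validating outputs. It is illustrative only: std::vector stands in for rt::Tensor, and computeSeqLensReference is a hypothetical name, not part of the library.

```cpp
#include <cstdint>
#include <vector>

struct SeqLenResults
{
    std::vector<int32_t> cuQSeqLens;     // [B+1] exclusive prefix-sum of inputSeqLen
    std::vector<int32_t> cuKVSeqLens;    // [B+1] exclusive prefix-sum of KV lengths
    std::vector<int32_t> kvCacheEndIdxs; // [B]   start index + runtimeSeqLen
};

// Hypothetical CPU reference for calCuQCuKVSeqLensAndKVEndIdxs.
SeqLenResults computeSeqLensReference(
    std::vector<int32_t> const &inputSeqLen,
    std::vector<int32_t> const &kvCacheStartIndices, // empty => zero start indices
    int32_t runtimeSeqLen)
{
    size_t const batch = inputSeqLen.size();
    SeqLenResults out;
    out.cuQSeqLens.assign(batch + 1, 0);
    out.cuKVSeqLens.assign(batch + 1, 0);
    out.kvCacheEndIdxs.assign(batch, 0);

    for (size_t i = 0; i < batch; ++i)
    {
        int32_t const start = kvCacheStartIndices.empty() ? 0 : kvCacheStartIndices[i];
        out.cuQSeqLens[i + 1] = out.cuQSeqLens[i] + inputSeqLen[i];
        out.cuKVSeqLens[i + 1] = out.cuKVSeqLens[i] + start + inputSeqLen[i];
        // Padded to runtimeSeqLen (the max of inputSeqLen) rather than the
        // per-request length, which simplifies the subsequent kernel launch.
        out.kvCacheEndIdxs[i] = start + runtimeSeqLen;
    }
    return out;
}
```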

void trt_edgellm::kernel::cvtKVLayoutBHSDToBSHD(
    rt::Tensor const &src,
    rt::Tensor &dst,
    cudaStream_t stream
)#

Converts the KV cache from BHSD layout to BSHD layout for attention computation.

That is, an input tensor of shape [B, 2, H, S, D] is converted to [B, S, 2, H, D].

Template Parameters:

T – Element type (e.g., float, half, or bfloat16).

Parameters:
  • src[in] Source tensor with shape [B, 2, H, S, D].

  • dst[out] Destination tensor with shape [B, S, 2, H, D].

  • stream[in] CUDA stream used to launch the kernel.
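
For illustration, here is a minimal CUDA sketch of the index remapping behind this conversion, assuming a naive one-thread-per-element kernel. The kernel name and launch scheme are hypothetical; the actual implementation may be vectorized or tiled differently.

```cpp
#include <cuda_runtime.h>

// Hypothetical elementwise [B, 2, H, S, D] -> [B, S, 2, H, D] copy kernel.
template <typename T>
__global__ void cvtBHSDToBSHDKernel(T const *src, T *dst, int B, int H, int S, int D)
{
    long long const idx = blockIdx.x * static_cast<long long>(blockDim.x) + threadIdx.x;
    long long const total = static_cast<long long>(B) * 2 * H * S * D;
    if (idx >= total)
        return;

    // Decompose the linear index under the source [B, 2, H, S, D] layout.
    int const d = static_cast<int>(idx % D);
    long long rest = idx / D;
    int const s = static_cast<int>(rest % S);
    rest /= S;
    int const h = static_cast<int>(rest % H);
    rest /= H;
    int const kv = static_cast<int>(rest % 2); // 0 = key, 1 = value
    int const b = static_cast<int>(rest / 2);

    // Recompose under the destination [B, S, 2, H, D] layout.
    long long const dstIdx = (((static_cast<long long>(b) * S + s) * 2 + kv) * H + h) * D + d;
    dst[dstIdx] = src[idx];
}
```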