Util Kernels#

void trt_edgellm::kernel::calCuQCuKVSeqLensAndKVEndIdxs(
    rt::Tensor const &inputSeqLen,
    rt::Tensor const &kvCacheStartIndices,
    rt::Tensor &cuQSeqLens,
    rt::Tensor &cuKVSeqLens,
    rt::Tensor &kvCacheEndIdxs,
    rt::OptionalOutputTensor paddedCuKVSeqLens,
    int32_t const runtimeSeqLen,
    cudaStream_t stream
)#

Host-side wrapper that launches a lightweight CUDA kernel to compute exclusive prefix sums of sequence lengths and the KV cache end indices.

Note

kvCacheStartIndices is optional. If it is not provided, the KV cache start indices are assumed to be 0.

Parameters:
  • inputSeqLen[in] int32_t tensor with shape [B]. Actual token length of each request.

  • kvCacheStartIndices[in] int32_t tensor with shape [B]. Start index of KV cache for each request. (optional, pass in empty tensor to indicate zero start indices)

  • cuQSeqLens[out] int32_t tensor with shape [B+1]. Exclusive prefix-sum of inputSeqLen.

  • cuKVSeqLens[out] int32_t tensor with shape [B+1]. Exclusive prefix-sum of (kvCacheStartIndices[i] + inputSeqLen[i]). If kvCacheStartIndices is empty, this will be exclusive prefix-sum of inputSeqLen.

  • kvCacheEndIdxs[out] int32_t tensor with shape [B]. Each element equals kvCacheStartIndices[i] + runtimeSeqLen (the padded length runtimeSeqLen is used instead of the actual inputSeqLen[i] to simplify subsequent kernel launches).

  • paddedCuKVSeqLens[out] (optional) int32_t tensor with shape [B+1]. Exclusive prefix-sum of kvCacheEndIdxs (i.e., kvCacheStartIndices[i] + runtimeSeqLen per batch). Pass std::nullopt to skip. Background: the CuTe DSL FMHA kernel uses bottom_right_align with offset = s_k - s_q. Q is padded to runtimeSeqLen for all batches, so the padded KV lengths (s_k = kvCacheEndIdxs[i] per batch) must be used to keep the offset non-negative. Using the actual s_k (< runtimeSeqLen for shorter batches) would produce a negative offset that masks out valid KV positions, breaking attention.

  • runtimeSeqLen[in] Runtime sequence length (equal to the maximum of inputSeqLen).

  • stream[in] CUDA stream used to launch the kernel.

Throws:

std::runtime_error – if tensor shapes are invalid.

void trt_edgellm::kernel::cvtKVLayoutBHSDToSplitKV(
    rt::Tensor const &src,
    rt::Tensor &kDst,
    rt::Tensor &vDst,
    rt::Tensor const &kvScaleQuantOrig,
    cudaStream_t stream
)#

Converts KV cache layout from [B, 2, H, S, D] into separate K and V tensors of shape [B, S, H, D].

Splits the interleaved KV source into two independent FP16 output tensors, applying FP8 dequantization when the source is FP8. Used in the chunked-prefill path so that the SEPARATE_Q_K_V FMHA kernels receive separate K and V pointers.

Parameters:
  • src[in] Source tensor with shape [B, 2, H, S, D].

  • kDst[out] Destination K tensor with shape [B, S, H, D] (FP16).

  • vDst[out] Destination V tensor with shape [B, S, H, D] (FP16).

  • kvScaleQuantOrig[in] Optional packed dequant scale tensor for FP8 KV cache (shape [2], float). Layout: [kScaleQuantOrig, vScaleQuantOrig]. Pass an empty tensor for FP16 src.

  • stream[in] CUDA stream to launch the kernel on.

Throws:

std::runtime_error – if tensor shapes or data types are invalid.
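The index mapping behind this split can be sketched on the CPU. The helper below is a hypothetical reference using float buffers for the FP16 path only (FP8 dequantization via kvScaleQuantOrig and the CUDA launch are omitted); it illustrates the [B, 2, H, S, D] to [B, S, H, D] transpose documented above, not the kernel's actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: split interleaved KV [B, 2, H, S, D]
// (K at index 0 of dim 1, V at index 1) into K and V of shape [B, S, H, D].
void splitKVLayoutRef(std::vector<float> const& src, // [B, 2, H, S, D]
                      std::vector<float>& kDst,      // [B, S, H, D]
                      std::vector<float>& vDst,      // [B, S, H, D]
                      int B, int H, int S, int D)
{
    kDst.assign(static_cast<size_t>(B) * S * H * D, 0.f);
    vDst.assign(static_cast<size_t>(B) * S * H * D, 0.f);
    for (int b = 0; b < B; ++b)
        for (int h = 0; h < H; ++h)
            for (int s = 0; s < S; ++s)
                for (int d = 0; d < D; ++d) {
                    // Row-major offsets into [B, 2, H, S, D] for K (kv=0) and V (kv=1).
                    size_t const srcK = (((static_cast<size_t>(b) * 2 + 0) * H + h) * S + s) * D + d;
                    size_t const srcV = (((static_cast<size_t>(b) * 2 + 1) * H + h) * S + s) * D + d;
                    // Row-major offset into [B, S, H, D]: the H and S axes swap.
                    size_t const dst = ((static_cast<size_t>(b) * S + s) * H + h) * D + d;
                    kDst[dst] = src[srcK];
                    vDst[dst] = src[srcV];
                }
}
```

The destination layout [B, S, H, D] is what the SEPARATE_Q_K_V FMHA path expects: sequence-major per batch, with heads contiguous within each token.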