Apply Rope Write KV

void trt_edgellm::kernel::launchApplyRopeWriteKVPackedQKV(
    rt::Tensor const &cosSinCache,
    rt::Tensor &qkv,
    rt::Tensor &kvCache,
    cudaStream_t stream
)

Launch the kernel that handles the case where the KVCache is empty. The KVCache is populated and the QKV tensor is overwritten in place.

Parameters:
  • cosSinCache[in] FP32 tensor with layout [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim].

  • qkv[inout] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq + Hk + Hv, headDim]; the tensor is updated in place.

  • kvCache[out] FP16 tensor with layout [batchSize, 2, Hkv, kvCacheCapacity, headDim]; the KVCache is written from the start position (see the index sketch after this list).

  • stream[in] CUDA stream on which to launch the kernel.
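The two layouts above are easiest to reason about as flat row-major offsets. The helpers below illustrate that indexing under the assumption that dimensions are packed with headDim contiguous; they are not taken from the kernel source, but can be handy for validating inputs and outputs on the host.

// Illustrative flat-index helpers for the documented layouts, assuming
// row-major ordering with headDim contiguous (an assumption, not code
// from the kernel).
#include <cstdint>

// qkv: [batchSize, runtimeSeqLen, Hq + Hk + Hv, headDim]
inline int64_t qkvIndex(int64_t b, int64_t s, int64_t head, int64_t d,
                        int64_t runtimeSeqLen, int64_t numPackedHeads, int64_t headDim)
{
    return ((b * runtimeSeqLen + s) * numPackedHeads + head) * headDim + d;
}

// kvCache: [batchSize, 2, Hkv, kvCacheCapacity, headDim]; kOrV is 0 for K, 1 for V.
inline int64_t kvCacheIndex(int64_t b, int64_t kOrV, int64_t h, int64_t pos, int64_t d,
                            int64_t Hkv, int64_t kvCacheCapacity, int64_t headDim)
{
    return (((b * 2 + kOrV) * Hkv + h) * kvCacheCapacity + pos) * headDim + d;
}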

void trt_edgellm::kernel::launchApplyRopeWriteKVContinuousQAndKVCache(
    rt::Tensor const &cosSinCache,
    rt::Tensor const &kvCacheEndLens,
    rt::Tensor &qkv,
    rt::Tensor &kvCache,
    rt::Tensor &qOut,
    cudaStream_t stream
)

Launch the kernel that handles the case where the KVCache is not empty. Results are written to a dedicated Q tensor and to the KVCache; a CPU reference sketch of this write pattern follows the parameter list.

Note

The QKV tensor is not overwritten in this case, but the non-const Tensor & signature is kept to avoid duplicating code.

Parameters:
  • cosSinCache[in] FP32 tensor with layout [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim].

  • kvCacheEndLens[in] INT32 tensor with layout [batchSize]; the end position of the KVCache after writing.

  • qkv[in] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq + Hk + Hv, headDim].

  • kvCache[out] FP16 tensor with layout [batchSize, 2, Hkv, kvCacheCapacity, headDim]; the KVCache is written from the current end position.

  • qOut[out] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq, headDim]; the output Q tensor.

  • stream[in] CUDA stream on which to launch the kernel.
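The sketch below is a CPU illustration of the write pattern, not the CUDA kernel: the rotary rotation applied to Q and K is omitted, the packed head order Q | K | V with Hk = Hv = Hkv is an assumption, and the start position is derived from kvCacheEndLens being the cache length after this call (so the last new token lands at kvCacheEndLens[b] - 1).

// CPU reference sketch (illustration only; RoPE omitted). Assumes packed
// head order Q | K | V with Hk = Hv = Hkv, row-major layouts as documented,
// and that writing starts at kvCacheEndLens[b] - runtimeSeqLen.
#include <cstdint>
#include <cuda_fp16.h>

void writeQAndKvReference(__half const* qkv, int32_t const* kvCacheEndLens,
                          __half* kvCache, __half* qOut,
                          int64_t batchSize, int64_t runtimeSeqLen,
                          int64_t Hq, int64_t Hkv, int64_t headDim, int64_t kvCacheCapacity)
{
    int64_t const numPackedHeads = Hq + 2 * Hkv;
    for (int64_t b = 0; b < batchSize; ++b)
        for (int64_t s = 0; s < runtimeSeqLen; ++s)
        {
            int64_t const pos = kvCacheEndLens[b] - runtimeSeqLen + s; // cache slot for this token
            for (int64_t h = 0; h < Hq; ++h)   // Q goes to the dedicated output tensor
                for (int64_t d = 0; d < headDim; ++d)
                    qOut[((b * runtimeSeqLen + s) * Hq + h) * headDim + d] =
                        qkv[((b * runtimeSeqLen + s) * numPackedHeads + h) * headDim + d];
            for (int64_t h = 0; h < Hkv; ++h)  // K (kOrV = 0) and V (kOrV = 1) go into the cache
                for (int64_t d = 0; d < headDim; ++d)
                {
                    kvCache[(((b * 2 + 0) * Hkv + h) * kvCacheCapacity + pos) * headDim + d] =
                        qkv[((b * runtimeSeqLen + s) * numPackedHeads + Hq + h) * headDim + d];
                    kvCache[(((b * 2 + 1) * Hkv + h) * kvCacheCapacity + pos) * headDim + d] =
                        qkv[((b * runtimeSeqLen + s) * numPackedHeads + Hq + Hkv + h) * headDim + d];
                }
        }
}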

void trt_edgellm::kernel::launchApplyRopeWriteKVTreeDecoding(
    rt::Tensor const &cosSinCache,
    rt::Tensor const &kvCacheEndLens,
    rt::Tensor const &tokenPosIds,
    rt::Tensor &qkv,
    rt::Tensor &kvCache,
    rt::Tensor &qOut,
    cudaStream_t stream
)

Launch the kernel used when performing tree attention for speculative decoding; a sketch of how tokenPosIds selects the rotary row follows the parameter list.

Note

The QKV tensor is not overwritten in this case, but the non-const Tensor & signature is kept to avoid duplicating code.

Parameters:
  • cosSinCache[in] FP32 tensor with layout [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim].

  • kvCacheEndLens[in] INT32 tensor with layout [batchSize]; the end position of the KVCache after writing.

  • tokenPosIds[in] INT32 tensor with layout [batchSize, runtimeSeqLen]; the position of each token within its sequence.

  • qkv[in] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq + Hk + Hv, headDim].

  • kvCache[out] FP16 tensor with layout [batchSize, 2, Hkv, kvCacheCapacity, headDim]; the KVCache is written from the current end position.

  • qOut[out] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq, headDim]; the output Q tensor.

  • stream[in] CUDA stream on which to launch the kernel.
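What distinguishes the tree-decoding path is that draft tokens do not sit at consecutive positions, so tokenPosIds, rather than a running counter, determines which cos/sin row is used for each token. The lookup below is an illustration of that, not the kernel's code; the cosSinBatchIdx parameter is a hypothetical way to handle cosSinCacheBatchSize > 1.

// Illustration of how tokenPosIds would typically select the rotary row
// (assumption, not taken from the kernel source).
#include <cstdint>

// cosSinCache: FP32 [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim]
// tokenPosIds: INT32 [batchSize, runtimeSeqLen]
// Returns the flat offset of the cos/sin row used for token s of batch b.
inline int64_t rotaryRowOffset(int32_t const* tokenPosIds,
                               int64_t b, int64_t s, int64_t runtimeSeqLen,
                               int64_t cosSinCacheSeqLen, int64_t rotaryDim,
                               int64_t cosSinBatchIdx)
{
    int64_t const pos = tokenPosIds[b * runtimeSeqLen + s]; // token's position within its sequence
    return (cosSinBatchIdx * cosSinCacheSeqLen + pos) * rotaryDim;
}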