Apply Rope Write KV

void trt_edgellm::kernel::launchApplyRopeWriteKVPackedQKV(
    rt::Tensor const &cosSinCache,
    rt::Tensor &qkv,
    rt::Tensor &kvCache,
    cudaStream_t stream
)

Launch the kernel that handles the case where the KVCache is empty. The KVCache is populated and the QKV tensor is overwritten in place.

Parameters:
  • cosSinCache[in] FP32 tensor with layout [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim].

  • qkv[inout] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq + Hk + Hv, headDim]; the tensor is updated in place.

  • kvCache[out] FP16 tensor with layout [batchSize, 2, Hkv, kvCacheCapacity, headDim]; the KVCache is written from the start position (see the index sketch after this list).

  • stream[in] CUDA stream on which to launch the kernel.
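The two layouts above are easiest to reason about as flat row-major offsets. The helpers below illustrate that indexing under the assumption that dimensions are packed with headDim contiguous; they are not taken from the kernel source, but can be handy for validating inputs and outputs on the host.

// Illustrative flat-index helpers for the documented layouts, assuming
// row-major ordering with headDim contiguous (an assumption, not code
// from the kernel).
#include <cstdint>

// qkv: [batchSize, runtimeSeqLen, Hq + Hk + Hv, headDim]
inline int64_t qkvIndex(int64_t b, int64_t s, int64_t head, int64_t d,
                        int64_t runtimeSeqLen, int64_t numPackedHeads, int64_t headDim)
{
    return ((b * runtimeSeqLen + s) * numPackedHeads + head) * headDim + d;
}

// kvCache: [batchSize, 2, Hkv, kvCacheCapacity, headDim]; kOrV is 0 for K, 1 for V.
inline int64_t kvCacheIndex(int64_t b, int64_t kOrV, int64_t h, int64_t pos, int64_t d,
                            int64_t Hkv, int64_t kvCacheCapacity, int64_t headDim)
{
    return (((b * 2 + kOrV) * Hkv + h) * kvCacheCapacity + pos) * headDim + d;
}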

void trt_edgellm::kernel::launchApplyRopeWriteKVContinuousQAndKVCache(
    rt::Tensor const &cosSinCache,
    rt::Tensor const &kvCacheEndLens,
    rt::Tensor &qkv,
    rt::Tensor &kvCache,
    rt::Tensor &qOut,
    cudaStream_t stream
)

Launch the kernel that handles the case where the KVCache is not empty. Results are written to a dedicated Q tensor and to the KVCache; a CPU reference sketch of this write pattern follows the parameter list.

Note

The QKV tensor is not overwritten in this case, but the non-const Tensor & signature is kept to avoid duplicating code.

Parameters:
  • cosSinCache[in] FP32 tensor with layout [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim].

  • kvCacheEndLens[in] INT32 tensor with layout [batchSize]; the end position of the KVCache after writing.

  • qkv[in] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq + Hk + Hv, headDim].

  • kvCache[out] FP16 tensor with layout [batchSize, 2, Hkv, kvCacheCapacity, headDim]; the KVCache is written from the current end position.

  • qOut[out] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq, headDim]; the output Q tensor.

  • stream[in] CUDA stream on which to launch the kernel.
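The sketch below is a CPU illustration of the write pattern, not the CUDA kernel: the rotary rotation applied to Q and K is omitted, the packed head order Q | K | V with Hk = Hv = Hkv is an assumption, and the start position is derived from kvCacheEndLens being the cache length after this call (so the last new token lands at kvCacheEndLens[b] - 1).

// CPU reference sketch (illustration only; RoPE omitted). Assumes packed
// head order Q | K | V with Hk = Hv = Hkv, row-major layouts as documented,
// and that writing starts at kvCacheEndLens[b] - runtimeSeqLen.
#include <cstdint>
#include <cuda_fp16.h>

void writeQAndKvReference(__half const* qkv, int32_t const* kvCacheEndLens,
                          __half* kvCache, __half* qOut,
                          int64_t batchSize, int64_t runtimeSeqLen,
                          int64_t Hq, int64_t Hkv, int64_t headDim, int64_t kvCacheCapacity)
{
    int64_t const numPackedHeads = Hq + 2 * Hkv;
    for (int64_t b = 0; b < batchSize; ++b)
        for (int64_t s = 0; s < runtimeSeqLen; ++s)
        {
            int64_t const pos = kvCacheEndLens[b] - runtimeSeqLen + s; // cache slot for this token
            for (int64_t h = 0; h < Hq; ++h)   // Q goes to the dedicated output tensor
                for (int64_t d = 0; d < headDim; ++d)
                    qOut[((b * runtimeSeqLen + s) * Hq + h) * headDim + d] =
                        qkv[((b * runtimeSeqLen + s) * numPackedHeads + h) * headDim + d];
            for (int64_t h = 0; h < Hkv; ++h)  // K (kOrV = 0) and V (kOrV = 1) go into the cache
                for (int64_t d = 0; d < headDim; ++d)
                {
                    kvCache[(((b * 2 + 0) * Hkv + h) * kvCacheCapacity + pos) * headDim + d] =
                        qkv[((b * runtimeSeqLen + s) * numPackedHeads + Hq + h) * headDim + d];
                    kvCache[(((b * 2 + 1) * Hkv + h) * kvCacheCapacity + pos) * headDim + d] =
                        qkv[((b * runtimeSeqLen + s) * numPackedHeads + Hq + Hkv + h) * headDim + d];
                }
        }
}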

void trt_edgellm::kernel::launchApplyRopeWriteKVTreeDecoding(
    rt::Tensor const &cosSinCache,
    rt::Tensor const &kvCacheEndLens,
    rt::Tensor const &tokenPosIds,
    rt::Tensor &qkv,
    rt::Tensor &kvCache,
    rt::Tensor &qOut,
    cudaStream_t stream
)

Launch the kernel used when performing tree attention for speculative decoding; a sketch of how tokenPosIds selects the rotary row follows the parameter list.

Note

The QKV tensor is not overwritten in this case, but the non-const Tensor & signature is kept to avoid duplicating code.

Parameters:
  • cosSinCache[in] FP32 tensor with layout [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim].

  • kvCacheEndLens[in] INT32 tensor with layout [batchSize]; the end position of the KVCache after writing.

  • tokenPosIds[in] INT32 tensor with layout [batchSize, runtimeSeqLen]; the position of each token within its sequence.

  • qkv[in] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq + Hk + Hv, headDim].

  • kvCache[out] FP16 tensor with layout [batchSize, 2, Hkv, kvCacheCapacity, headDim]; the KVCache is written from the current end position.

  • qOut[out] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq, headDim]; the output Q tensor.

  • stream[in] CUDA stream on which to launch the kernel.
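What distinguishes the tree-decoding path is that draft tokens do not sit at consecutive positions, so tokenPosIds, rather than a running counter, determines which cos/sin row is used for each token. The lookup below is an illustration of that, not the kernel's code; the cosSinBatchIdx parameter is a hypothetical way to handle cosSinCacheBatchSize > 1.

// Illustration of how tokenPosIds would typically select the rotary row
// (assumption, not taken from the kernel source).
#include <cstdint>

// cosSinCache: FP32 [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim]
// tokenPosIds: INT32 [batchSize, runtimeSeqLen]
// Returns the flat offset of the cos/sin row used for token s of batch b.
inline int64_t rotaryRowOffset(int32_t const* tokenPosIds,
                               int64_t b, int64_t s, int64_t runtimeSeqLen,
                               int64_t cosSinCacheSeqLen, int64_t rotaryDim,
                               int64_t cosSinBatchIdx)
{
    int64_t const pos = tokenPosIds[b * runtimeSeqLen + s]; // token's position within its sequence
    return (cosSinBatchIdx * cosSinCacheSeqLen + pos) * rotaryDim;
}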