Apply Rope Write KV
- void trt_edgellm::kernel::launchApplyRopeWriteKVPackedQKV(
- rt::Tensor const &cosSinCache,
- rt::Tensor &qkv,
- rt::Tensor &kvCache,
- cudaStream_t stream)
Launch the kernel for the case where the KVCache is empty. The kernel instantiates the KVCache and overwrites the QKV tensor directly in place; a usage sketch follows the parameter list.
- Parameters:
cosSinCache – [in] FP32 tensor with layout [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim]
qkv – [inout] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq + Hk + Hv, headDim]; the tensor is updated in place.
kvCache – [out] FP16 tensor with layout [batchSize, 2, Hkv, kvCacheCapacity, headDim]; the KVCache is written from the start position.
stream – [in] CUDA stream on which to launch the kernel
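A minimal usage sketch for the empty-cache (prefill) path is shown below. The wrapper function, its name, and the layout comments are illustrative assumptions; only the launchApplyRopeWriteKVPackedQKV call itself follows the documented signature.

```cpp
#include <cuda_runtime.h>

// Prefill path (sketch): the KVCache is empty, so the packed-QKV kernel applies
// RoPE to Q/K in place inside the qkv tensor and writes K/V into the cache from
// the start position. Layouts restate the documented parameters:
//   cosSinCache: FP32 [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim]
//   qkv:         FP16 [batchSize, runtimeSeqLen, Hq + Hk + Hv, headDim] (updated in place)
//   kvCache:     FP16 [batchSize, 2, Hkv, kvCacheCapacity, headDim]
void runPrefillRope(rt::Tensor const& cosSinCache, rt::Tensor& qkv,
                    rt::Tensor& kvCache, cudaStream_t stream)
{
    trt_edgellm::kernel::launchApplyRopeWriteKVPackedQKV(cosSinCache, qkv, kvCache, stream);
    // Downstream attention reads Q from the updated qkv tensor and K/V from kvCache.
}
```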
- void trt_edgellm::kernel::launchApplyRopeWriteKVContinuousQAndKVCache(
- rt::Tensor const &cosSinCache,
- rt::Tensor const &kvCacheEndLens,
- rt::Tensor &qkv,
- rt::Tensor &kvCache,
- rt::Tensor &qOut,
- cudaStream_t stream)
Launch the kernel for the case where the KVCache is not empty. The rotated Q is written to a dedicated output tensor and the new K/V entries are written to the KVCache; a dispatch sketch follows the parameter list.
Note
The QKV tensor is not overwritten in this case, but the non-const Tensor& signature is kept to reduce code duplication.
- Parameters:
cosSinCache – [in] FP32 tensor with layout [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim]
kvCacheEndLens – [in] INT32 tensor with layout [batchSize]; the end position of the KVCache after writing.
qkv – [in] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq + Hk + Hv, headDim]
kvCache – [out] FP16 tensor with layout [batchSize, 2, Hkv, kvCacheCapacity, headDim]; new entries are written so that the cache ends at the given end position.
qOut – [out] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq, headDim]; the output Q tensor.
stream – [in] CUDA stream on which to launch the kernel
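A sketch of how a caller might dispatch between the empty-cache and non-empty-cache launchers is shown below. The isKvCacheEmpty flag and the wrapper function are assumptions for illustration; only the two launch calls follow the documented signatures.

```cpp
#include <cuda_runtime.h>

// Chooses the appropriate RoPE + KV-write launcher based on whether the KVCache
// already holds earlier tokens (hypothetical caller-side bookkeeping).
void applyRopeAndWriteKV(bool isKvCacheEmpty,
                         rt::Tensor const& cosSinCache,
                         rt::Tensor const& kvCacheEndLens,
                         rt::Tensor& qkv,
                         rt::Tensor& kvCache,
                         rt::Tensor& qOut,
                         cudaStream_t stream)
{
    if (isKvCacheEmpty)
    {
        // Prefill: qkv is rotated in place and the cache is written from the start.
        trt_edgellm::kernel::launchApplyRopeWriteKVPackedQKV(cosSinCache, qkv, kvCache, stream);
    }
    else
    {
        // Generation: qkv is left untouched, rotated Q goes to qOut, and new K/V
        // entries are written so the cache ends at kvCacheEndLens.
        trt_edgellm::kernel::launchApplyRopeWriteKVContinuousQAndKVCache(
            cosSinCache, kvCacheEndLens, qkv, kvCache, qOut, stream);
    }
}
```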
- void trt_edgellm::kernel::launchApplyRopeWriteKVTreeDecoding(
- rt::Tensor const &cosSinCache,
- rt::Tensor const &kvCacheEndLens,
- rt::Tensor const &tokenPosIds,
- rt::Tensor &qkv,
- rt::Tensor &kvCache,
- rt::Tensor &qOut,
- cudaStream_t stream)
Launch the kernel when performing tree attention for speculative decoding; a usage sketch follows the parameter list.
Note
The QKV tensor is not overwritten in this case, but the non-const Tensor& signature is kept to reduce code duplication.
- Parameters:
cosSinCache – [in] FP32 tensor with layout [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim]
kvCacheEndLens – [in] INT32 tensor with layout [batchSize]; the end position of the KVCache after writing.
tokenPosIds – [in] INT32 tensor with layout [batchSize, runtimeSeqLen]; the position of each token within its sequence.
qkv – [in] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq + Hk + Hv, headDim]
kvCache – [out] FP16 tensor with layout [batchSize, 2, Hkv, kvCacheCapacity, headDim]; new entries are written so that the cache ends at the given end position.
qOut – [out] FP16 tensor with layout [batchSize, runtimeSeqLen, Hq, headDim]; the output Q tensor.
stream – [in] CUDA stream on which to launch the kernel
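A usage sketch for the tree-decoding path is shown below. The comments describing how tokenPosIds is filled assume a typical speculative-decoding draft tree and the wrapper function is hypothetical; only the launch call follows the documented signature.

```cpp
#include <cuda_runtime.h>

// Tree attention for speculative decoding (sketch): the draft tree is flattened
// into runtimeSeqLen token slots, and tokenPosIds supplies each token's position
// within the sequence so RoPE uses the tree depth rather than the flat slot index
// (e.g. all children of a node at position p would share position p + 1).
void runTreeDecodingRope(rt::Tensor const& cosSinCache,
                         rt::Tensor const& kvCacheEndLens,
                         rt::Tensor const& tokenPosIds,  // INT32 [batchSize, runtimeSeqLen]
                         rt::Tensor& qkv,
                         rt::Tensor& kvCache,
                         rt::Tensor& qOut,
                         cudaStream_t stream)
{
    trt_edgellm::kernel::launchApplyRopeWriteKVTreeDecoding(
        cosSinCache, kvCacheEndLens, tokenPosIds, qkv, kvCache, qOut, stream);
    // qOut holds the rotated Q for the tree tokens; their K/V are written into
    // kvCache at the slots implied by kvCacheEndLens.
}
```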