Apply RoPE Write KV

void trt_edgellm::kernel::launchApplyRopeWriteKV(
    rt::Tensor const &cosSinCache,
    rt::OptionalInputTensor kvCacheEndLens,
    rt::Tensor &q,
    rt::Tensor &k,
    rt::Tensor const &v,
    rt::Tensor &kvCache,
    rt::Tensor const &kvScaleQuantOrig,
    cudaStream_t stream,
    bool writeKInPlace
)

Launch kernel to apply RoPE positional encoding to Q/K and write K/V to KVCache.

Parameters:
  • cosSinCache[in] FP32 type tensor with layout of [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim]

  • kvCacheEndLens[in] Optional INT32 type tensor with layout of [batchSize], the end position of KVCache after writing. When nullopt, KVCache is written from the start (prefill without prior cache).

  • q[inout] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hq, headDim]

  • k[inout] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hkv, headDim]

  • v[in] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hkv, headDim]

  • kvCache[out] FP16/FP8 type tensor with layout of [batchSize, 2, Hkv, kvCacheCapacity, headDim]

  • kvScaleQuantOrig[in] FP32 type tensor with layout of [2] for FP8 KV cache quantization scales. Empty for FP16.

  • stream[in] CUDA stream to launch the kernel

  • writeKInPlace[in] Controls whether the roped K is also written back to the K tensor in place; K is always written to kvCache regardless. Set to true for the initial prefill path (SEPARATE_Q_K_V), where the downstream FMHA kernel reads Q, K, and V as separate contiguous tensors rather than from the KV cache, so K must contain the roped result. Set to false (default) for chunked prefill with KV cache reuse, where FMHA reads K/V from the transposed KV cache, and for all decoding paths (vanilla/tree), where the XQA kernel reads K/V from the cache.

Throws:

std::runtime_error – if tensor shape or data type is incorrect

void trt_edgellm::kernel::launchApplyRopeWriteKVTreeDecoding(
    rt::Tensor const &cosSinCache,
    rt::Tensor const &kvCacheEndLens,
    rt::Tensor const &tokenPosIds,
    rt::Tensor &q,
    rt::Tensor &k,
    rt::Tensor const &v,
    rt::Tensor &kvCache,
    rt::Tensor const &kvScaleQuantOrig,
    cudaStream_t stream
)

Launch the kernel variant used when performing tree attention for speculative decoding.

Note

The K/V tensors are not overwritten in this case; the non-const Tensor& signature is kept only to avoid duplicating code.

Parameters:
  • cosSinCache[in] FP32 type tensor with layout of [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim]

  • kvCacheEndLens[in] INT32 type tensor with layout of [batchSize], the end position of KVCache after writing.

  • tokenPosIds[in] INT32 type tensor with layout of [batchSize, runtimeSeqLen], the position of token within sequence.

  • q[inout] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hq, headDim]

  • k[in] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hkv, headDim]

  • v[in] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hkv, headDim]

  • kvCache[out] FP16/FP8 type tensor with layout of [batchSize, 2, Hkv, kvCacheCapacity, headDim]. The new tokens are written so that the cache ends at the position given by kvCacheEndLens.

  • kvScaleQuantOrig[in] FP32 type tensor with layout of [2] for FP8 KV cache quantization scales. Empty for FP16.

  • stream[in] CUDA stream to launch the kernel

Throws:

std::runtime_error – if tensor shape or data type is incorrect

void trt_edgellm::kernel::launchApplyRopeWriteKVSplitQKV(
    rt::Tensor const &cosSinCache,
    rt::Tensor const &kvCacheEndLens,
    rt::Tensor &q,
    rt::Tensor const &k,
    rt::Tensor const &v,
    rt::Tensor &kvCache,
    rt::Tensor const &kvScaleQuantOrig,
    cudaStream_t stream
)

Launch kernel to apply RoPE to Q (in-place) and to K, writing roped K and V to KVCache.

Optimized for the CuTe DSL FMHA path: applies RoPE to Q in-place, writes roped K and V into KV cache [B, 2, H_kv, S, D]. Does NOT write roped K back to the K input tensor. The downstream FMHA kernel reads Q from the Q tensor and K/V from the KV cache directly.

Parameters:
  • cosSinCache[in] FP32 type tensor with layout of [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim]

  • kvCacheEndLens[in] INT32 type tensor with layout of [batchSize], the end position of KVCache after writing.

  • q[inout] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hq, headDim]. RoPE applied in-place.

  • k[in] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hkv, headDim]

  • v[in] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hkv, headDim]

  • kvCache[out] FP16/FP8 type tensor with layout of [batchSize, 2, Hkv, kvCacheCapacity, headDim]

  • kvScaleQuantOrig[in] FP32 type tensor with layout of [2] for FP8 KV cache quantization scales. Empty for FP16.

  • stream[in] CUDA stream to launch the kernel