Apply RoPE Write KV

void trt_edgellm::kernel::launchApplyRopeWriteKV(
    rt::Tensor const &cosSinCache,
    rt::OptionalInputTensor kvCacheEndLens,
    rt::Tensor &q,
    rt::Tensor &k,
    rt::Tensor const &v,
    rt::Tensor &kvCache,
    float kScale,
    float vScale,
    cudaStream_t stream,
    bool writeKInPlace
)

Launch kernel to apply RoPE positional encoding to Q/K and write K/V to KVCache.

Parameters:
  • cosSinCache[in] FP32 type tensor with layout of [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim]

  • kvCacheEndLens[in] Optional INT32 type tensor with layout of [batchSize], the end position of KVCache after writing. When nullopt, KVCache is written from the start (prefill without prior cache).

  • q[inout] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hq, headDim]

  • k[inout] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hkv, headDim]

  • v[in] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hkv, headDim]

  • kvCache[out] FP16/FP8 type tensor with layout of [batchSize, 2, Hkv, kvCacheCapacity, headDim]

  • kScale[in] K dequant scale (quant→orig). Use 1.0f for FP16 KV cache.

  • vScale[in] V dequant scale (quant→orig). Use 1.0f for FP16 KV cache.

  • stream[in] CUDA stream to launch the kernel

  • writeKInPlace[in] Controls whether roped K is additionally written back to the K tensor in-place, on top of always being written to kvCache. Set to true for the initial prefill path (SEPARATE_Q_K_V) where the downstream FMHA kernel reads Q, K, V as separate contiguous tensors rather than from the KV cache. In this case K must contain the roped result. Set to false (default) for chunked prefill with KV cache reuse, where FMHA reads KV from the transposed KV cache, and for all decoding paths (vanilla / tree), where the XQA kernel reads KV from the cache.

Throws:

std::runtime_error – if tensor shape or data type is incorrect

void trt_edgellm::kernel::launchApplyRopeWriteKVTreeDecoding(
    rt::Tensor const &cosSinCache,
    rt::Tensor const &kvCacheEndLens,
    rt::Tensor const &tokenPosIds,
    rt::Tensor &q,
    rt::Tensor &k,
    rt::Tensor const &v,
    rt::Tensor &kvCache,
    float kScale,
    float vScale,
    cudaStream_t stream
)

Launch the kernel variant used for tree attention during speculative decoding.

Note

The K tensor is not modified in this case; the non-const Tensor & signature is kept only to share code with the other launchers.

Parameters:
  • cosSinCache[in] FP32 type tensor with layout of [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim]

  • kvCacheEndLens[in] INT32 type tensor with layout of [batchSize], the end position of KVCache after writing.

  • tokenPosIds[in] INT32 type tensor with layout of [batchSize, runtimeSeqLen], the position of token within sequence.

  • q[inout] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hq, headDim]

  • k[in] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hkv, headDim]

  • v[in] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hkv, headDim]

  • kvCache[out] FP16/FP8 type tensor with layout of [batchSize, 2, Hkv, kvCacheCapacity, headDim]; the new tokens are written so that the cache ends at the given end position.

  • kScale[in] K dequant scale (quant→orig). Use 1.0f for FP16 KV cache.

  • vScale[in] V dequant scale (quant→orig). Use 1.0f for FP16 KV cache.

  • stream[in] CUDA stream to launch the kernel

Throws:

std::runtime_error – if tensor shape or data type is incorrect

void trt_edgellm::kernel::launchApplyRopeWriteKVSplitQKV(
    rt::Tensor const &cosSinCache,
    rt::Tensor const &kvCacheEndLens,
    rt::Tensor &q,
    rt::Tensor const &k,
    rt::Tensor const &v,
    rt::Tensor &kvCache,
    float kScale,
    float vScale,
    cudaStream_t stream,
    void *fp8QOut = nullptr,
    float qScale = 1.0f
)

Launch kernel to apply RoPE to Q and K and write the roped K and V to KVCache.

Optimized for the CuTe DSL FMHA path: applies RoPE to Q, writes roped K and V into KV cache [B, 2, H_kv, S, D]. Does NOT write roped K back to the K input tensor.

When fp8QOut is non-null (FP8 KV cache path), the roped Q is quantized to FP8 and written to the provided output buffer. The original FP16 Q tensor is NOT modified. The downstream FP8 FMHA kernel reads Q from fp8QOut and K/V from the KV cache directly.

When fp8QOut is null (FP16 path), RoPE is applied to Q in-place in the FP16 Q tensor.

Parameters:
  • cosSinCache[in] FP32 type tensor with layout of [cosSinCacheBatchSize, cosSinCacheSeqLen, rotaryDim]

  • kvCacheEndLens[in] INT32 type tensor with layout of [batchSize], the end position of KVCache after writing.

  • q[inout] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hq, headDim]. RoPE applied in-place when fp8QOut is null.

  • k[in] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hkv, headDim]

  • v[in] FP16 type tensor with layout of [batchSize, runtimeSeqLen, Hkv, headDim]

  • kvCache[out] FP16/FP8 type tensor with layout of [batchSize, 2, Hkv, kvCacheCapacity, headDim]

  • kScale[in] K dequant scale (quant→orig). Use 1.0f for FP16 KV cache.

  • vScale[in] V dequant scale (quant→orig). Use 1.0f for FP16 KV cache.

  • stream[in] CUDA stream to launch the kernel

  • fp8QOut[out] Optional FP8 output buffer for roped Q [batchSize, runtimeSeqLen, Hq, headDim]. When non-null, roped Q is quantized to FP8 E4M3 and stored here. Pass nullptr for FP16 in-place RoPE.

  • qScale[in] Q dequant scale (quant→orig). Only used when fp8QOut is non-null.