Gdn Kernel Utils#
- void trt_edgellm::launchGdnCalCuSeqLens(
- void const *context_lengths,
- void *cu_seqlens,
- int32_t batchSize,
- cudaStream_t stream
Launch the context_lengths → cu_seqlens prefix-sum kernel.
- void trt_edgellm::launchGdnL2NormQK(
- void *q,
- void *k,
- int32_t n,
- int32_t seqLen,
- int32_t h,
- int32_t headDim,
- cudaStream_t stream
L2-normalize Q and K in-place along the head dimension. Q, K: (N, seqLen, H, headDim) float16 — each token-head vector is divided by its L2 norm. Required preprocessing for the Blackwell GDN prefill kernel.
- void trt_edgellm::launchGdnStateTranspose(
- void const *src,
- void *dst,
- int32_t numBlocks,
- int32_t dim,
- cudaStream_t stream
Transpose the last two dimensions of the GDN state tensor (out-of-place). The Blackwell GDN prefill MMA produces state in V-major (d_v, d_k) order, while the sequential/decode kernels use K-major (d_k, d_v). src: (numBlocks, dim, dim) float32 — row-major 2-D blocks dst: (numBlocks, dim, dim) float32 — each block transposed numBlocks = n * hv, dim = head_dim (128).