Moe Gather
- void trt_edgellm::kernel::launchMoeGather(
- rt::Tensor const &srcFP4,
- rt::Tensor &dstFP4,
- rt::Tensor const &srcSF,
- rt::Tensor &dstSF,
- rt::Tensor const &permuteMap,
- int32_t permutedM,
- int32_t topK,
- int32_t hiddenSize,
- cudaStream_t stream)
Launch the MoE gather kernel.
Permutes packed FP4 data and the atom-layout scale factors (SF) from token order to expert-grouped order. The kernel launches one CTA per output row, with 256 threads per CTA. The caller must zero dstSF before the launch.
- Parameters:
srcFP4 – Packed FP4 source data (viewed as int32_t*)
dstFP4 – Packed FP4 output data (viewed as int32_t*)
srcSF – Atom-layout SF buffer (source, viewed as int32_t*)
dstSF – Atom-layout SF buffer (dest, viewed as int32_t*, must be pre-zeroed)
permuteMap – INT32 permutation map (-1 = padding). May be sized larger than permutedM; only the first permutedM entries are read.
permutedM – Number of dst rows to process (must fit dst buffer shape)
topK – Experts per token
hiddenSize – Hidden dimension K
stream – CUDA stream
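As an illustration of the row permutation described above, the following is a minimal CPU sketch of the gather for the packed FP4 data only. It is not the library's implementation. The function name referenceGatherFP4 is hypothetical, and the sketch assumes that each permuteMap entry is an expanded index (token * topK + slot), so that the source row is the entry divided by topK; the actual kernel may interpret the map differently. The atom-layout SF buffers are omitted because their layout is not specified here. Padding rows are left as zeros in this sketch, although the source only states that dstSF must be pre-zeroed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for the FP4 row gather (not the library's code).
// ASSUMPTION: permuteMap[r] is an expanded index (token * topK + slot), so the
// source row is permuteMap[r] / topK. The real kernel may define this differently.
std::vector<int32_t> referenceGatherFP4(std::vector<int32_t> const& srcFP4,
                                        std::vector<int32_t> const& permuteMap,
                                        int32_t permutedM, int32_t topK, int32_t hiddenSize)
{
    assert(permutedM >= 0 && topK > 0 && hiddenSize % 8 == 0);
    // The map may be larger than permutedM; only the first permutedM entries are read.
    assert(static_cast<size_t>(permutedM) <= permuteMap.size());

    // FP4 values are 4 bits wide, so one int32_t word holds 8 of them.
    size_t const wordsPerRow = static_cast<size_t>(hiddenSize) / 8;

    // Zero-initialized so that padding rows stay zero (assumption of this sketch).
    std::vector<int32_t> dstFP4(static_cast<size_t>(permutedM) * wordsPerRow, 0);

    for (int32_t row = 0; row < permutedM; ++row)
    {
        int32_t const idx = permuteMap[static_cast<size_t>(row)];
        if (idx < 0)
        {
            continue; // -1 marks a padding row
        }
        size_t const srcRow = static_cast<size_t>(idx / topK);
        assert((srcRow + 1) * wordsPerRow <= srcFP4.size());

        // Copy one packed row from token order into its expert-grouped position.
        size_t const dstBase = static_cast<size_t>(row) * wordsPerRow;
        for (size_t w = 0; w < wordsPerRow; ++w)
        {
            dstFP4[dstBase + w] = srcFP4[srcRow * wordsPerRow + w];
        }
    }
    return dstFP4;
}
```

A reference like this can be useful for checking the dstFP4 output of the kernel on small inputs, provided the index convention assumed above matches the one used by the caller.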