MoE Gather

void trt_edgellm::kernel::launchMoeGather(
rt::Tensor const &srcFP4,
rt::Tensor &dstFP4,
rt::Tensor const &srcSF,
rt::Tensor &dstSF,
rt::Tensor const &permuteMap,
int32_t permutedM,
int32_t topK,
int32_t hiddenSize,
cudaStream_t stream
)

Launch the MoE gather kernel.

Permutes the packed FP4 data and the atom-layout scale factors (SF) from token order to expert-grouped order. The kernel launches one CTA of 256 threads per output row. The caller must zero dstSF before the call.

Parameters:
  • srcFP4 – Packed FP4 source data (viewed as int32_t*)

  • dstFP4 – Packed FP4 output data (viewed as int32_t*)

  • srcSF – Atom-layout SF buffer (source, viewed as int32_t*)

  • dstSF – Atom-layout SF buffer (dest, viewed as int32_t*, must be pre-zeroed)

  • permuteMap – INT32 permutation map (-1 = padding). May be sized larger than permutedM; only the first permutedM entries are read.

  • permutedM – Number of dst rows to process (must not exceed the number of rows the dst buffers can hold)

  • topK – Experts per token

  • hiddenSize – Hidden dimension K

  • stream – CUDA stream
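
The row-gather semantics described above can be illustrated with a small CPU reference for the FP4 payload only. This is a sketch under stated assumptions, not the library's implementation: referenceMoeGatherFP4 is a hypothetical helper, each int32_t word is assumed to hold 8 packed FP4 values, and permuteMap[r] is assumed to index the token-major expanded list (token * topK + k), so the source row is permuteMap[r] / topK. The atom-layout SF buffer is not modeled because its layout is not specified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for the FP4 part of the gather; not part of trt_edgellm.
// ASSUMPTION: 8 FP4 values are packed into each int32_t word.
// ASSUMPTION: permuteMap[r] is an expanded index (token * topK + k), so the
//             source token row is permuteMap[r] / topK.
std::vector<int32_t> referenceMoeGatherFP4(std::vector<int32_t> const& srcFP4,
    std::vector<int32_t> const& permuteMap, int32_t permutedM, int32_t topK, int32_t hiddenSize)
{
    assert(topK > 0);
    assert(hiddenSize % 8 == 0);
    assert(static_cast<std::size_t>(permutedM) <= permuteMap.size());

    std::size_t const wordsPerRow = static_cast<std::size_t>(hiddenSize) / 8;
    // Destination starts zeroed, mirroring the pre-zero requirement on dstSF.
    std::vector<int32_t> dstFP4(static_cast<std::size_t>(permutedM) * wordsPerRow, 0);

    // Only the first permutedM entries of permuteMap are read.
    for (int32_t r = 0; r < permutedM; ++r)
    {
        int32_t const idx = permuteMap[r];
        if (idx < 0)
        {
            continue; // -1 marks padding; this sketch leaves the row as zeros
        }
        std::size_t const srcRow = static_cast<std::size_t>(idx / topK);
        assert((srcRow + 1) * wordsPerRow <= srcFP4.size());
        for (std::size_t w = 0; w < wordsPerRow; ++w)
        {
            dstFP4[static_cast<std::size_t>(r) * wordsPerRow + w] = srcFP4[srcRow * wordsPerRow + w];
        }
    }
    return dstFP4;
}
```

With two tokens, topK = 2 and hiddenSize = 16 (two words per row), a map of {2, 0, -1, 1} would copy token 1 into row 0, token 0 into rows 1 and 3, and leave row 2 untouched as padding. A reference like this can be useful for checking the device output on small shapes, provided the index convention assumed above matches the one used by the code that builds permuteMap.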