MoE Sigmoid Group TopK Kernels

void trt_edgellm::kernel::moeSigmoidGroupTopk(
rt::Tensor const &gatingOutput,
rt::Tensor &topkWeights,
rt::Tensor &topkIndices,
int32_t topK,
int32_t nGroup,
int32_t topkGroup,
bool normTopkProb,
float routedScalingFactor,
cudaStream_t stream,
rt::OptionalInputTensor correctionBias = std::nullopt
)

MoE Sigmoid Group TopK kernel implementing HuggingFace NemotronH routing.

This kernel implements the grouped top-k routing algorithm from NemotronHMoE:

  1. Applies sigmoid to router logits: scores = sigmoid(logits)

  2. Adds optional correction bias: biased = scores + bias

  3. Partitions the experts into nGroup groups and sums the top-2 biased scores in each group -> groupScores

  4. Selects topkGroup groups with highest groupScores

  5. Masks experts NOT in selected groups

  6. Selects topK experts from masked biased scores

  7. Gathers weights from ORIGINAL sigmoid scores (not biased)

  8. Optionally renormalizes weights to sum to 1

  9. Scales weights by routedScalingFactor
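
The nine steps above can be sketched for a single token as a plain CPU reference. This is only an illustration of the described algorithm, not the kernel's implementation: `RoutingResult`, `topkPositions`, and `referenceSigmoidGroupTopk` are hypothetical names, equal-sized contiguous groups are assumed (numExperts divisible by nGroup), an empty `bias` vector stands in for an absent correctionBias, and masked experts are set to negative infinity, which the source does not specify.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <functional>
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical CPU reference for one token; not part of trt_edgellm.
struct RoutingResult
{
    std::vector<int32_t> indices; // selected expert ids, highest biased score first
    std::vector<float> weights;   // matching routing weights
};

// Returns the positions of the k largest entries of `values`, largest first.
static std::vector<int32_t> topkPositions(std::vector<float> const& values, int32_t k)
{
    std::vector<int32_t> order(values.size());
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + k, order.end(),
        [&](int32_t a, int32_t b) { return values[a] > values[b]; });
    order.resize(k);
    return order;
}

RoutingResult referenceSigmoidGroupTopk(std::vector<float> const& logits, std::vector<float> const& bias,
    int32_t topK, int32_t nGroup, int32_t topkGroup, bool normTopkProb, float routedScalingFactor)
{
    int32_t const numExperts = static_cast<int32_t>(logits.size());
    int32_t const groupSize = numExperts / nGroup; // assumption: numExperts % nGroup == 0

    // Steps 1-2: sigmoid scores, plus the bias used only for selection.
    std::vector<float> scores(numExperts), biased(numExperts);
    for (int32_t e = 0; e < numExperts; ++e)
    {
        scores[e] = 1.0f / (1.0f + std::exp(-logits[e]));
        biased[e] = scores[e] + (bias.empty() ? 0.0f : bias[e]);
    }

    // Step 3: group score = sum of the two largest biased scores in the group.
    std::vector<float> groupScores(nGroup);
    for (int32_t g = 0; g < nGroup; ++g)
    {
        std::vector<float> members(biased.begin() + g * groupSize, biased.begin() + (g + 1) * groupSize);
        std::sort(members.begin(), members.end(), std::greater<float>());
        groupScores[g] = members[0] + (groupSize > 1 ? members[1] : 0.0f);
    }

    // Steps 4-5: keep biased scores only for experts in the topkGroup best groups.
    // Assumption: masked experts get -infinity so they can never be selected.
    std::vector<float> masked(numExperts, -std::numeric_limits<float>::infinity());
    for (int32_t g : topkPositions(groupScores, topkGroup))
    {
        std::copy(biased.begin() + g * groupSize, biased.begin() + (g + 1) * groupSize,
            masked.begin() + g * groupSize);
    }

    // Steps 6-7: choose experts by masked biased score, read weights from unbiased scores.
    RoutingResult result;
    result.indices = topkPositions(masked, topK);
    float sum = 0.0f;
    for (int32_t e : result.indices)
    {
        result.weights.push_back(scores[e]);
        sum += scores[e];
    }

    // Steps 8-9: optional renormalization, then scaling.
    for (float& w : result.weights)
    {
        if (normTopkProb)
        {
            w /= sum;
        }
        w *= routedScalingFactor;
    }
    return result;
}
```

A reference like this is mainly useful for checking kernel output on small inputs. Note how the bias influences which experts are chosen but never the returned weights, which come from the unbiased sigmoid scores.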

Parameters:
  • gatingOutput – Input router logits [numTokens, numExperts] (FP32, GPU)

  • topkWeights – Output selected expert weights [numTokens, topK] (FP32, GPU)

  • topkIndices – Output selected expert indices [numTokens, topK] (INT32, GPU)

  • topK – Number of experts to select per token

  • nGroup – Number of expert groups

  • topkGroup – Number of groups to select

  • normTopkProb – Whether to renormalize topK weights to sum to 1

  • routedScalingFactor – Scaling factor applied to final weights

  • stream – CUDA stream for execution

  • correctionBias – Optional bias tensor [numExperts] for expert load balancing (FP32, GPU)
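
The parameters above constrain one another if the routing steps are to be well defined. The following host-side check is a minimal sketch of those relationships; `checkRoutingConfig` is a hypothetical helper, not part of the library, and the constraints are assumptions inferred from the algorithm description (equal-sized groups, enough unmasked experts to fill topK) rather than documented requirements of the kernel.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not part of trt_edgellm.
// Returns an empty string when the configuration looks consistent,
// otherwise a short description of the first problem found.
std::string checkRoutingConfig(int32_t numExperts, int32_t topK, int32_t nGroup, int32_t topkGroup)
{
    if (numExperts <= 0 || topK <= 0 || nGroup <= 0 || topkGroup <= 0)
    {
        return "all sizes must be positive";
    }
    // Assumption: groups are equal-sized, so nGroup must divide numExperts.
    if (numExperts % nGroup != 0)
    {
        return "numExperts must be divisible by nGroup";
    }
    if (topkGroup > nGroup)
    {
        return "topkGroup must not exceed nGroup";
    }
    // Only experts in the selected groups survive masking.
    int32_t const candidates = topkGroup * (numExperts / nGroup);
    if (topK > candidates)
    {
        return "topK exceeds the experts left after group masking";
    }
    return "";
}
```

Running a check of this kind before launching the kernel turns a silent mis-routing into an early, readable error; whether the kernel itself validates these conditions is not stated in the source.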