Moe Topk Softmax Kernels#

size_t trt_edgellm::kernel::getMoeTopkSoftmaxWorkspaceSize(
int32_t numTokens,
int32_t numExperts
)#

Calculates the workspace size required by the MoE TopK Softmax kernel.

The workspace is only needed when numExperts is not a power of two in the range [1, 256]; in that case a fallback path using separate softmax and topk kernels is used.

Parameters:
  • numTokens – Number of tokens to process

  • numExperts – Number of experts in the MoE layer

Returns:

Required workspace size in bytes (0 when the optimized fused path is used)

void trt_edgellm::kernel::moeTopkSoftmax(
rt::Tensor const &gatingOutput,
rt::Tensor &topkWeights,
rt::Tensor &topkIndices,
int32_t topk,
void *workspace,
size_t workspaceSize,
cudaStream_t stream,
bool renormalize = false,
float moeSoftcapping = 0.0f,
rt::OptionalInputTensor correctionBias = std::nullopt
)#

MoE TopK Softmax kernel for Mixture of Experts gating.

This kernel implements the gating mechanism for MoE layers:

  1. Takes gating logits of shape [numTokens, numExperts]

  2. Applies optional tanh softcapping to limit logit range

  3. Applies optional correction bias for expert load balancing

  4. Computes softmax over the expert dimension

  5. Selects top-k experts with highest probabilities

  6. Optionally renormalizes the selected weights to sum to 1
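The six steps above can be sketched as a CPU reference for a single token. This is illustrative only: the real kernel runs fused on the GPU, and the function name and the exact point at which the correction bias is applied (here, added to the logits before the softmax, following the step order listed above) are assumptions drawn from this description.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// CPU reference of the gating steps for one token (illustrative sketch,
// not the actual kernel). logits has numExperts entries.
void moeGateReference(const std::vector<float>& logits, int topk,
                      bool renormalize, float softcapping,
                      const std::vector<float>* correctionBias,
                      std::vector<float>& weights,   // out: [topk]
                      std::vector<int>& indices) {   // out: [topk]
    std::vector<float> scores(logits);
    // Step 2: optional tanh softcapping: val = tanh(val / cap) * cap
    if (softcapping > 0.0f)
        for (float& v : scores) v = std::tanh(v / softcapping) * softcapping;
    // Step 3: optional correction bias for expert load balancing
    if (correctionBias)
        for (size_t i = 0; i < scores.size(); ++i) scores[i] += (*correctionBias)[i];
    // Step 4: numerically stable softmax over the expert dimension
    float maxv = *std::max_element(scores.begin(), scores.end());
    float sum = 0.0f;
    for (float& v : scores) { v = std::exp(v - maxv); sum += v; }
    for (float& v : scores) v /= sum;
    // Step 5: select the top-k experts with the highest probabilities
    std::vector<int> order(scores.size());
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + topk, order.end(),
                      [&](int a, int b) { return scores[a] > scores[b]; });
    indices.assign(order.begin(), order.begin() + topk);
    weights.resize(topk);
    for (int k = 0; k < topk; ++k) weights[k] = scores[indices[k]];
    // Step 6: optionally renormalize the selected weights to sum to 1
    if (renormalize) {
        float s = std::accumulate(weights.begin(), weights.end(), 0.0f);
        for (float& w : weights) w /= s;
    }
}
```

For example, with logits {1, 3, 2, 0}, topk = 2, and renormalize = true, experts 1 and 2 are selected and their softmax weights are rescaled to sum to 1.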

Algorithm:

  • For power-of-two expert counts in [1, 256]: Uses an optimized fused kernel with warp-level parallelism

  • For other expert counts: Falls back to separate softmax + topk kernels
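The dispatch rule above can be expressed as a predicate. This is a hypothetical illustration of the documented condition, not a function from the library:

```cpp
#include <cassert>
#include <cstdint>

// Illustrates the dispatch condition described above (assumed, not the
// library's actual code): the fused kernel covers power-of-two expert
// counts in [1, 256]; every other count takes the softmax + topk
// fallback and therefore needs a workspace.
bool usesFusedPath(int32_t numExperts) {
    bool powerOfTwo = numExperts > 0 && (numExperts & (numExperts - 1)) == 0;
    return powerOfTwo && numExperts <= 256;
}
```

Under this rule, 64 experts would take the fused path while 96 or 512 experts would require the fallback and a nonzero workspace.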

Optimizations:

  • Fused softmax and top-k selection in a single kernel pass

  • Warp-level butterfly reductions (no shared memory needed)

  • Vectorized memory loads for better bandwidth utilization

  • Multiple rows processed per warp for high occupancy
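The butterfly reduction mentioned above can be simulated on the host. On the GPU, each lane would exchange values with `__shfl_xor_sync` instead of array indexing; this sketch only shows the XOR-pairing pattern, assuming a 32-lane warp and a max reduction:

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Host-side simulation of a warp-level butterfly max reduction
// (illustrative; the kernel's actual reduction may differ). After
// log2(32) = 5 XOR-pairing steps, every lane holds the warp-wide
// maximum, with no shared memory involved.
std::array<float, 32> butterflyMax(std::array<float, 32> lanes) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        std::array<float, 32> next = lanes;
        for (int lane = 0; lane < 32; ++lane)
            next[lane] = std::max(lanes[lane], lanes[lane ^ offset]);
        lanes = next;
    }
    return lanes;
}
```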

Note

All tensor parameters must be allocated in GPU device memory

Note

Workspace is only required when numExperts is not a power of 2 in range [1, 256]

Note

Use getMoeTopkSoftmaxWorkspaceSize() to determine required workspace size

Parameters:
  • gatingOutput – Input gating logits [numTokens, numExperts] (FP32/FP16/BF16, GPU)

  • topkWeights – Output selected expert weights [numTokens, topk] (FP32, GPU)

  • topkIndices – Output selected expert indices [numTokens, topk] (INT32, GPU)

  • topk – Number of experts to select per token

  • workspace – Workspace buffer for fallback path (can be nullptr if not needed)

  • workspaceSize – Size of workspace buffer in bytes

  • stream – CUDA stream for execution

  • renormalize – Whether to renormalize topk weights to sum to 1 (default: false)

  • moeSoftcapping – Softcapping value (0.0 to disable): val = tanh(val/cap) * cap (default: 0.0)

  • correctionBias – Optional bias tensor [numExperts] for expert load balancing (FP32, GPU)