MoE TopK Softmax Kernels
- size_t trt_edgellm::kernel::getMoeTopkSoftmaxWorkspaceSize(
- int32_t numTokens,
- int32_t numExperts)
Calculate workspace size required for MoE TopK Softmax kernel.
The workspace is only needed when numExperts is not a power of 2 in the range [1, 256]; in that case a fallback path using separate softmax and top-k kernels is used.
- Parameters:
numTokens – Number of tokens to process
numExperts – Number of experts in the MoE layer
- Returns:
Required workspace size in bytes (0 if optimized path is used)
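The dispatch condition behind the workspace requirement can be sketched as follows. This is an illustrative Python predicate mirroring the documented rule (fused path for power-of-2 expert counts in [1, 256], fallback otherwise), not the library's actual size computation:

```python
def needs_workspace(num_experts: int) -> bool:
    """Sketch of the documented rule: the fused kernel covers expert
    counts that are powers of 2 in [1, 256]; any other count takes the
    fallback path, for which getMoeTopkSoftmaxWorkspaceSize returns a
    non-zero size."""
    is_pow2 = num_experts > 0 and (num_experts & (num_experts - 1)) == 0
    return not (is_pow2 and 1 <= num_experts <= 256)

# Powers of 2 up to 256 use the fused path: no workspace needed.
assert not needs_workspace(64)
assert not needs_workspace(256)
# Non-power-of-2 counts (or counts above 256) require a workspace.
assert needs_workspace(96)
assert needs_workspace(512)
```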
- void trt_edgellm::kernel::moeTopkSoftmax(
- rt::Tensor const &gatingOutput,
- rt::Tensor &topkWeights,
- rt::Tensor &topkIndices,
- int32_t topk,
- void *workspace,
- size_t workspaceSize,
- cudaStream_t stream,
- bool renormalize = false,
- float moeSoftcapping = 0.0f,
- rt::OptionalInputTensor correctionBias = std::nullopt)
MoE TopK Softmax kernel for Mixture of Experts gating.
This kernel implements the gating mechanism for MoE layers:
- Takes gating logits of shape [numTokens, numExperts]
- Applies optional tanh softcapping to limit the logit range
- Applies optional correction bias for expert load balancing
- Computes softmax over the expert dimension
- Selects the top-k experts with the highest probabilities
- Optionally renormalizes the selected weights to sum to 1
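The steps above can be sketched in pure Python for a single token's gating row. This is a reference for the documented math, not the CUDA implementation; the step ordering follows the list above:

```python
import math

def moe_topk_softmax_ref(logits, topk, renormalize=False,
                         softcapping=0.0, correction_bias=None):
    """Reference sketch of the documented gating pipeline for one row."""
    # Optional tanh softcapping: val = tanh(val / cap) * cap
    if softcapping > 0.0:
        logits = [math.tanh(v / softcapping) * softcapping for v in logits]
    # Optional per-expert correction bias for load balancing
    if correction_bias is not None:
        logits = [v + b for v, b in zip(logits, correction_bias)]
    # Softmax over the expert dimension (max-subtracted for stability)
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Select the top-k experts by probability
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    indices = order[:topk]
    weights = [probs[i] for i in indices]
    # Optionally renormalize the selected weights to sum to 1
    if renormalize:
        t = sum(weights)
        weights = [w / t for w in weights]
    return weights, indices

weights, indices = moe_topk_softmax_ref([1.0, 3.0, 2.0, 0.0],
                                        topk=2, renormalize=True)
assert indices == [1, 2]                  # two highest-probability experts
assert abs(sum(weights) - 1.0) < 1e-9     # renormalized to sum to 1
```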
Algorithm:
- For power-of-2 expert counts (1-256): uses the optimized fused kernel with warp-level parallelism
- For other expert counts: falls back to separate softmax + top-k kernels
Optimizations:
- Fused softmax and top-k selection in a single kernel pass
- Warp-level butterfly reductions (no shared memory needed)
- Vectorized memory loads for better bandwidth utilization
- Multiple rows processed per warp for high occupancy
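The butterfly reduction mentioned above can be simulated outside CUDA. The sketch below models a shuffle-XOR max reduction over a (power-of-2 sized) group of lanes: each round pairs lanes whose indices differ in one bit, exchanging only register values (as `__shfl_xor_sync` would), so after log2(width) rounds every lane holds the group-wide maximum with no shared memory involved:

```python
def butterfly_max(lanes):
    """Simulate a warp-level butterfly (shuffle-XOR) max reduction.
    Assumes a power-of-2 lane count (e.g. 32 for a full warp)."""
    width = len(lanes)
    vals = list(lanes)
    offset = width // 2
    while offset > 0:
        # Each lane i exchanges with lane i ^ offset and keeps the max.
        vals = [max(vals[i], vals[i ^ offset]) for i in range(width)]
        offset //= 2
    return vals

result = butterfly_max([3, 7, 1, 9, 4, 2, 8, 5])
assert all(v == 9 for v in result)   # every lane ends with the maximum
```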
Note
All tensor parameters must be allocated on GPU device
Note
Workspace is only required when numExperts is not a power of 2 in range [1, 256]
Note
Use getMoeTopkSoftmaxWorkspaceSize() to determine required workspace size
- Parameters:
gatingOutput – Input gating logits [numTokens, numExperts] (FP32/FP16/BF16, GPU)
topkWeights – Output selected expert weights [numTokens, topk] (FP32, GPU)
topkIndices – Output selected expert indices [numTokens, topk] (INT32, GPU)
topk – Number of experts to select per token
workspace – Workspace buffer for fallback path (can be nullptr if not needed)
workspaceSize – Size of workspace buffer in bytes
stream – CUDA stream for execution
renormalize – Whether to renormalize topk weights to sum to 1 (default: false)
moeSoftcapping – Softcapping value (0.0 to disable): val = tanh(val/cap) * cap (default: 0.0)
correctionBias – Optional bias tensor [numExperts] for expert load balancing (FP32, GPU)
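The softcapping formula documented for moeSoftcapping bounds every logit smoothly to the open interval (-cap, cap); small values pass through almost unchanged while large ones saturate. A quick numerical illustration (the values here are arbitrary):

```python
import math

def softcap(val, cap):
    # val = tanh(val / cap) * cap: smooth, monotonic, bounded by +/- cap
    return math.tanh(val / cap) * cap

# Small logits are nearly identity...
assert abs(softcap(0.5, 30.0) - 0.5) < 1e-2
# ...while large logits are squashed to just under the cap.
assert 29.0 < softcap(100.0, 30.0) < 30.0
```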