NvFP4MoEFC2FinalizeRunner#
class NvFP4MoEFC2FinalizeRunner#
Public Functions
NvFP4MoEFC2FinalizeRunner(
    int32_t numLocalExperts,
    int32_t topK,
    int32_t n,
    int32_t k,
    OutputDType outDtype = OutputDType::kBF16
)#

Parameters:
numLocalExperts – Number of local experts (L)
topK – Top-K routing factor (number of experts each token is routed to)
n – Hidden size (N — FC2 output dimension)
k – Intermediate size (K — FC2 input dimension)
outDtype – Output element type (selects the AOT variant).
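The constructor parameters fix the buffer shapes that the runner expects. As a minimal sketch of how those shapes translate into device allocation sizes, the helper below computes byte counts from the documented layouts, assuming `float4_e2m1fn_x2` packs two FP4 values per byte and `bfloat16` is 2 bytes; the struct and function names are illustrative, not part of the API.

```cpp
#include <cstdint>

// Illustrative only: byte sizes for the runner's device buffers, derived
// from the shapes documented in this reference (not a library API).
struct Fc2BufferSizes {
    int64_t inputFP4Bytes;   // [permutedM, K/2] packed FP4
    int64_t weightBytes;     // [L, K, N/2]   packed FP4
    int64_t outputBytes;     // [numTokens, N] bfloat16 (2 bytes each)
    int64_t alphaBytes;      // [L] float32 per-expert scales
    int64_t finalScaleBytes; // [numTokens, topK] float32 router weights
};

Fc2BufferSizes computeSizes(int32_t numLocalExperts, int32_t topK,
                            int32_t n, int32_t k,
                            int64_t permutedM, int64_t numTokens) {
    Fc2BufferSizes s;
    s.inputFP4Bytes   = permutedM * (k / 2);                   // 2 FP4 nibbles per byte
    s.weightBytes     = int64_t(numLocalExperts) * k * (n / 2); // N-major packed layout
    s.outputBytes     = numTokens * int64_t(n) * 2;            // bfloat16
    s.alphaBytes      = int64_t(numLocalExperts) * 4;          // float32
    s.finalScaleBytes = numTokens * int64_t(topK) * 4;         // float32
    return s;
}
```

For example, with 8 local experts, top-2 routing, N=4096, K=14336, 256 permuted rows, and 128 tokens, the packed weight tensor alone occupies `8 * 14336 * 2048` bytes.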
void run(
    void const *inputFP4,
    void const *weight,
    void const *inputSF,
    void const *weightSF,
    void *output,
    void const *alpha,
    MoELayout const &layout,
    void const *tokenFinalScales,
    int64_t permutedM,
    int64_t numTokens,
    cudaStream_t stream
)#
Run the FC2 finalize kernel (grouped GEMM + scatter-reduce).
Parameters:
inputFP4 – [permutedM, K/2] float4_e2m1fn_x2 on device
weight – [L, K, N/2] float4_e2m1fn_x2 on device (3D stacked, N-major byte layout — N axis innermost, 2 FP4 nibbles per byte along N = hidden_size). Matches the plugin v5 fc_down_qweights shape and the Marlin decode layout.
inputSF – atom-layout SF buffer on device (input A scales)
weightSF – atom-layout SF buffer on device (weight B scales, prefill-friendly M=N=H, K=I/16 — unchanged from v4)
output – [numTokens, N] bfloat16 on device (pre-zeroed, scatter target)
alpha – [L] float32 on device (per-expert scaling)
layout – MoE layout (tile metadata + permutation indices)
tokenFinalScales – [numTokens, topK] float32 on device (router weights)
permutedM – Total number of permuted rows (across all local experts)
numTokens – Number of original (un-permuted) tokens
stream – CUDA stream
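The scatter-reduce half of the kernel can be sketched as a host reference: each permuted GEMM output row belongs to one (token, top-k slot) pair and is accumulated into that token's output row, weighted by its router scale. This is why `output` must be pre-zeroed. The real kernel operates on packed FP4 inputs and bfloat16 outputs through an opaque `MoELayout`; the float32 sketch below, with hypothetical flattened `rowToToken`/`rowScale` arrays standing in for the layout's permutation metadata and `tokenFinalScales`, only illustrates the reduction semantics.

```cpp
#include <cstdint>
#include <vector>

// Host reference for the finalize scatter-reduce (illustrative, not the
// library API): row r of the grouped-GEMM output maps back to token
// rowToToken[r] and is scatter-added with weight rowScale[r].
void finalizeReference(const std::vector<float>& gemmOut,      // [permutedM, n]
                       const std::vector<int32_t>& rowToToken, // [permutedM]
                       const std::vector<float>& rowScale,     // [permutedM]
                       int64_t n,
                       std::vector<float>& output) {           // [numTokens, n], pre-zeroed
    const int64_t permutedM = static_cast<int64_t>(rowToToken.size());
    for (int64_t r = 0; r < permutedM; ++r) {
        float* dst = &output[int64_t(rowToToken[r]) * n];
        const float* src = &gemmOut[r * n];
        for (int64_t j = 0; j < n; ++j)
            dst[j] += rowScale[r] * src[j]; // weighted scatter-add into the token row
    }
}
```

For top-2 routing, each token receives exactly two weighted rows; the pre-zeroed output plus `+=` makes the reduction order-independent up to floating-point rounding.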