NvFP4MoEFC2FinalizeRunner#
class NvFP4MoEFC2FinalizeRunner#
Public Functions
NvFP4MoEFC2FinalizeRunner(
    int32_t numLocalExperts,
    int32_t topK,
    int32_t n,
    int32_t k,
    OutputDType outDtype = OutputDType::kBF16
)#

Parameters:
numLocalExperts – Number of local experts (L)
topK – Top-K routing factor (number of experts each token is routed to)
n – Hidden size (N — FC2 output dimension)
k – Intermediate size (K — FC2 input dimension)
outDtype – Output element type (selects the AOT variant).
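The constructor parameters fix the buffer shapes that the runner expects. As a minimal sketch of how those shapes translate into device allocation sizes, the helper below computes byte counts from the documented layouts, assuming `float4_e2m1fn_x2` packs two FP4 values per byte and `bfloat16` is 2 bytes; the struct and function names are illustrative, not part of the API.

```cpp
#include <cstdint>

// Illustrative only: byte sizes for the runner's device buffers, derived
// from the shapes documented in this reference (not a library API).
struct Fc2BufferSizes {
    int64_t inputFP4Bytes;   // [permutedM, K/2] packed FP4
    int64_t weightBytes;     // [L, K, N/2]   packed FP4
    int64_t outputBytes;     // [numTokens, N] bfloat16 (2 bytes each)
    int64_t alphaBytes;      // [L] float32 per-expert scales
    int64_t finalScaleBytes; // [numTokens, topK] float32 router weights
};

Fc2BufferSizes computeSizes(int32_t numLocalExperts, int32_t topK,
                            int32_t n, int32_t k,
                            int64_t permutedM, int64_t numTokens) {
    Fc2BufferSizes s;
    s.inputFP4Bytes   = permutedM * (k / 2);                   // 2 FP4 nibbles per byte
    s.weightBytes     = int64_t(numLocalExperts) * k * (n / 2); // N-major packed layout
    s.outputBytes     = numTokens * int64_t(n) * 2;            // bfloat16
    s.alphaBytes      = int64_t(numLocalExperts) * 4;          // float32
    s.finalScaleBytes = numTokens * int64_t(topK) * 4;         // float32
    return s;
}
```

For example, with 8 local experts, top-2 routing, N=4096, K=14336, 256 permuted rows, and 128 tokens, the packed weight tensor alone occupies `8 * 14336 * 2048` bytes.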
void run(
    void const *inputFP4,
    void const *weight,
    void const *inputSF,
    void const *weightSF,
    void *output,
    void const *alpha,
    MoELayout const &layout,
    void const *tokenFinalScales,
    int64_t permutedM,
    int64_t numTokens,
    cudaStream_t stream
)#
Run the FC2 finalize kernel (grouped GEMM + scatter-reduce).
Parameters:
inputFP4 – [permutedM, K/2] float4_e2m1fn_x2 on device
weight – [L, K, N/2] float4_e2m1fn_x2 on device (3D stacked, N-major byte layout — N axis innermost, 2 FP4 nibbles per byte along N = hidden_size). Matches the plugin v5 fc_down_qweights shape and the Marlin decode layout.
inputSF – atom-layout SF buffer on device (input A scales)
weightSF – atom-layout SF buffer on device (weight B scales, prefill-friendly M=N=H, K=I/16 — unchanged from v4)
output – [numTokens, N] bfloat16 on device (pre-zeroed, scatter target)
alpha – [L] float32 on device (per-expert scaling)
layout – MoE layout (tile metadata + permutation indices)
tokenFinalScales – [numTokens, topK] float32 on device (router weights)
permutedM – Total number of permuted rows (across all local experts)
numTokens – Number of original (un-permuted) tokens
stream – CUDA stream
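The scatter-reduce half of the kernel can be sketched as a host reference: each permuted GEMM output row belongs to one (token, top-k slot) pair and is accumulated into that token's output row, weighted by its router scale. This is why `output` must be pre-zeroed. The real kernel operates on packed FP4 inputs and bfloat16 outputs through an opaque `MoELayout`; the float32 sketch below, with hypothetical flattened `rowToToken`/`rowScale` arrays standing in for the layout's permutation metadata and `tokenFinalScales`, only illustrates the reduction semantics.

```cpp
#include <cstdint>
#include <vector>

// Host reference for the finalize scatter-reduce (illustrative, not the
// library API): row r of the grouped-GEMM output maps back to token
// rowToToken[r] and is scatter-added with weight rowScale[r].
void finalizeReference(const std::vector<float>& gemmOut,      // [permutedM, n]
                       const std::vector<int32_t>& rowToToken, // [permutedM]
                       const std::vector<float>& rowScale,     // [permutedM]
                       int64_t n,
                       std::vector<float>& output) {           // [numTokens, n], pre-zeroed
    const int64_t permutedM = static_cast<int64_t>(rowToToken.size());
    for (int64_t r = 0; r < permutedM; ++r) {
        float* dst = &output[int64_t(rowToToken[r]) * n];
        const float* src = &gemmOut[r * n];
        for (int64_t j = 0; j < n; ++j)
            dst[j] += rowScale[r] * src[j]; // weighted scatter-add into the token row
    }
}
```

For top-2 routing, each token receives exactly two weighted rows; the pre-zeroed output plus `+=` makes the reduction order-independent up to floating-point rounding.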