NvFP4MoEFC2FinalizeRunner#

class NvFP4MoEFC2FinalizeRunner#

Public Functions

NvFP4MoEFC2FinalizeRunner(
int32_t numLocalExperts,
int32_t topK,
int32_t n,
int32_t k,
OutputDType outDtype = OutputDType::kBF16
)#
Parameters:
  • numLocalExperts – Number of local experts (L)

  • topK – Routing factor (number of experts each token is routed to)

  • n – Hidden size (N — FC2 output dimension)

  • k – Intermediate size (K — FC2 input dimension)

  • outDtype – Output element type (selects the AOT variant).

void run(
void const *inputFP4,
void const *weight,
void const *inputSF,
void const *weightSF,
void *output,
void const *alpha,
MoELayout const &layout,
void const *tokenFinalScales,
int64_t permutedM,
int64_t numTokens,
cudaStream_t stream
)#

Run the FC2 finalize kernel (grouped GEMM + scatter-reduce).

Parameters:
  • inputFP4 – [permutedM, K/2] float4_e2m1fn_x2 on device

  • weight – [L, K, N/2] float4_e2m1fn_x2 on device (3D stacked, N-major byte layout: the N axis is innermost, with two FP4 nibbles packed per byte along N = hidden_size). Matches the plugin v5 fc_down_qweights shape and the Marlin decode layout.

  • inputSF – atom-layout SF buffer on device (input A scales)

  • weightSF – atom-layout SF buffer on device (weight B scales, prefill-friendly M=N=H, K=I/16 — unchanged from v4)

  • output – [numTokens, N] bfloat16 on device (pre-zeroed, scatter target)

  • alpha – [L] float32 on device (per-expert scaling)

  • layout – MoE layout (tile metadata + permutation indices)

  • tokenFinalScales – [numTokens, topK] float32 on device (router weights)

  • permutedM – Total number of permuted rows (the row count of inputFP4)

  • numTokens – Number of original tokens

  • stream – CUDA stream

Public Static Functions

static bool loadKernelModules()#
static void unloadKernelModules()#