NvFP4 MoE Contiguous GEMM Runner

class NvFP4MoEContiguousGemmRunner

Public Functions

NvFP4MoEContiguousGemmRunner(
int32_t numLocalExperts,
int32_t topK,
int32_t n,
int32_t k,
int32_t tileSize = 128,
Activation activation = Activation::kRelu2,
OutputDType outDtype = OutputDType::kBF16
)
Parameters:
  • numLocalExperts – Number of local experts (L)

  • topK – Routing factor

  • n – Intermediate size (N)

  • k – Hidden size (K)

  • tileSize – Tile size (default 128)

  • activation – Activation function (compiled into the AOT binary). Only Relu2 and Swiglu are compiled; Identity is not exported as it has no production use for FC1.

  • outDtype – Output element type (selects the AOT variant).
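A minimal construction sketch for illustration; the expert count and layer sizes are placeholder values, not recommendations:

    // Hypothetical FC1 configuration: 8 local experts, top-2 routing.
    NvFP4MoEContiguousGemmRunner runner(
        /*numLocalExperts=*/8,
        /*topK=*/2,
        /*n=*/14336,                 // intermediate size N
        /*k=*/4096,                  // hidden size K
        /*tileSize=*/128,
        Activation::kRelu2,          // must be a compiled variant (Relu2 or Swiglu)
        OutputDType::kBF16);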

void run(
void const *gatheredFP4,
void const *weight,
void const *gatheredSF,
void const *weightSF,
void *output,
void const *alpha,
MoELayout const &layout,
int64_t permutedM,
cudaStream_t stream
)

Run the contiguous grouped GEMM with fused alpha + activation.

Unlike the bucketed runner, this takes the layout directly; no per-group metadata construction is needed. Alpha scaling and the activation are applied inside the kernel epilogue in float32.

Parameters:
  • gatheredFP4 – [permutedM, K/2] float4_e2m1fn_x2 on device

  • weight – [L, K, N/2] float4_e2m1fn_x2 on device (3D stacked, N-major byte layout — N axis innermost, 2 FP4 nibbles per byte along N). Matches the plugin v4 fc_up_qweights shape and the Marlin decode layout.

  • gatheredSF – atom-layout SF buffer on device (input A scales)

  • weightSF – atom-layout SF buffer on device (weight B scales, prefill-friendly M=N, K=K/16 — unchanged from v3)

  • output – [permutedM, N_out] buffer on device; element type selected by the constructor's outDtype (bfloat16 by default)

  • alpha – [L] float32 per-expert scaling on device

  • layout – MoE layout (tile metadata + permutation indices)

  • permutedM – Total permuted rows

  • stream – CUDA stream
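A hedged call sketch. The d-prefixed pointers are assumed device allocations produced by the caller's quantization and permutation stage, and layout/permutedM come from the MoE dispatch step; all names here are illustrative:

    runner.run(
        dGatheredFP4,   // [permutedM, K/2] packed FP4 activations
        dWeight,        // [L, K, N/2] packed FP4 weights, N-major
        dGatheredSF,    // activation (A) scale factors, atom layout
        dWeightSF,      // weight (B) scale factors, atom layout
        dOutput,        // [permutedM, N_out], element type per outDtype
        dAlpha,         // [L] float32 per-expert alpha
        layout,         // MoELayout: tile metadata + permutation indices
        permutedM,      // total permuted rows
        stream);        // cudaStream_t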

Public Static Functions

static bool loadKernelModules()
static void unloadKernelModules()
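A sketch of the expected module lifecycle, assuming loadKernelModules() returns false on failure (for example, no AOT binary for the current architecture):

    // Load the AOT kernel modules once, before constructing any runner.
    if (!NvFP4MoEContiguousGemmRunner::loadKernelModules())
    {
        // handle load failure
    }
    // ... construct runners and issue run(...) calls ...
    NvFP4MoEContiguousGemmRunner::unloadKernelModules();  // at shutdown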