MoE Activation Kernels#

void trt_edgellm::kernel::swiGluActivation(
    rt::Tensor const &gateUpInput,
    rt::Tensor &output,
    int64_t numTokens,
    int64_t intermediateDim,
    cudaStream_t stream
)#

Apply SwiGLU activation: silu(gate) * up.

This kernel applies the SwiGLU (Swish-Gated Linear Unit) activation function, commonly used in the feed-forward layers of models such as Qwen, Llama, and Mistral, including their MoE variants.

Given an input of shape [N, 2*D], it:

  1. Splits the input along the last dimension into gate [N, D] and up [N, D]

  2. Applies SiLU (Swish) activation to gate: silu(x) = x * sigmoid(x)

  3. Multiplies element-wise: output = silu(gate) * up
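The three steps above can be sketched as a CPU reference implementation. This is an illustrative sketch in plain C++ over float data for clarity; the actual kernel performs the same math on FP16 data on the GPU:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for SwiGLU: input is [N, 2*D] row-major, output is [N, D].
// Mirrors steps 1-3: split each row into gate/up halves, apply SiLU to the
// gate half, and multiply element-wise with the up half.
std::vector<float> swiGluReference(const std::vector<float>& gateUp,
                                   std::size_t numTokens, std::size_t dim) {
    assert(gateUp.size() == numTokens * 2 * dim);
    std::vector<float> out(numTokens * dim);
    for (std::size_t n = 0; n < numTokens; ++n) {
        const float* gate = gateUp.data() + n * 2 * dim;        // first half of the row
        const float* up   = gate + dim;                         // second half of the row
        for (std::size_t d = 0; d < dim; ++d) {
            float silu = gate[d] / (1.0f + std::exp(-gate[d])); // silu(x) = x * sigmoid(x)
            out[n * dim + d] = silu * up[d];                    // element-wise product
        }
    }
    return out;
}
```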

Optimizations:

  • Vectorized 128-bit loads/stores (8 FP16 elements per transaction) for better memory bandwidth utilization

  • Fused split, activation, and multiplication in a single pass

  • No intermediate storage required
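The fused single-pass structure hinges on index math that maps each 8-element (128-bit) output chunk to its gate and up chunks in the fused input. The helper below is a hypothetical CPU-side illustration of that mapping (the kernel's actual thread-to-chunk assignment is not shown in this documentation); it also shows why intermediateDim must divide evenly into 8-element chunks:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical illustration of the chunk index math. Each work item handles
// one 8-element FP16 chunk (128 bits) of the [numTokens, intermediateDim]
// output. Given a flat chunk index, compute the element offsets of the
// matching gate and up chunks in the [numTokens, 2*intermediateDim] input.
struct ChunkOffsets { std::size_t gate; std::size_t up; };

ChunkOffsets chunkOffsets(std::size_t chunkIdx, std::size_t intermediateDim) {
    assert(intermediateDim % 8 == 0);                   // rows must split into whole chunks
    std::size_t chunksPerRow = intermediateDim / 8;
    std::size_t row = chunkIdx / chunksPerRow;          // token index
    std::size_t col = (chunkIdx % chunksPerRow) * 8;    // element offset within the row
    std::size_t rowBase = row * 2 * intermediateDim;    // row start in the fused [gate|up] input
    return {rowBase + col, rowBase + intermediateDim + col};
}
```

Because gate and up chunks are read and the product is written in the same pass, no intermediate buffer for silu(gate) is ever materialized.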

Parameters:
  • gateUpInput – Input tensor [numTokens, 2*intermediateDim] (FP16, GPU)

  • output – Output tensor [numTokens, intermediateDim] (FP16, GPU)

  • numTokens – Number of tokens

  • intermediateDim – Intermediate dimension (size of the output's last dimension)

  • stream – CUDA stream

Throws:

std::runtime_error – If any of the following preconditions are violated:

  • gateUpInput is not 2D with shape [numTokens, 2*intermediateDim]

  • output is not 2D with shape [numTokens, intermediateDim]

  • Either tensor is not FP16 or not on GPU

  • intermediateDim is not a multiple of 8 (required for 128-bit vectorized access)

  • Data pointers are not 16-byte aligned
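The validation logic can be sketched as a standalone helper. This is a hypothetical sketch using raw sizes and pointers in place of rt::Tensor (whose shape/dtype/device accessors are not shown here), covering the alignment and divisibility checks from the list above:

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical standalone version of the documented precondition checks.
// The real function performs equivalent validation on rt::Tensor arguments
// (including dtype and device checks, omitted here).
void checkSwiGluPreconditions(const void* input, const void* output,
                              std::int64_t numTokens, std::int64_t intermediateDim) {
    if (numTokens <= 0 || intermediateDim <= 0)
        throw std::runtime_error("swiGluActivation: dimensions must be positive");
    if (intermediateDim % 8 != 0)
        throw std::runtime_error(
            "swiGluActivation: intermediateDim must be a multiple of 8 "
            "(required for 128-bit vectorized access)");
    if (reinterpret_cast<std::uintptr_t>(input) % 16 != 0 ||
        reinterpret_cast<std::uintptr_t>(output) % 16 != 0)
        throw std::runtime_error(
            "swiGluActivation: data pointers must be 16-byte aligned");
}
```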