MoE (Mixture of Experts)
Complete workflow for Mixture of Experts (MoE) models using pre-quantized INT4 or NVFP4 checkpoints.
Note: For very large NVFP4 MoE checkpoints such as Nemotron Super 120B, externalize NVFP4 MoE plugin weights during export and keep the generated safetensors file with the ONNX directory.
Note: NVFP4 MoE uses separate plugins with different FC1 weight layouts: Nvfp4MoePlugin on SM100/101/110 (default) and NvFP4MoEPluginGeforce on SM120/121. Set EDGELLM_NVFP4_MOE_TARGET=sm12x before export for SM120/SM121 deployments; re-export if you change the deployment GPU.
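The target choice in the note above can be scripted so the variable is not set by hand. This is a minimal sketch, assuming you know the deployment GPU's SM number (for example 100 or 121). The helper name nvfp4_moe_target_for and the DEPLOY_SM variable are hypothetical and not part of the toolkit. The variable is left unset for every SM other than 120/121, because sm12x is the only value documented here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the toolkit): map a deployment GPU's SM
# number to the documented value of EDGELLM_NVFP4_MOE_TARGET.
# Prints "sm12x" for SM120/SM121 and an empty string for any other SM number.
nvfp4_moe_target_for() {
    case "$1" in
        120|121) echo "sm12x" ;;
        *)       echo "" ;;
    esac
}

# Example: the deployment device is SM121 (edit to match your device).
DEPLOY_SM=121
target="$(nvfp4_moe_target_for "$DEPLOY_SM")"
if [ -n "$target" ]; then
    export EDGELLM_NVFP4_MOE_TARGET="$target"
else
    # Leave the variable unset so the export falls back to its default.
    unset EDGELLM_NVFP4_MOE_TARGET
fi
echo "EDGELLM_NVFP4_MOE_TARGET=${EDGELLM_NVFP4_MOE_TARGET:-<unset>}"
```

Run this in the same shell session as the export command so the variable is visible to it.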
Prerequisites: Complete the Installation Guide before proceeding.
Step 1: Export (x86 Host, CPU-only)
Export always runs on CPU; no GPU is required:
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen3-30B-A3B-GPTQ-Int4
mkdir -p $WORKSPACE_DIR
cd $WORKSPACE_DIR
tensorrt-edgellm-export \
  Qwen/Qwen3-30B-A3B-GPTQ-Int4 \
  $MODEL_NAME/exported
mkdir -p $MODEL_NAME/onnx
cp -a $MODEL_NAME/exported/llm/. $MODEL_NAME/onnx/
For Nemotron Super 120B NVFP4, externalize the MoE plugin weights to reduce memory pressure during the engine build:
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4
mkdir -p $WORKSPACE_DIR
cd $WORKSPACE_DIR
# For NVFP4 MoE engines on SM120/SM121, set the target before exporting:
export EDGELLM_NVFP4_MOE_TARGET=sm12x
tensorrt-edgellm-export \
  nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 \
  $MODEL_NAME/exported \
  --externalize-weights nvfp4_moe \
  --max-kv-cache-capacity 4096
mkdir -p $MODEL_NAME/onnx
cp -a $MODEL_NAME/exported/llm/. $MODEL_NAME/onnx/
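Because the externalized weights must stay with the ONNX directory, a quick check before transfer can catch an incomplete copy. This is a rough sketch, assuming the graph files end in .onnx and the externalized weights end in .safetensors. The function name check_onnx_dir is hypothetical, and the actual file names depend on the exporter.

```shell
#!/usr/bin/env bash
# Hypothetical sanity check (not part of the toolkit).
# Returns 0 only if the directory holds at least one *.onnx file and at
# least one *.safetensors file (searched recursively).
check_onnx_dir() {
    dir="$1"
    [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
    # tr strips the padding some wc implementations add around the count.
    onnx_count=$(find "$dir" -type f -name '*.onnx' | wc -l | tr -d ' ')
    st_count=$(find "$dir" -type f -name '*.safetensors' | wc -l | tr -d ' ')
    echo "onnx files: $onnx_count, safetensors files: $st_count"
    [ "$onnx_count" -gt 0 ] && [ "$st_count" -gt 0 ]
}

# Usage:
#   check_onnx_dir "$WORKSPACE_DIR/$MODEL_NAME/onnx" || echo "incomplete export" >&2
```

The safetensors check only applies to exports made with --externalize-weights; for other exports, drop that condition.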
Step 2: Transfer to Device
# Transfer ONNX to device
scp -r $MODEL_NAME/onnx \
  <device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/
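Large checkpoints can be truncated or corrupted in transit, so a checksum manifest is a cheap safeguard. This is a minimal sketch that uses sha256sum from GNU coreutils and is not a toolkit feature. The function names and the manifest file name SHA256SUMS are hypothetical choices.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the toolkit); require GNU sha256sum.

# Write SHA-256 checksums for every file under a directory into
# <dir>/SHA256SUMS (the manifest itself is excluded from the listing).
write_manifest() {
    ( cd "$1" && find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS )
}

# Verify a directory against its manifest; exits non-zero on any mismatch
# or missing file, and prints only the files that fail.
verify_manifest() {
    ( cd "$1" && sha256sum --quiet -c SHA256SUMS )
}

# On the host, before scp:
#   write_manifest "$MODEL_NAME/onnx"
# On the device, after scp:
#   verify_manifest "$HOME/tensorrt-edgellm-workspace/$MODEL_NAME/onnx"
```

The manifest is written inside the ONNX directory, so scp -r carries it to the device along with the model files.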
Step 3: Build Engine (Edge Device)
The llm_build invocation is the same as for regular LLMs:
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen3-30B-A3B-GPTQ-Int4
cd /path/to/TensorRT-Edge-LLM
./build/examples/llm/llm_build \
  --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
  --maxBatchSize 1 \
  --maxInputLen 3072 \
  --maxKVCacheCapacity 4096
Step 4: Run Inference (Edge Device)
The llm_inference invocation is the same as for regular LLMs:
cd /path/to/TensorRT-Edge-LLM
./build/examples/llm/llm_inference \
  --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
  --inputFile $WORKSPACE_DIR/input.json \
  --outputFile $WORKSPACE_DIR/output.json
MoE export runs on the CPU only; build and inference use the same llm_build and llm_inference binaries as standard LLMs.