# MoE (Mixture of Experts)
Complete workflow for Mixture of Experts (MoE) models using a pre-quantized GPTQ-Int4 model.

Currently supported model: `Qwen3-30B-A3B-GPTQ-Int4`
Note: MoE export runs on CPU only (`--device cpu`); no GPU is required for the export step.
Prerequisites: Complete the Installation Guide before proceeding.
Additional dependencies: install `gptqmodel` (CPU-only build) and `optimum` 2.0.0:

```shell
BUILD_CUDA_EXT=0 pip install -v gptqmodel==4.2.5 --no-build-isolation
pip install optimum==2.0.0
```
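After installing, a quick sanity check confirms both packages are importable before attempting the export. This is a minimal sketch (not part of the toolkit); it only tests importability, not versions:

```python
import importlib.util

def check_packages(names):
    """Return a dict mapping each package name to whether it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Check the two extra dependencies required for MoE export.
status = check_packages(["gptqmodel", "optimum"])
for pkg, found in status.items():
    print(f"{pkg}: {'OK' if found else 'MISSING -- install it first'}")
```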
## Step 1: Export (x86 Host, CPU-only)
Export can be run on CPU with `--device cpu`; no GPU is required:

```shell
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen3-30B-A3B-GPTQ-Int4

mkdir -p $WORKSPACE_DIR
cd $WORKSPACE_DIR

tensorrt-edgellm-export-llm \
    --model_dir Qwen/Qwen3-30B-A3B-GPTQ-Int4 \
    --output_dir $MODEL_NAME/onnx \
    --device cpu
```
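Before transferring, it can help to verify the export actually produced files. The exporter's exact output layout is not documented on this page, so this sketch makes no schema assumptions: it only confirms the ONNX directory is non-empty and reports file sizes:

```python
from pathlib import Path

def summarize_export(onnx_dir):
    """List files under the exported ONNX directory with their sizes in MiB."""
    files = sorted(p for p in Path(onnx_dir).rglob("*") if p.is_file())
    if not files:
        raise FileNotFoundError(f"no files found under {onnx_dir} -- did the export fail?")
    for p in files:
        size_mib = p.stat().st_size / (1024 * 1024)
        print(f"{p.name}: {size_mib:.1f} MiB")
    return files

# Example: summarize_export("Qwen3-30B-A3B-GPTQ-Int4/onnx")
```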
## Step 2: Transfer to Device

```shell
# Transfer the exported ONNX model to the edge device
scp -r $MODEL_NAME/onnx \
    <device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/
```
## Step 3: Build Engine (Edge Device)

`llm_build` is invoked the same way as for regular LLMs:

```shell
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen3-30B-A3B-GPTQ-Int4

cd ~/TensorRT-Edge-LLM
./build/examples/llm/llm_build \
    --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
    --maxBatchSize 1 \
    --maxInputLen 3072 \
    --maxKVCacheCapacity 4096
```
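If `maxKVCacheCapacity` bounds the total tokens (prompt plus generated) per sequence — an interpretation of the flag, not confirmed by this page — the values above leave the following worst-case generation budget:

```python
# Hypothetical token-budget arithmetic for the build flags above.
max_input_len = 3072       # --maxInputLen
max_kv_capacity = 4096     # --maxKVCacheCapacity

# Assuming KV-cache capacity covers prompt + generated tokens, the
# worst-case generation budget for a full-length prompt is the difference:
max_new_tokens = max_kv_capacity - max_input_len
print(max_new_tokens)  # → 1024
```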
## Step 4: Run Inference (Edge Device)

`llm_inference` is invoked the same way as for regular LLMs:

```shell
cd ~/TensorRT-Edge-LLM
./build/examples/llm/llm_inference \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
    --inputFile $WORKSPACE_DIR/input.json \
    --outputFile $WORKSPACE_DIR/output.json
```
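The schemas of `input.json` and `output.json` are defined by the `llm_inference` example and are not shown on this page, so a schema-agnostic way to inspect the results is to load and pretty-print the file as plain JSON:

```python
import json
from pathlib import Path

def show_results(path):
    """Load a JSON results file and pretty-print it without assuming a schema."""
    data = json.loads(Path(path).read_text())
    print(json.dumps(data, indent=2, ensure_ascii=False))
    return data

# Example: show_results("output.json")
```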
In summary: MoE export is CPU-only, while engine build and inference use the same `llm_build` and `llm_inference` binaries as standard LLMs.