Omni (Audio + Vision + Speech I/O)#
End-to-end workflow for Qwen3-Omni — a unified multimodal model that ingests text + audio + image and emits text + speech. Covers quantization, ONNX export, engine build, and inference.
Supported models#
Model |
Backbone |
HF checkpoint |
Recipes |
|---|---|---|---|
Qwen3-Omni-30B-A3B-Instruct |
MoE (128 experts) |
NVFP4 Thinker + Talker |
Omni uses a six-engine layout:
Thinker (text decoder, generates assistant tokens) + Talker (text decoder, generates codec tokens) + CodePredictor (small head over codec tokens) + Audio encoder (Whisper-style mel) + Visual encoder (Qwen3-VL ViT) + Code2Wav (codec → 24 kHz PCM).
Prerequisites: Complete the Installation Guide before proceeding.
Part 1: Quantize (NVFP4 Thinker + Talker)#
First set the shell variables used throughout the remaining parts:
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
mkdir -p $WORKSPACE_DIR
export OMNI_MODEL=Qwen3-Omni-30B-A3B-Instruct
export HF_ROOT=Qwen/$OMNI_MODEL # or a local snapshot dir
export ONNX=$WORKSPACE_DIR/$OMNI_MODEL/onnx
export ENG=$WORKSPACE_DIR/$OMNI_MODEL/engines
Thinker and Talker text-MoE backbones are jointly post-training quantized
to NVFP4 in a single ModelOpt pass. Calibration uses a multimodal dataset
(LibriSpeech audio + MMMU images + cnn_dailymail text) chained through the
FP16 Thinker so the Talker sees realistic hidden_projection /
text_projection inputs. The visual encoder, audio encoder, code2wav
vocoder, talker projections, and code-predictor head stay in FP16.
export QUANT_ROOT=$WORKSPACE_DIR/$OMNI_MODEL/nvfp4
tensorrt-edgellm-quantize qwen3-omni \
--model_dir $HF_ROOT \
--output_dir $QUANT_ROOT \
--talker_num_audio 150 \
--talker_num_image 150 \
--talker_num_text 200
Output:
$QUANT_ROOT/
├── thinker/ # standalone HF NVFP4 ckpt, model_type=qwen3_omni_moe_text
└── talker/ # standalone HF NVFP4 ckpt, model_type=qwen3_omni_moe_talker
Part 2: Export ONNX#
tensorrt-edgellm-export exports every component the input checkpoint
supports. By default it runs every stage; pass --components to restrict
the run to a subset (useful when iterating on a single component).
CLI reference (Omni-relevant flags)#
Flag |
Description |
|---|---|
|
Comma-separated allow-list ( |
|
Path to a full HF root checkpoint. When set, the Talker stage extracts |
|
Write |
|
Use |
|
Negative filters: skip these components entirely (applied on top of |
Export commands (three exports)#
The NVFP4 quantizer writes Thinker and Talker as standalone checkpoints
(separate config.json per submodel). Visual / audio / code2wav /
code_predictor remain in the original HF root.
mkdir -p $ONNX
# Thinker: NVFP4 standalone Thinker checkpoint -> ONNX
tensorrt-edgellm-export $QUANT_ROOT/thinker $ONNX/thinker
# Talker: NVFP4 standalone Talker checkpoint -> ONNX + projection weights
tensorrt-edgellm-export \
--talker-sidecar-from $HF_ROOT \
$QUANT_ROOT/talker $ONNX/talker
# Visual + audio + code2wav + code_predictor: all from the HF root
tensorrt-edgellm-export \
--components visual,audio,code2wav,code_predictor \
$HF_ROOT $ONNX/multimodal
Expected layout#
$ONNX/
├── thinker/llm/ # Thinker MoE ONNX
│ ├── model.onnx + model.onnx.data
│ ├── config.json # model: qwen3_omni_moe_text
│ ├── embedding.safetensors
│ ├── processed_chat_template.json
│ └── tokenizer files
├── talker/llm/ # Talker MoE ONNX + sidecars
│ ├── model.onnx + model.onnx.data
│ ├── config.json # model: qwen3_omni_moe_talker
│ ├── embedding.safetensors # codec embedding
│ ├── hidden_projection.safetensors # extracted from HF root
│ └── text_projection.safetensors # extracted from HF root
└── multimodal/
├── audio/ # audio encoder
├── visual/ # visual ViT
├── code2wav/ # codec → PCM
└── code_predictor/ # codec head
Part 3: Build Engines#
Six engines total; llm_build covers Thinker / Talker / CodePredictor,
audio_build covers the audio encoder and Code2Wav, visual_build covers
the visual encoder.
export THINKER_ONNX=$ONNX/thinker/llm
export TALKER_ONNX=$ONNX/talker/llm
export CP_ONNX=$ONNX/multimodal/code_predictor
export AUDIO_ONNX=$ONNX/multimodal/audio
export CODE2WAV_ONNX=$ONNX/multimodal/code2wav
export VISUAL_ONNX=$ONNX/multimodal/visual
Build commands#
# 1. Thinker
./build/examples/llm/llm_build \
--onnxDir $THINKER_ONNX \
--engineDir $ENG/thinker \
--maxBatchSize 1 \
--maxInputLen 1024 \
--maxKVCacheCapacity 2048
# 2. Talker (sidecars live next to the engine; symlink or copy)
./build/examples/llm/llm_build \
--onnxDir $TALKER_ONNX \
--engineDir $ENG/talker \
--maxBatchSize 1 \
--maxInputLen 1024 \
--maxKVCacheCapacity 2048
ln -sf $TALKER_ONNX/hidden_projection.safetensors $ENG/talker/
ln -sf $TALKER_ONNX/text_projection.safetensors $ENG/talker/
# 3. CodePredictor — must sit next to the Talker engine (the runtime
# derives its path from --talkerEngineDir as <parent>/code_predictor)
./build/examples/llm/llm_build \
--onnxDir $CP_ONNX \
--engineDir $ENG/code_predictor \
--maxBatchSize 1 \
--maxInputLen 256 \
--maxKVCacheCapacity 256
# 4. Code2Wav
./build/examples/multimodal/audio_build \
--onnxDir $CODE2WAV_ONNX \
--engineDir $ENG/code2wav \
--maxCodeLen 1000
# 5. Audio encoder
./build/examples/multimodal/audio_build \
--onnxDir $AUDIO_ONNX \
--engineDir $ENG/multimodal/audio
# 6. Visual encoder
./build/examples/multimodal/visual_build \
--onnxDir $VISUAL_ONNX \
--engineDir $ENG/multimodal/visual
--maxBatchSize can be raised to the concurrency budget needed at
runtime. Place the CodePredictor engine in the same parent directory as
the Talker engine — the Qwen3-Omni runtime derives its path from
--talkerEngineDir (<parent>/code_predictor).
Part 4: Preprocess Audio Input#
Audio files must be converted to mel-spectrogram safetensors before inference (same as ASR):
tensorrt-edgellm-preprocess-audio \
--input /path/to/audio.wav \
--output $WORKSPACE_DIR/audio_input.safetensors
Supported formats: .wav, .mp3, .flac, .ogg, .m4a.
Part 5: Run Inference#
llm_inference drives all four scenarios. Audio output is opt-in via
--enableAudioOutput; without that flag only the Thinker generates text.
The runtime detects model_type from the engine config.
Scenario A: Text input → text output#
{
"max_generate_length": 256,
"requests": [
{
"messages": [
{"role": "system", "content": ""},
{"role": "user", "content": [{"type": "text", "text": "What is the capital of France?"}]}
]
}
]
}
./build/examples/llm/llm_inference \
--engineDir $ENG/thinker \
--multimodalEngineDir $ENG/multimodal \
--inputFile input_text.json \
--outputFile output_text.json
Scenario B: Audio input → text output#
messages[*].content carries {"type": "audio", "audio": "<preprocessed.safetensors>"}.
Scenario C: Image input → text output#
messages[*].content carries {"type": "image", "image": "<path/to/image.jpg>"}.
Scenario D: Text/audio/image input → text + speech output#
./build/examples/llm/llm_inference \
--engineDir $ENG/thinker \
--multimodalEngineDir $ENG/multimodal \
--talkerEngineDir $ENG/talker \
--code2wavEngineDir $ENG/code2wav \
--inputFile input.json \
--outputFile output.json \
--outputAudioDir ./audio_output \
--enableAudioOutput \
--enableThinkerTalkerStreaming
--enableThinkerTalkerStreaming interleaves Talker prefill / decode with
Thinker token generation so the first audio chunk is emitted before the
Thinker text is fully complete, significantly reducing time-to-first-audio.
Input file extra parameters (audio output)#
Parameter |
Default |
Description |
|---|---|---|
|
0.9 |
Sampling temperature for codec tokens |
|
10 |
Top-K (top_k=10 empirically best for English TTS) |
|
1.0 |
Top-P (kept at 1.0 with low top_k) |
|
1.0 |
Penalize repeated codec tokens |
|
4096 |
Max codec frames per request |
|
config default |
Speaker name: |
Output JSON Example#
{
"responses": [
{
"request_idx": 0,
"output_text": "The capital of France is Paris.",
"audio_file": "./audio_output/audio_req0_batch0.wav",
"audio_samples": 88830,
"audio_sample_rate": 24000,
"audio_duration_ms": 3700
}
]
}
Notes#
Calibration quality: the default
--talker_num_{audio,image,text}values were tuned for English TTS. Reducing--talker_num_textbelow 100 or removing modalities (audio/image) can regress both OmniBench and TTS WER.Talker is the critical path: the Talker MoE is heavier per step than the Thinker MoE on the same model (Talker 128 experts × routed top-6 + shared expert, vs Thinker 128 experts × top-8). In streaming mode the Talker drives the audio-out latency.
Talker
text_projectionMLP: the Talker consumes the Thinker hidden state through atext_projectionMLP sidecar. Build C++ with-DENABLE_CUTE_DSL=gemm— without it this MLP silently returns zeros on Ampere/Blackwell, producing garbled or empty Talker audio.