TTS (Text-to-Speech)#
This guide covers the full pipeline for running Qwen3-TTS: export on x86 host, engine build on device, and inference.
Supported model: Qwen3-TTS-12Hz-1.7B-CustomVoice
Note: Unlike Qwen3-Omni, Qwen3-TTS has no Thinker or visual encoder. The text embedding is self-contained in the Talker and exported as
text_embedding.safetensors.
Prerequisites: Complete the Installation Guide before proceeding.
Part 0: Install TTS Dependency (x86 Host)#
The export pipeline loads the Qwen3-TTS model via the qwen-tts package. Install it before exporting:
Warning: Installing
qwen-ttsmay break package versions in your current environment (e.g.transformers,torch). Use a dedicated virtual environment for Qwen3-TTS export only — do not share it with other model workflows. We need this special workflow until Qwen3-TTS gets merged into HuggingFace transformers.
cd TensorRT-Edge-LLM
python3 -m venv venv-qwen3-tts
source venv-qwen3-tts/bin/activate
pip3 install qwen-tts # install qwen3-tts and its required dependencies.
pip3 install . # install TensorRT Edge-LLM export dependencies
Part 1: Export on x86 Host#
Qwen3-TTS has three components. Export them separately.
Export Talker and CodePredictor (LLM components)#
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export TTS_MODEL=Qwen3-TTS-12Hz-1.7B-CustomVoice
export ONNX_OUTPUT_DIR=$WORKSPACE_DIR/$TTS_MODEL/onnx
export TTS_CHAT_TEMPLATE=./tensorrt_edgellm/chat_templates/templates/qwen3tts.json
# Exports talker/ and code_predictor/ subdirectories
tensorrt-edgellm-export-llm \
--model_dir Qwen/$TTS_MODEL \
--output_dir $ONNX_OUTPUT_DIR \
--chat_template $TTS_CHAT_TEMPLATE \
--export_models talker,code_predictor
Export Code2Wav vocoder#
# Exports tokenizer_decoder/ subdirectory
tensorrt-edgellm-export-audio \
--model_dir Qwen/$TTS_MODEL \
--output_dir $ONNX_OUTPUT_DIR \
--export_models tokenizer_decoder
Expected Export Output#
$ONNX_OUTPUT_DIR/
├── llm/
│ ├── talker/
│ │ ├── model.onnx + onnx_model.data # Talker ONNX
│ │ ├── config.json # model_type: qwen3_tts_talker
│ │ ├── embedding.safetensors # codec embedding
│ │ ├── text_embedding.safetensors # TTS-only (no Thinker)
│ │ └── text_projection.safetensors
│ ├── code_predictor/
│ │ ├── model.onnx + onnx_model.data # CodePredictor ONNX
│ │ ├── config.json
│ │ ├── codec_embeddings.safetensors # 15 embeddings
│ │ ├── lm_heads.safetensors # 15 lm_heads
│ │ └── small_to_mtp_projection.safetensors # if not Identity
│ ├── tokenizer_config.json # at top level (no thinker/)
│ ├── processed_chat_template.json # chat template for runtime
│ └── tokenizer files # tokenizer vocab/merges copied from model
└── audio/
├── tokenizer_decoder/
│ ├── model.onnx + onnx_model.data # Code2Wav vocoder
│ └── config.json
└── speaker_encoder/ # Base models only
├── model.onnx + onnx_model.data
└── config.json
Transfer to Device#
scp -r $ONNX_OUTPUT_DIR <user>@<device>:~/tensorrt-edgellm-workspace/$TTS_MODEL/
Part 2: Build Engines#
Three engine builds are required. Run these on the edge device.
cd ~/TensorRT-Edge-LLM
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export TTS_MODEL=Qwen3-TTS-12Hz-1.7B-CustomVoice
export ONNX=$WORKSPACE_DIR/$TTS_MODEL/onnx
export ENG=$WORKSPACE_DIR/$TTS_MODEL/engines
# 1. Build Talker LLM engine
./build/examples/llm/llm_build \
--onnxDir $ONNX/llm/talker \
--engineDir $ENG/talker \
--maxInputLen 4096 \
--maxKVCacheCapacity 4096 \
--maxBatchSize 1
# 2. Build CodePredictor LLM engine
./build/examples/llm/llm_build \
--onnxDir $ONNX/llm/code_predictor \
--engineDir $ENG/code_predictor \
--maxInputLen 4096 \
--maxKVCacheCapacity 4096 \
--maxBatchSize 1
# 3. Build Code2Wav engine
./build/examples/multimodal/audio_build \
--onnxDir $ONNX/audio/tokenizer_decoder \
--engineDir $ENG/code2wav
Note:
--maxBatchSizemust be set to 1. The Qwen3-TTS ONNX export uses a fixed batch size of 1; larger values are not supported.
Build time: < 5 minutes
Copy Tokenizer and Chat Template Files to Engine Folder#
The runtime loads the tokenizer and chat template from the engine directory. Copy the required files from the ONNX export output:
cp $ONNX/llm/*.json $ENG/ # includes tokenizer_config.json, processed_chat_template.json, etc.
Part 3: Run Inference#
Input File Format#
Each request specifies a messages array and an optional per-request speaker. If omitted, the top-level speaker default is used.
{
"talker_temperature": 0.9,
"talker_top_k": 50,
"repetition_penalty": 1.05,
"speaker": "ryan",
"requests": [
{
"messages": [{"role": "assistant", "content": "Hello, how can I help you today?"}]
},
{
"speaker": "serena",
"messages": [{"role": "assistant", "content": "The weather is sunny and warm."}]
}
]
}
Available speakers: ryan, serena, aiden, vivian, dylan, eric, uncle_fu, ono_anna, sohee
Sampling parameters:
Parameter |
Default |
Description |
|---|---|---|
|
0.9 |
Sampling temperature |
|
50 |
Top-K sampling |
|
1.0 |
Top-P sampling |
|
1.05 |
Penalize repeated codec tokens |
|
4096 |
Max codec frames per request |
|
config default |
Top-level speaker fallback |
Run#
cd /path/to/TensorRT-Edge-LLM
./build/examples/omni/qwen3_tts_inference \
--talkerEngineDir $ENG/talker \
--code2wavEngineDir $ENG/code2wav \
--tokenizerDir $ENG \
--inputFile input.json \
--outputFile output.json \
--outputAudioDir ./audio_output
Generated .wav files are named audio_req{N}.wav (one per request). The output JSON records per-request metadata: audio file path, sample count, duration, and RVQ code file path.
Output JSON Example#
{
"responses": [
{
"request_idx": 0,
"messages": [{"role": "assistant", "content": "Hello, how can I help you today?"}],
"audio_file": "./audio_output/audio_req0.wav",
"audio_samples": 120960,
"audio_sample_rate": 24000,
"audio_duration_ms": 5040,
"rvq_file": "./audio_output/rvq_req0.safetensors"
}
]
}
Notes#
--code2wavEngineDiris optional: auto-detected asparent(talkerEngineDir)/code2wavif not set.RVQ code files (
.safetensors) are saved alongside audio when--outputAudioDiris set and can be used to re-synthesize audio without re-running the TTS model.