ASR (Automatic Speech Recognition)
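This guide covers the complete workflow for speech recognition with an audio-capable model: export to ONNX, transfer to the device, build engines, preprocess audio, and run inference.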
Example model: Qwen3-ASR-0.6B
Prerequisites: Complete the Installation Guide before proceeding.
Step 1: Export (x86 Host)
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen3-ASR-0.6B
mkdir -p $WORKSPACE_DIR
cd $WORKSPACE_DIR
# Export audio encoder
tensorrt-edgellm-export-audio \
    --model_dir Qwen/Qwen3-ASR-0.6B \
    --output_dir $MODEL_NAME/onnx/audio
# Export language model
tensorrt-edgellm-export-llm \
    --model_dir Qwen/Qwen3-ASR-0.6B \
    --output_dir $MODEL_NAME/onnx/llm
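Before moving on, it can help to confirm that both export commands actually wrote files. The helper below is a minimal sketch and not part of the toolkit: `require_nonempty_dir` is a hypothetical name, and it only checks that a directory exists and holds at least one regular file, without assuming any specific file names.

```shell
# Hypothetical helper (not part of the toolkit): succeed only if the
# directory exists and contains at least one regular file.
require_nonempty_dir() {
    local dir="$1"
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir" >&2
        return 1
    fi
    # Look for at least one regular file anywhere below the directory
    if [ -z "$(find "$dir" -type f | head -n 1)" ]; then
        echo "no files found in: $dir" >&2
        return 1
    fi
    echo "ok: $dir"
}

# Usage after Step 1, run from $WORKSPACE_DIR:
#   require_nonempty_dir "$MODEL_NAME/onnx/audio" &&
#   require_nonempty_dir "$MODEL_NAME/onnx/llm"
```

The same helper can be pointed at the engine directories once they have been built on the device.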
Step 2: Transfer to Device
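# Create the destination directory on the device (scp does not create missing parent directories)
ssh <device_user>@<device_ip> "mkdir -p ~/tensorrt-edgellm-workspace/$MODEL_NAME"
# Transfer the ONNX files to the device
scp -r $MODEL_NAME/onnx \
    <device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/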
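To confirm that the files arrived intact, one option is a checksum manifest written on the host before the copy and checked on the device afterwards. This is a sketch under stated assumptions: `write_manifest` and `verify_manifest` are hypothetical helper names, the manifest file name `SHA256SUMS` is arbitrary, and GNU coreutils (`sha256sum`, `sort -z`, `xargs -r`) is assumed to be available on both machines.

```shell
# Hypothetical helpers; assume GNU coreutils on both host and device.

# Write a SHA256SUMS manifest covering every file under the given directory.
write_manifest() {
    ( cd "$1" &&
      find . -type f ! -name SHA256SUMS -print0 | sort -z |
      xargs -0 -r sha256sum > SHA256SUMS )
}

# Re-hash the files and compare them with the manifest.
# Returns non-zero if any file is missing or differs.
verify_manifest() {
    ( cd "$1" && sha256sum --quiet -c SHA256SUMS )
}

# Usage:
#   host, before scp:     write_manifest "$MODEL_NAME/onnx"
#   device, after scp:    verify_manifest ~/tensorrt-edgellm-workspace/Qwen3-ASR-0.6B/onnx
```

Because the manifest lives inside the `onnx` directory, it travels with the files during the recursive copy.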
Step 3: Build Engines (Thor Device)
# Set up workspace directory on device
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen3-ASR-0.6B
cd ~/TensorRT-Edge-LLM
# Build language model engine
./build/examples/llm/llm_build \
    --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/llm \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
    --maxBatchSize 1 \
    --maxInputLen 1024 \
    --maxKVCacheCapacity 4096
# Build audio encoder engine
./build/examples/multimodal/audio_build \
    --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/audio \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/audio \
    --minTimeSteps 1000 \
    --maxTimeSteps 3000
Step 4: Preprocess Audio Input (x86 Host or Thor Device)
Audio files must be converted to mel-spectrogram safetensors format before inference:
cd ~/TensorRT-Edge-LLM
# Convert an audio file (WAV in this example) to safetensors mel-spectrogram format
python -m tensorrt_edgellm.scripts.preprocess_audio \
    --input /path/to/audio.wav \
    --output $WORKSPACE_DIR/audio_input.safetensors
Note: Supported audio formats include .wav, .mp3, .flac, .ogg, and .m4a. If preprocessing is done on the x86 host, transfer the output safetensors file to the device before running inference.
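When several recordings need converting, the preprocessing command can be wrapped in a loop. The following is a minimal sketch: `safetensors_name` and `preprocess_dir` are hypothetical helper names, the directory arguments are placeholders, and only the `--input` and `--output` flags shown above are used.

```shell
# Hypothetical helpers for converting several audio files in one pass.

# Map an input path to its output file name:
#   /data/clip.wav -> clip.safetensors
safetensors_name() {
    local base
    base=$(basename "$1")
    echo "${base%.*}.safetensors"
}

# Convert every file with a supported extension in <in_dir> into <out_dir>.
# Run from ~/TensorRT-Edge-LLM, as in the single-file command.
preprocess_dir() {
    local in_dir="$1" out_dir="$2" f
    mkdir -p "$out_dir"
    for f in "$in_dir"/*.wav "$in_dir"/*.mp3 "$in_dir"/*.flac \
             "$in_dir"/*.ogg "$in_dir"/*.m4a; do
        [ -e "$f" ] || continue   # skip patterns that matched nothing
        python -m tensorrt_edgellm.scripts.preprocess_audio \
            --input "$f" \
            --output "$out_dir/$(safetensors_name "$f")"
    done
}

# Usage:
#   preprocess_dir /path/to/recordings "$WORKSPACE_DIR/audio_inputs"
```

Files that share a base name but differ in extension (for example `clip.wav` and `clip.mp3`) map to the same output name, so the later one overwrites the earlier one.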
Step 5: Run Inference (Thor Device)
Create an input file $WORKSPACE_DIR/input_asr.json (replace /path/to/audio_input.safetensors with the actual preprocessed audio file path):
{
    "batch_size": 1,
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 50,
    "max_generate_length": 256,
    "requests": [
        {
            "messages": [
                {
                    "role": "system",
                    "content": ""
                },
                {
                    "role": "user",
                    "content": [{"type": "audio", "audio": "/path/to/audio_input.safetensors"}]
                }
            ]
        }
    ]
}
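When the audio path changes between runs, generating the request file from a shell function avoids hand-editing the JSON. The sketch below reuses the fields from the example above; `write_asr_input` is a hypothetical helper name, and it assumes the path contains no double quotes or backslashes, since the path is pasted into the JSON unescaped.

```shell
# Hypothetical helper: write the request file for one preprocessed audio file.
# Usage: write_asr_input <audio.safetensors> <output.json>
write_asr_input() {
    local audio_path="$1" out_file="$2"
    # Unquoted heredoc delimiter so that $audio_path is expanded
    cat > "$out_file" <<EOF
{
    "batch_size": 1,
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 50,
    "max_generate_length": 256,
    "requests": [
        {
            "messages": [
                {"role": "system", "content": ""},
                {
                    "role": "user",
                    "content": [{"type": "audio", "audio": "$audio_path"}]
                }
            ]
        }
    ]
}
EOF
}

# Usage:
#   write_asr_input "$WORKSPACE_DIR/audio_input.safetensors" \
#                   "$WORKSPACE_DIR/input_asr.json"
```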
Run inference:
cd ~/TensorRT-Edge-LLM
./build/examples/llm/llm_inference \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
    --multimodalEngineDir $WORKSPACE_DIR/$MODEL_NAME/engines/audio \
    --inputFile $WORKSPACE_DIR/input_asr.json \
    --outputFile $WORKSPACE_DIR/output_asr.json
Check $WORKSPACE_DIR/output_asr.json for the speech recognition transcription.
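A quick way to read the result is to pretty-print the file. The sketch below makes no assumption about the output schema; `show_json` is a hypothetical helper name that uses Python's standard `json.tool` module when `python3` is available and falls back to `cat` otherwise.

```shell
# Hypothetical helper: pretty-print a JSON file, falling back to cat when
# python3 is not installed. No particular output schema is assumed.
show_json() {
    local file="$1"
    if [ ! -s "$file" ]; then
        echo "missing or empty: $file" >&2
        return 1
    fi
    if command -v python3 >/dev/null 2>&1; then
        python3 -m json.tool "$file"
    else
        cat "$file"
    fi
}

# Usage:
#   show_json "$WORKSPACE_DIR/output_asr.json"
```

By default `json.tool` escapes non-ASCII characters, so transcriptions in languages such as Chinese appear as `\uXXXX` sequences; on Python 3.9 and later, adding `--no-ensure-ascii` to the `json.tool` call should keep them readable.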