Quick Start Guide#
Repository: github.com/NVIDIA/TensorRT-Edge-LLM
For the NVIDIA DRIVE platform, please refer to the documentation shipped with the DriveOS release.
This quick start guide will get you up and running with TensorRT Edge-LLM in ~15 minutes.
Prerequisites: Complete the Installation Guide for both x86 host and edge device before proceeding.
Part 1: Export and Quantize on x86 Host#
Standard LLM Export#
Let’s use Qwen3-0.6B as a lightweight example:
# Set up workspace directory
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen3-0.6B
mkdir -p $WORKSPACE_DIR
cd $WORKSPACE_DIR
# Step 1: Quantize to FP8 (downloads model automatically)
tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen3-0.6B \
    --output_dir $MODEL_NAME/quantized \
    --quantization fp8
# Step 2: Export to ONNX
tensorrt-edgellm-export-llm \
    --model_dir $MODEL_NAME/quantized \
    --output_dir $MODEL_NAME/onnx
⚠️ Troubleshooting Export Issues: If you encounter issues during quantization or export, see the Python Export Pipeline - Common Issues and Solutions.
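Before transferring the export to the device, it can be worth sanity-checking that the output directory is complete. The sketch below is a generic helper, not part of the TensorRT Edge-LLM tooling, and the expected file names (`model.onnx`, `config.json`) are illustrative assumptions about the export layout — adjust them to whatever your export actually produces:

```python
import os
import tempfile

# Hypothetical sanity check: verify that an export directory contains the
# files an engine build would need. The file names below are illustrative
# assumptions, not the documented export layout.
EXPECTED_FILES = ["model.onnx", "config.json"]

def check_export_dir(onnx_dir):
    """Return the list of expected files missing from onnx_dir."""
    return [f for f in EXPECTED_FILES
            if not os.path.isfile(os.path.join(onnx_dir, f))]

# Demo against a throwaway directory containing only one of the files.
with tempfile.TemporaryDirectory() as d:
    open(os.path.join(d, "model.onnx"), "w").close()
    missing = check_export_dir(d)
    print(missing)  # lists the files that were not found
```

Running the same check against `$MODEL_NAME/onnx` should print an empty list once the export succeeds.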
Transfer to Device#
Transfer the ONNX folder to your Thor device:
# From x86 host - transfer to device
scp -r $MODEL_NAME/onnx <device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/
Note: Replace <device_user> and <device_ip> with your actual device credentials (e.g., nvidia@192.168.1.100). If the directory doesn’t exist on the device, create it first: ssh <device_user>@<device_ip> "mkdir -p ~/tensorrt-edgellm-workspace/$MODEL_NAME"
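If you want to confirm the copy was lossless, comparing checksums on both machines is a simple cross-platform option. This is a generic helper, not part of TensorRT Edge-LLM; run it against the same file on the host and on the device and compare the printed digests:

```python
import hashlib
import os
import tempfile

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo on a small temporary file so the sketch runs standalone.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    tmp = f.name
digest = sha256_of(tmp)
print(digest)
os.remove(tmp)
```

Matching digests on both sides mean the ONNX files arrived intact.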
Part 2: Build and Run on Edge Device#
Build TensorRT Engine#
On your Thor device:
# Set up workspace directory
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen3-0.6B
cd ~/TensorRT-Edge-LLM
# Build engine
./build/examples/llm/llm_build \
    --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
    --maxBatchSize 1 \
    --maxInputLen 1024 \
    --maxKVCacheCapacity 4096
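A rough way to think about the last two flags: if the KV-cache capacity bounds the total tokens held per sequence (prompt plus generated), then the headroom left for generation is the capacity minus the prompt length. That relationship is an illustrative assumption for sizing, not a statement of the tool's exact semantics:

```python
# Token-budget arithmetic for the build flags above, under the assumption
# that --maxKVCacheCapacity bounds prompt + generated tokens per sequence.
MAX_INPUT_LEN = 1024   # --maxInputLen
MAX_KV_CACHE = 4096    # --maxKVCacheCapacity

def generation_headroom(input_len, kv_capacity=MAX_KV_CACHE):
    """Tokens left for generation after the prompt fills part of the cache."""
    if input_len > kv_capacity:
        raise ValueError("prompt alone exceeds the KV-cache capacity")
    return kv_capacity - input_len

print(generation_headroom(MAX_INPUT_LEN))  # 3072 tokens of headroom
```

With these values, even a maximum-length 1024-token prompt leaves 3072 tokens of cache for generated output.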
Build time: ~2-5 minutes
Run Inference#
Create an input file with a sample question:
cat > $WORKSPACE_DIR/input.json << 'EOF'
{
  "batch_size": 1,
  "temperature": 1.0,
  "top_p": 1.0,
  "top_k": 50,
  "max_generate_length": 128,
  "requests": [
    {
      "messages": [
        {
          "role": "user",
          "content": "What is the capital of the United States?"
        }
      ]
    }
  ]
}
EOF
Tip: You can also use example input files from ~/TensorRT-Edge-LLM/tests/test_cases/ (e.g., llm_basic.json) instead of creating your own.
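The request file above can also be built programmatically rather than with a heredoc. The field names here mirror the example file; see the Input Format Guide for the authoritative schema:

```python
import json

# Build the same request shown above. Field names are taken from the
# example input file, not a full specification of the format.
request = {
    "batch_size": 1,
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 50,
    "max_generate_length": 128,
    "requests": [
        {
            "messages": [
                {"role": "user",
                 "content": "What is the capital of the United States?"}
            ]
        }
    ],
}

payload = json.dumps(request, indent=2)
print(payload)  # write this to $WORKSPACE_DIR/input.json
```

Generating the file this way makes it easy to script batches of prompts or sweep sampling parameters.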
Run inference:
cd ~/TensorRT-Edge-LLM
./build/examples/llm/llm_inference \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
    --inputFile $WORKSPACE_DIR/input.json \
    --outputFile $WORKSPACE_DIR/output.json
Verify the output:
# View the model response
cat $WORKSPACE_DIR/output.json
You should see a JSON response with the model’s answer, similar to:
{
  "responses": [
    {
      "text": "The capital of the United States is Washington, D.C.",
      "finish_reason": "stop"
    }
  ]
}
Success! 🎉 You’ve just run LLM inference on your edge device!
Next Steps#
For more advanced workflows, see Examples:
VLM Inference - Vision-language models with image understanding
EAGLE Speculative Decoding - Accelerated generation for LLM and VLM
LoRA Support - Dynamic adapter loading at runtime
Input Format: Our format closely matches the OpenAI API format. See the Input Format Guide for detailed specifications. Example input files are available in tests/test_cases/ (e.g., llm_basic.json, vlm_basic.json).