Quick Start Guide#

This quick start guide will get you up and running with TensorRT Edge-LLM in ~15 minutes.

Prerequisites: Complete the Installation Guide on the machine where you will run TensorRT Edge-LLM. For the manual export and device-transfer path, set up both the x86 host and the edge device.



Manual Export and C++ Runtime Path#

Use this path when you need explicit control over ONNX export, engine build flags, file transfer, or the low-level C++ examples.

Part 1: Export on x86 Host#

This part runs on a standard x86 host with an NVIDIA GPU. DriveOS users: this process does not need to run in DriveOS Docker; use your regular x86 development machine.
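
Before exporting, it can help to confirm the host GPU is actually visible. A minimal check (the PyTorch line assumes a CUDA-enabled PyTorch was installed per the Installation Guide):

# Confirm the NVIDIA driver can see the GPU
nvidia-smi

# Optional: confirm PyTorch can use it
python3 -c "import torch; print(torch.cuda.is_available())"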

Alternative: Legacy Pipeline#

Deprecated: The legacy tensorrt_edgellm pipeline is kept for compatibility. New workflows should use llm_loader. The tensorrt_edgellm/ folder will be removed in 0.8.0, with full feature parity provided by the experimental/quantization -> experimental/llm_loader workflow for all models and features. See the migration guide.

# Set the model name used for the output directories below
export MODEL_NAME=Qwen3-0.6B

# Step 1: Quantize to FP8 (downloads model automatically)
tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen3-0.6B \
    --output_dir $MODEL_NAME/quantized \
    --quantization fp8

# Step 2: Export to ONNX
tensorrt-edgellm-export-llm \
    --model_dir $MODEL_NAME/quantized \
    --output_dir $MODEL_NAME/onnx

Troubleshooting: If you encounter issues during export, see the Python Export Pipeline - Common Issues and Solutions.
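
Before moving on, a quick sanity check that the export produced an ONNX folder (the exact files inside depend on the model):

# The export step should have populated this folder
ls $MODEL_NAME/onnx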

Transfer to Device#

Transfer the ONNX folder to your Thor device:

# From x86 host - transfer to device
scp -r $MODEL_NAME/onnx <device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/

Note: Replace <device_user> and <device_ip> with your actual device credentials (e.g., nvidia@192.168.1.100). If the directory doesn’t exist on the device, create it first: ssh <device_user>@<device_ip> "mkdir -p ~/tensorrt-edgellm-workspace/$MODEL_NAME"
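
To confirm the transfer, you can list the copied files from the host, using the same placeholder credentials:

# Verify the ONNX files arrived on the device
ssh <device_user>@<device_ip> "ls ~/tensorrt-edgellm-workspace/$MODEL_NAME/onnx"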


Part 2: Build and Run on Edge Device#

Build TensorRT Engine#

On your Thor device:

# Set up workspace directory
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Qwen3-0.6B
cd ~/TensorRT-Edge-LLM

# Build engine
./build/examples/llm/llm_build \
    --onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
    --maxBatchSize 1 \
    --maxInputLen 1024 \
    --maxKVCacheCapacity 4096

Build time: ~2-5 minutes
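
When the build completes, the engine directory should contain the generated engine files. A quick check (exact file names vary by model and build settings):

# List the generated engine artifacts
ls $WORKSPACE_DIR/$MODEL_NAME/engines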

Run Inference#

Create an input file with a sample question:

cat > $WORKSPACE_DIR/input.json << 'EOF'
{
    "batch_size": 1,
    "temperature": 1.0,
    "top_p": 1.0,
    "top_k": 50,
    "max_generate_length": 128,
    "requests": [
        {
            "messages": [
                {
                    "role": "user",
                    "content": "What is the capital of United States?"
                }
            ]
        }
    ]
}
EOF

Tip: You can also use example input files from ~/TensorRT-Edge-LLM/tests/test_cases/ (e.g., llm_basic.json) instead of creating your own.
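
For example, to reuse the bundled LLM test case instead of writing input.json by hand:

# Copy the repository's example input into the workspace
cp ~/TensorRT-Edge-LLM/tests/test_cases/llm_basic.json $WORKSPACE_DIR/input.json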

Run inference:

cd ~/TensorRT-Edge-LLM

./build/examples/llm/llm_inference \
    --engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
    --inputFile $WORKSPACE_DIR/input.json \
    --outputFile $WORKSPACE_DIR/output.json

Verify the output:

# View the model response
cat $WORKSPACE_DIR/output.json

You should see a JSON response with the model’s answer, similar to:

{
  "responses": [
    {
      "text": "The capital of the United States is Washington, D.C.",
      "finish_reason": "stop"
    }
  ]
}
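
If jq is installed on the device, you can extract just the generated text (field names follow the response shown above):

# Print only the model's answer
jq -r '.responses[0].text' $WORKSPACE_DIR/output.json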

Success! 🎉 You’ve run LLM inference on your edge device!


Next Steps#

For more advanced workflows, see the example guides:

  • VLM Inference - Vision-language models with image understanding

  • Speculative Decoding - Speculative decoding for LLM and VLM

  • Phi-4-Multimodal - Phi-4 Multimodal

  • ASR - Automatic speech recognition

  • MoE - Mixture of Experts models (CPU-only export, Qwen3-30B-A3B-GPTQ-Int4)

  • TTS - Text-to-speech synthesis

Checkpoint-Based Loader: For detailed documentation on the recommended export pipeline, pre-quantized checkpoint support, and migration from the legacy tools, see Checkpoint-Based Model Loader.

Quantization: To create quantized checkpoints for llm_loader, see Quantization.

Experimental Python API and Server: To use the vLLM-style high-level Python API or an OpenAI-compatible chat server, see Experimental High-Level Python API and Server.

Input Format: Our input format closely matches the OpenAI API format. See the Input Format Guide for detailed specifications. Example input files are available in tests/test_cases/ (e.g., llm_basic.json, vlm_basic.json).