Quick Start Guide#
Repository: github.com/NVIDIA/TensorRT-Edge-LLM
For the NVIDIA DRIVE platform, refer to the documentation shipped with the DriveOS release.
This quick start guide will get you up and running with TensorRT Edge-LLM in ~15 minutes.
Prerequisites#
NVIDIA Jetson Thor
JetPack 7.1
Internet connection for downloading models
x86 Linux host with a GPU for model export
Step 1: Clone Repository (on both x86 host and device)#
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive
Step 2: Install Python Package (on x86 host)#
# Install Python package with all dependencies
pip3 install .
This installs the Python export tools and all required dependencies.
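To confirm the installation, you can check that the command-line entry points used later in this guide are on your PATH (a quick sanity check, assuming pip placed the console scripts in a directory on your PATH):
# Verify the export CLI entry points are installed
which tensorrt-edgellm-quantize-llm
which tensorrt-edgellm-export-llm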
Step 3: Build C++ Project (on device)#
Ensure that CUDA and TensorRT are installed via JetPack. TensorRT should be installed under /usr.
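Before building, you can optionally confirm that the JetPack-provided CUDA and TensorRT packages are present (a rough check via dpkg; exact package names vary between JetPack releases):
# Optional: list installed CUDA and TensorRT packages from JetPack
dpkg -l | grep -iE "cuda|tensorrt"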
# Install system build tools if needed
sudo apt update
sudo apt install cmake build-essential
# Build the project
mkdir build
cd build
cmake .. \
    -DTRT_PACKAGE_DIR=/usr \
    -DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
    -DEMBEDDED_TARGET=jetson-thor
make -j$(nproc)
Note: Both the toolchain file and embedded target are required for all Edge device builds.
This will build:
C++ engine builder applications
C++ runtime and examples
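If the build succeeds, the example binaries used in the next steps should appear under build/examples/llm/. A quick way to confirm, run from the repository root (exact contents may vary by release):
# Confirm the example binaries were built
ls build/examples/llm/
# Expect llm_build and llm_inference among the outputs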
Step 4: Download and Export a Model (on x86 host)#
Let’s use Qwen3-0.6B as a lightweight example:
# Quantize to FP8 (downloads model automatically)
tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen3-0.6B \
    --output_dir ./quantized/qwen3-0.6b \
    --quantization fp8
# Export to ONNX
tensorrt-edgellm-export-llm \
    --model_dir ./quantized/qwen3-0.6b \
    --output_dir ./onnx_models/qwen3-0.6b
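The next step runs on the Thor device, so the exported ONNX folder needs to be copied there. One way is scp (a sketch only; user@thor and the destination path are placeholders for your device's login and the location of your clone):
# Copy the exported ONNX folder to the Thor device (placeholder host and path)
ssh user@thor "mkdir -p ~/TensorRT-Edge-LLM/onnx_models"
scp -r ./onnx_models/qwen3-0.6b user@thor:~/TensorRT-Edge-LLM/onnx_models/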
Step 5: Build TensorRT Engine (on Thor device)#
Transfer the entire ONNX folder to your Thor device (for example, with the scp sketch above), then:
./build/examples/llm/llm_build \
    --onnxDir ./onnx_models/qwen3-0.6b \
    --engineDir ./engines/qwen3-0.6b
Build time: ~2-5 minutes
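Once the build completes, you can list the engine directory to confirm that artifacts were written (the exact file names depend on the model and release):
# Check that the engine directory was populated
ls -lh ./engines/qwen3-0.6b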
Step 6: Run Inference#
Create an input file input.json with your prompts:
{
  "batch_size": 1,
  "temperature": 1.0,
  "top_p": 1.0,
  "top_k": 50,
  "max_generate_length": 128,
  "requests": [
    {
      "messages": [
        {
          "role": "user",
          "content": "What is the capital of the United States?"
        }
      ]
    }
  ]
}
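Before running inference, you can optionally verify that the file parses as valid JSON using Python's built-in json.tool module:
# Optional: validate and pretty-print input.json
python3 -m json.tool input.json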
Then run inference:
./build/examples/llm/llm_inference \
    --engineDir ./engines/qwen3-0.6b \
    --inputFile input.json \
    --outputFile output.json
Input Format: Our format closely matches the OpenAI API format. Please see examples/llm/INPUT_FORMAT.md for more details. Some example input files are available in tests/test_cases/ (for example, llm_basic.json).
Success! 🎉 Check output.json for model responses.
Next Steps#
Now that you’ve completed the quick start, continue to:
Installation: Detailed installation instructions for both Python export pipeline and C++ runtime
Supported Models: Learn about supported models and how to prepare them
Examples: Explore example applications and use cases
Customization Guide: Learn how to customize and extend the framework for your specific needs
For questions or issues, visit our TensorRT Edge-LLM GitHub repository.