Quick Start Guide#
Repository: github.com/NVIDIA/TensorRT-Edge-LLM
For the NVIDIA DRIVE platform, refer to the documentation shipped with the DriveOS release.
This quick start guide will get you up and running with TensorRT Edge-LLM in ~15 minutes.
Prerequisites#
NVIDIA Jetson Thor
JetPack 7.1
Internet connection for downloading models
x86 Linux host with a GPU for model export
Step 1: Clone Repository (on both x86 host and device)#
git clone https://github.com/NVIDIA/TensorRT-Edge-LLM.git
cd TensorRT-Edge-LLM
git submodule update --init --recursive
Step 2: Install Python Package (on x86 host)#
# Install Python package with all dependencies
pip3 install .
This installs the Python export tools and all required dependencies.
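To confirm the installation, you can check that the command-line entry points used later in this guide are on your PATH (a quick sanity check, assuming pip placed the console scripts in a directory on your PATH):
# Verify the export CLI entry points are installed
which tensorrt-edgellm-quantize-llm
which tensorrt-edgellm-export-llm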
Step 3: Build C++ Project (on device)#
Ensure that CUDA and TensorRT are installed via JetPack. TensorRT should be installed under /usr.
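Before building, you can optionally confirm that the JetPack-provided CUDA and TensorRT packages are present (a rough check via dpkg; exact package names vary between JetPack releases):
# Optional: list installed CUDA and TensorRT packages from JetPack
dpkg -l | grep -iE "cuda|tensorrt"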
# Install system build tools if needed
sudo apt update
sudo apt install cmake build-essential
# Build the project
mkdir build
cd build
cmake .. \
    -DTRT_PACKAGE_DIR=/usr \
    -DCMAKE_TOOLCHAIN_FILE=cmake/aarch64_linux_toolchain.cmake \
    -DEMBEDDED_TARGET=jetson-thor
make -j$(nproc)
Note: Both the toolchain file and embedded target are required for all Edge device builds.
This will build:
C++ engine builder applications
C++ runtime and examples
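If the build succeeds, the example binaries used in the next steps should appear under build/examples/llm/. A quick way to confirm, run from the repository root (exact contents may vary by release):
# Confirm the example binaries were built
ls build/examples/llm/
# Expect llm_build and llm_inference among the outputs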
Step 4: Download and Export a Model (on x86 host)#
Let’s use Qwen3-0.6B as a lightweight example:
# Quantize to FP8 (downloads model automatically)
tensorrt-edgellm-quantize-llm \
    --model_dir Qwen/Qwen3-0.6B \
    --output_dir ./quantized/qwen3-0.6b \
    --quantization fp8
# Export to ONNX
tensorrt-edgellm-export-llm \
    --model_dir ./quantized/qwen3-0.6b \
    --output_dir ./onnx_models/qwen3-0.6b
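The next step runs on the Thor device, so the exported ONNX folder needs to be copied there. One way is scp (a sketch only; user@thor and the destination path are placeholders for your device's login and the location of your clone):
# Copy the exported ONNX folder to the Thor device (placeholder host and path)
ssh user@thor "mkdir -p ~/TensorRT-Edge-LLM/onnx_models"
scp -r ./onnx_models/qwen3-0.6b user@thor:~/TensorRT-Edge-LLM/onnx_models/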
Step 5: Build TensorRT Engine (on Thor device)#
Transfer the entire ONNX folder to your Thor device (for example, with the scp sketch above), then:
./build/examples/llm/llm_build \
    --onnxDir ./onnx_models/qwen3-0.6b \
    --engineDir ./engines/qwen3-0.6b
Build time: ~2-5 minutes
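Once the build completes, you can list the engine directory to confirm that artifacts were written (the exact file names depend on the model and release):
# Check that the engine directory was populated
ls -lh ./engines/qwen3-0.6b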
Step 6: Run Inference#
Create an input file input.json with your prompts:
{
  "batch_size": 1,
  "temperature": 1.0,
  "top_p": 1.0,
  "top_k": 50,
  "max_generate_length": 128,
  "requests": [
    {
      "messages": [
        {
          "role": "user",
          "content": "What is the capital of the United States?"
        }
      ]
    }
  ]
}
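Before running inference, you can optionally verify that the file parses as valid JSON using Python's built-in json.tool module:
# Optional: validate and pretty-print input.json
python3 -m json.tool input.json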
Then run inference:
./build/examples/llm/llm_inference \
    --engineDir ./engines/qwen3-0.6b \
    --inputFile input.json \
    --outputFile output.json
Input Format: Our format closely matches the OpenAI API format. Please see examples/llm/INPUT_FORMAT.md for more details. Some example input files are available in tests/test_cases/ (for example, llm_basic.json).
Success! 🎉 Check output.json for model responses.
Next Steps#
Now that you’ve completed the quick start, continue to:
Installation: Detailed installation instructions for both Python export pipeline and C++ runtime
Supported Models: Learn about supported models and how to prepare them
Examples: Explore example applications and use cases
Customization Guide: Learn how to customize and extend the framework for your specific needs
For questions or issues, visit our TensorRT Edge-LLM GitHub repository.