Overview
Repository: github.com/NVIDIA/TensorRT-Edge-LLM
What is TensorRT Edge-LLM?
TensorRT Edge-LLM is NVIDIA's high-performance C++ inference runtime for Large Language Models (LLMs) and Vision-Language Models (VLMs) on embedded platforms. It enables efficient deployment of state-of-the-art language models on resource-constrained devices such as NVIDIA Jetson and NVIDIA DRIVE platforms.
Supported Platforms
Hardware Platforms
Officially Supported Platforms:
| Platform | Software Release | Link |
|---|---|---|
| NVIDIA Jetson Thor | JetPack 7.x | |
| NVIDIA DRIVE Thor | NVIDIA DriveOS 7.2 | |
| NVIDIA Jetson Orin | JetPack 7.2 | |
Note: The platforms listed above are officially supported and tested. Jetson Orin supports FP16, INT8, and INT4 model precisions. For exact build flags by platform and JetPack release, see the Installation Guide.
Compatible Platforms:
| Platform | Software Release |
|---|---|
| NVIDIA Jetson Orin | JetPack 6.2+ |
Note: JetPack 7.2 is the supported Jetson Orin path. JetPack 6.2+ remains compatible for FP16, INT8, and INT4 workflows.
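
The support tiers above can be captured as a small lookup table, for example to gate a deployment script. This is only a sketch: the entries are transcribed from the tables and notes above, while the names `PlatformSupport`, `SUPPORT_MATRIX`, `ORIN_PRECISIONS`, and `support_tiers` are hypothetical and not part of TensorRT Edge-LLM.

```python
from typing import NamedTuple


class PlatformSupport(NamedTuple):
    platform: str
    release: str
    tier: str  # "official" or "compatible", as labeled in the tables above


# Rows transcribed from the platform tables above (hypothetical structure).
SUPPORT_MATRIX = [
    PlatformSupport("NVIDIA Jetson Thor", "JetPack 7.x", "official"),
    PlatformSupport("NVIDIA DRIVE Thor", "NVIDIA DriveOS 7.2", "official"),
    PlatformSupport("NVIDIA Jetson Orin", "JetPack 7.2", "official"),
    PlatformSupport("NVIDIA Jetson Orin", "JetPack 6.2+", "compatible"),
]

# Precisions the notes above list for Jetson Orin.
ORIN_PRECISIONS = ("FP16", "INT8", "INT4")


def support_tiers(platform: str) -> dict[str, str]:
    """Map each listed software release for a platform to its support tier."""
    return {e.release: e.tier for e in SUPPORT_MATRIX if e.platform == platform}


if __name__ == "__main__":
    print(support_tiers("NVIDIA Jetson Orin"))
```

An unlisted platform yields an empty mapping, which a caller can treat as "not covered by this page" rather than as unsupported.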
Supported Model Families
TensorRT Edge-LLM supports Llama/Qwen/Nemotron language models, Qwen and InternVL vision-language models, Alpamayo 1, Phi-4-Multimodal, Qwen3-ASR/TTS, Nemotron-Omni, EAGLE3 draft models, and selected MoE checkpoints. For the complete support matrix, including Transformers class names, example checkpoints, precision requirements, and platform compatibility, see Supported Models.
Key Features
- High Performance: Optimized CUDA kernels and TensorRT integration for minimum latency
- Memory Efficient: Supports 4-bit quantization for a reduced memory footprint, with FP8 KV cache support for additional memory savings
- Production Ready: C++-only runtime with no Python dependencies, designed for deployment on edge devices
- Edge Optimized: Built specifically for NVIDIA Jetson and DRIVE platforms with platform-specific optimizations
- Rich Feature Set: Supports LoRA adapters, EAGLE3 speculative decoding, system prompt caching, vision-language models, and an experimental high-level Python API/server
- Complete Toolkit: End-to-end workflow from checkpoint export to C++ runtime, with engine builder and examples
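
To give a rough sense of what 4-bit weights and an FP8 KV cache save, the following back-of-the-envelope sketch estimates memory use. The model shape (8 billion parameters, 32 layers, 8 KV heads, head dimension 128, 4096-token context) is a made-up example, not a measurement of any supported model. The estimate ignores quantization scales, activations, and runtime overhead.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Approximate weight storage, ignoring quantization scales and zero-points."""
    return num_params * bits_per_weight // 8


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int, batch: int = 1) -> int:
    """Approximate KV cache size: one K and one V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch


if __name__ == "__main__":
    params = 8_000_000_000  # hypothetical 8B-parameter model
    print("FP16 weights:", weight_bytes(params, 16) / 1e9, "GB")
    print("INT4 weights:", weight_bytes(params, 4) / 1e9, "GB")

    # Hypothetical attention shape; FP16 uses 2 bytes per element, FP8 uses 1.
    shape = dict(layers=32, kv_heads=8, head_dim=128, seq_len=4096)
    print("FP16 KV cache:", kv_cache_bytes(**shape, bytes_per_elem=2) / 2**20, "MiB")
    print("FP8 KV cache:", kv_cache_bytes(**shape, bytes_per_elem=1) / 2**20, "MiB")
```

Under these assumptions, 4-bit weights take about a quarter of the FP16 weight footprint, and an FP8 KV cache takes half of the FP16 cache.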
Key Components
Code Location:
`tensorrt_edgellm/quantization/` (checkpoint quantization), `tensorrt_edgellm/` (ONNX export), `experimental/server/` (Python API/server), `cpp/` (runtime), `examples/` (C++ examples)
TensorRT Edge-LLM uses a three-stage pipeline:
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
HF_MODEL[HuggingFace Models<br>*including pre-quantized<br>checkpoints*]
PYTHON_EXPORT(Checkpoint-Based<br>Model Exporter)
ONNX_MODEL[ONNX<br>Model]
ENGINE_BUILDER(Engine Builder)
TRT_ENGINE[TensorRT<br>Engines]
CPP_RUNTIME(C++ Runtime)
SAMPLES(Examples)
APPLICATIONS(Applications)
HF_MODEL --> PYTHON_EXPORT
PYTHON_EXPORT --> ONNX_MODEL
ONNX_MODEL --> ENGINE_BUILDER
ENGINE_BUILDER --> TRT_ENGINE
TRT_ENGINE --> CPP_RUNTIME
CPP_RUNTIME --> SAMPLES
SAMPLES --> APPLICATIONS
classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
classDef itemNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
class HF_MODEL inputNode
class ONNX_MODEL,TRT_ENGINE itemNode
class PYTHON_EXPORT,ENGINE_BUILDER,CPP_RUNTIME nvNode
class APPLICATIONS darkNode
class SAMPLES nvNode
| Component | Description |
|---|---|
| Quantization Package | Creates quantized HuggingFace-style checkpoints for the checkpoint exporter. Usage, Design |
| Checkpoint Exporter | Reads HuggingFace checkpoints directly and exports ONNX artifacts. Learn More |
| Experimental Python API and Server | Provides a vLLM-style Python API and OpenAI-compatible server. Learn More |
| Engine Builder | C++-based application that compiles ONNX models into optimized TensorRT engines. Learn More |
| C++ Runtime | C++-based runtime that executes TensorRT engines with CUDA graphs, LoRA, and EAGLE support. Learn More |
| Examples | Reference implementations demonstrating LLM, multimodal, and utility use cases. See the Quick Start Guide and example guides in the User Guide. |
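
The three-stage pipeline described above can be modeled as a chain in which each stage consumes the artifact the previous one produced. The sketch below only illustrates that data flow; it does not call any TensorRT Edge-LLM API. The names `Stage`, `PIPELINE`, `is_well_chained`, and `artifact_trail` are hypothetical, and the final "inference results" label is an assumption standing in for whatever an application consumes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    consumes: str
    produces: str


# Stage and artifact names follow the diagram and component table above.
PIPELINE = (
    Stage("Checkpoint-Based Model Exporter", "HuggingFace checkpoint", "ONNX model"),
    Stage("Engine Builder", "ONNX model", "TensorRT engines"),
    Stage("C++ Runtime", "TensorRT engines", "inference results"),  # output label assumed
)


def is_well_chained(pipeline: tuple[Stage, ...]) -> bool:
    """True if every stage consumes exactly what the previous stage produces."""
    return all(a.produces == b.consumes for a, b in zip(pipeline, pipeline[1:]))


def artifact_trail(pipeline: tuple[Stage, ...]) -> list[str]:
    """List the artifacts in the order they appear along the pipeline."""
    return [pipeline[0].consumes] + [stage.produces for stage in pipeline]


if __name__ == "__main__":
    assert is_well_chained(PIPELINE)
    print(" -> ".join(artifact_trail(PIPELINE)))
```

Modeling the stages this way makes the hand-off points explicit: the ONNX model and the TensorRT engines are the only artifacts that cross a stage boundary.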
Next Steps
Ready to get started with TensorRT Edge-LLM? Follow these steps:
1. Installation Guide - Set up quantization and `tensorrt_edgellm` on your x86 host and build the C++ runtime on your edge device
2. Quick Start Guide - Run your first LLM inference in ~15 minutes with step-by-step instructions
3. Examples - Explore advanced workflows including VLM inference, speculative decoding, ASR, MoE, TTS, and VLA model inference
For questions or issues, visit our TensorRT Edge-LLM GitHub repository.