Overview#
Repository: github.com/NVIDIA/TensorRT-Edge-LLM
What is TensorRT Edge-LLM?#
TensorRT Edge-LLM is NVIDIA's high-performance C++ inference runtime for Large Language Models (LLMs) and Vision-Language Models (VLMs) on embedded platforms. It enables efficient deployment of state-of-the-art language models on resource-constrained devices such as NVIDIA Jetson and NVIDIA DRIVE platforms.
Supported Platforms#
Hardware Platforms#
**Officially Supported Platforms:**

| Platform | Software Release | Link |
|---|---|---|
| NVIDIA Jetson Thor | JetPack 7.1 | |
| NVIDIA DRIVE Thor | NVIDIA DriveOS 7 | |
Note: The platforms listed above are officially supported and tested. TensorRT Edge-LLM may run on other NVIDIA GPU platforms (for example, discrete GPUs or other Jetson devices), but those configurations are unsupported and intended for experimental use only.
**Compatible Platforms:**

| Platform | Software Release |
|---|---|
| NVIDIA Jetson Orin | JetPack 6.2.x |
Note: TensorRT Edge-LLM will officially support Jetson Orin in later JetPack releases. JetPack 6.2.x is compatible, but support on it is experimental.
Supported Model Families#
TensorRT Edge-LLM supports Llama/Qwen/Nemotron language models, Qwen and InternVL vision-language models, Phi-4-Multimodal, Qwen3-ASR/TTS, Nemotron-Omni, EAGLE3 draft models, and selected MoE checkpoints. For the complete support matrix, including Transformers class names, example checkpoints, precision requirements, and platform compatibility, see Supported Models.
Key Features#
- **High Performance**: Optimized CUDA kernels and TensorRT integration for minimal latency
- **Memory Efficient**: 4-bit quantization reduces the memory footprint, and FP8 KV cache support provides additional memory savings
- **Production Ready**: C++-only runtime with no Python dependencies, designed for deployment on edge devices
- **Edge Optimized**: Built specifically for NVIDIA Jetson and DRIVE platforms, with platform-specific optimizations
- **Rich Feature Set**: Supports LoRA adapters, EAGLE3 speculative decoding, system prompt caching, vision-language models, and an experimental high-level Python API/server
- **Complete Toolkit**: End-to-end workflow from checkpoint export to C++ runtime, with an engine builder and examples
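To build intuition for why 4-bit weights and an FP8 KV cache matter on memory-limited edge devices, the back-of-the-envelope arithmetic below estimates footprints for a hypothetical model. The shape used (8B parameters, 36 layers, 8 KV heads, head dim 128, 8K context) is an illustrative assumption, not a specific supported checkpoint:

```python
# Rough memory estimates for LLM inference on an edge device.
# All model dimensions below are illustrative assumptions.
GIB = 1024**3

def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Weight storage, ignoring small overheads such as quantization scales."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    """K and V tensors for one sequence across all layers."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

n_params = 8e9  # hypothetical 8B-parameter model
fp16_weights = weight_bytes(n_params, 16) / GIB  # ~14.9 GiB
int4_weights = weight_bytes(n_params, 4) / GIB   # ~3.7 GiB

fp16_kv = kv_cache_bytes(36, 8, 128, 8192, 2) / GIB  # ~1.13 GiB
fp8_kv  = kv_cache_bytes(36, 8, 128, 8192, 1) / GIB  # ~0.56 GiB

print(f"weights: FP16 {fp16_weights:.1f} GiB -> INT4 {int4_weights:.1f} GiB")
print(f"KV cache @8K ctx: FP16 {fp16_kv:.2f} GiB -> FP8 {fp8_kv:.2f} GiB")
```

Under these assumptions, 4-bit weights cut the dominant cost by roughly 4x, which is what makes 8B-class models practical within typical Jetson memory budgets.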
Key Components#
**Code Locations:**

- `experimental/quantization/` (checkpoint quantization)
- `experimental/llm_loader/` (ONNX export)
- `experimental/server/` (Python API/server)
- `cpp/` (runtime)
- `examples/` (C++ examples)
TensorRT Edge-LLM uses a three-stage pipeline:
```mermaid
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
    HF_MODEL[HuggingFace Models<br>*including pre-quantized<br>checkpoints*]
    PYTHON_EXPORT(Checkpoint-Based<br>Model Loader)
    ONNX_MODEL[ONNX<br>Model]
    ENGINE_BUILDER(Engine Builder)
    TRT_ENGINE[TensorRT<br>Engines]
    CPP_RUNTIME(C++ Runtime)
    SAMPLES(Examples)
    APPLICATIONS(Applications)
    HF_MODEL --> PYTHON_EXPORT
    PYTHON_EXPORT --> ONNX_MODEL
    ONNX_MODEL --> ENGINE_BUILDER
    ENGINE_BUILDER --> TRT_ENGINE
    TRT_ENGINE --> CPP_RUNTIME
    CPP_RUNTIME --> SAMPLES
    SAMPLES --> APPLICATIONS
    classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
    classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef itemNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    class HF_MODEL inputNode
    class ONNX_MODEL,TRT_ENGINE itemNode
    class PYTHON_EXPORT,ENGINE_BUILDER,CPP_RUNTIME nvNode
    class APPLICATIONS darkNode
    class SAMPLES nvNode
```
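The handoff between the three stages is purely artifact-based: each stage consumes the previous stage's output file and produces the next. The toy sketch below models only those stage boundaries in plain Python; the function names and file extensions are illustrative and are not the tool's actual CLI or API:

```python
# Toy model of the export -> build -> run handoff between pipeline stages.
# Names and paths here are illustrative assumptions; see the real
# installation and quick-start guides for the actual commands.

def export_onnx(hf_checkpoint: str) -> str:
    # Stage 1 (x86 host, Python): HuggingFace checkpoint -> ONNX graph
    return hf_checkpoint.rstrip("/") + "/model.onnx"

def build_engine(onnx_path: str) -> str:
    # Stage 2 (target device, C++ engine builder): ONNX -> TensorRT engine
    return onnx_path.replace(".onnx", ".engine")

def run_inference(engine_path: str, prompt: str) -> str:
    # Stage 3 (target device, C++ runtime): load engine, generate tokens
    return f"[{engine_path}] completion for: {prompt!r}"

engine = build_engine(export_onnx("checkpoints/my-8b-model"))
print(run_inference(engine, "Hello"))
```

The practical consequence of this design is that the Python tooling is only needed on the export host; the target device needs nothing beyond the ONNX artifact and the C++ binaries.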
| Component | Description |
|---|---|
| Experimental Quantization Package | Creates quantized HuggingFace-style checkpoints. |
| Checkpoint-Based Model Loader | **Recommended.** Reads HuggingFace checkpoints directly and exports ONNX artifacts. Learn More |
| Experimental Python API and Server | Provides a vLLM-style Python API and an OpenAI-compatible server. Learn More |
| Legacy Python Export Pipeline | **Deprecated.** FX-tracing compatibility pipeline for existing workflows. |
| Engine Builder | C++ application that compiles ONNX models into optimized TensorRT engines. Learn More |
| C++ Runtime | C++ runtime that executes TensorRT engines with CUDA graphs, LoRA, and EAGLE support. Learn More |
| Examples | Reference implementations demonstrating LLM, multimodal, and utility use cases. See the Quick Start Guide and the example guides in the User Guide. |
Next Steps#
Ready to get started with TensorRT Edge-LLM? Follow these steps:
1. **Installation Guide** - Set up the Python export pipeline on your x86 host and build the C++ runtime on your edge device.
2. **Quick Start Guide** - Run your first LLM inference in ~15 minutes with step-by-step instructions.
3. **Examples** - Explore advanced workflows, including VLM inference, speculative decoding, ASR, MoE, and TTS.
For questions or issues, visit our TensorRT Edge-LLM GitHub repository.