Overview#
Repository: github.com/NVIDIA/TensorRT-Edge-LLM
What is TensorRT Edge-LLM?#
TensorRT Edge-LLM is NVIDIA's high-performance C++ inference runtime for Large Language Models (LLMs) and Vision-Language Models (VLMs) on embedded platforms. It enables efficient deployment of state-of-the-art language models on resource-constrained devices such as NVIDIA Jetson and NVIDIA DRIVE platforms.
Supported Platforms#
Hardware Platforms#
**Officially Supported Platforms:**

| Platform | Software Release | Link |
|---|---|---|
| NVIDIA Jetson Thor | JetPack 7.1 | |
| NVIDIA DRIVE Thor | NVIDIA DriveOS 7 | |
Note: The platforms listed above are officially supported and tested. TensorRT Edge-LLM may run on other NVIDIA GPU platforms (for example, discrete GPUs or other Jetson devices), but those configurations are unsupported and intended for experimental use only.
**Compatible Platforms:**

| Platform | Software Release |
|---|---|
| NVIDIA Jetson Orin | JetPack 6.2.x |
Note: TensorRT Edge-LLM will officially support Jetson Orin in later JetPack releases. JetPack 6.2.x is compatible, but support on it is experimental.
Supported Model Families#
TensorRT Edge-LLM supports Llama/Qwen/Nemotron language models, Qwen and InternVL vision-language models, Phi-4-Multimodal, Qwen3-ASR/TTS, Nemotron-Omni, EAGLE3 draft models, and selected MoE checkpoints. For the complete support matrix, including Transformers class names, example checkpoints, precision requirements, and platform compatibility, see Supported Models.
Key Features#
- **High Performance**: Optimized CUDA kernels and TensorRT integration for minimal latency
- **Memory Efficient**: 4-bit quantization reduces the memory footprint, and FP8 KV cache support provides additional memory savings
- **Production Ready**: C++-only runtime with no Python dependencies, designed for deployment on edge devices
- **Edge Optimized**: Built specifically for NVIDIA Jetson and DRIVE platforms, with platform-specific optimizations
- **Rich Feature Set**: Supports LoRA adapters, EAGLE3 speculative decoding, system prompt caching, vision-language models, and an experimental high-level Python API/server
- **Complete Toolkit**: End-to-end workflow from checkpoint export to C++ runtime, with an engine builder and examples
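To build intuition for why 4-bit weights and an FP8 KV cache matter on memory-limited edge devices, the back-of-the-envelope arithmetic below estimates footprints for a hypothetical model. The shape used (8B parameters, 36 layers, 8 KV heads, head dim 128, 8K context) is an illustrative assumption, not a specific supported checkpoint:

```python
# Rough memory estimates for LLM inference on an edge device.
# All model dimensions below are illustrative assumptions.
GIB = 1024**3

def weight_bytes(n_params: float, bits_per_weight: int) -> float:
    """Weight storage, ignoring small overheads such as quantization scales."""
    return n_params * bits_per_weight / 8

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem):
    """K and V tensors for one sequence across all layers."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

n_params = 8e9  # hypothetical 8B-parameter model
fp16_weights = weight_bytes(n_params, 16) / GIB  # ~14.9 GiB
int4_weights = weight_bytes(n_params, 4) / GIB   # ~3.7 GiB

fp16_kv = kv_cache_bytes(36, 8, 128, 8192, 2) / GIB  # ~1.13 GiB
fp8_kv  = kv_cache_bytes(36, 8, 128, 8192, 1) / GIB  # ~0.56 GiB

print(f"weights: FP16 {fp16_weights:.1f} GiB -> INT4 {int4_weights:.1f} GiB")
print(f"KV cache @8K ctx: FP16 {fp16_kv:.2f} GiB -> FP8 {fp8_kv:.2f} GiB")
```

Under these assumptions, 4-bit weights cut the dominant cost by roughly 4x, which is what makes 8B-class models practical within typical Jetson memory budgets.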
Key Components#
**Code Locations:**

- `experimental/quantization/` (checkpoint quantization)
- `experimental/llm_loader/` (ONNX export)
- `experimental/server/` (Python API/server)
- `cpp/` (runtime)
- `examples/` (C++ examples)
TensorRT Edge-LLM uses a three-stage pipeline:
```mermaid
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph LR
    HF_MODEL[HuggingFace Models<br>*including pre-quantized<br>checkpoints*]
    PYTHON_EXPORT(Checkpoint-Based<br>Model Loader)
    ONNX_MODEL[ONNX<br>Model]
    ENGINE_BUILDER(Engine Builder)
    TRT_ENGINE[TensorRT<br>Engines]
    CPP_RUNTIME(C++ Runtime)
    SAMPLES(Examples)
    APPLICATIONS(Applications)
    HF_MODEL --> PYTHON_EXPORT
    PYTHON_EXPORT --> ONNX_MODEL
    ONNX_MODEL --> ENGINE_BUILDER
    ENGINE_BUILDER --> TRT_ENGINE
    TRT_ENGINE --> CPP_RUNTIME
    CPP_RUNTIME --> SAMPLES
    SAMPLES --> APPLICATIONS
    classDef greyNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
    classDef darkNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    classDef inputNode fill:#f5f5f5,stroke:#999,stroke-width:1px,color:#333
    classDef itemNode fill:#ffffff,stroke:#999,stroke-width:1px,color:#333
    class HF_MODEL inputNode
    class ONNX_MODEL,TRT_ENGINE itemNode
    class PYTHON_EXPORT,ENGINE_BUILDER,CPP_RUNTIME nvNode
    class APPLICATIONS darkNode
    class SAMPLES nvNode
```
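The handoff between the three stages is purely artifact-based: each stage consumes the previous stage's output file and produces the next. The toy sketch below models only those stage boundaries in plain Python; the function names and file extensions are illustrative and are not the tool's actual CLI or API:

```python
# Toy model of the export -> build -> run handoff between pipeline stages.
# Names and paths here are illustrative assumptions; see the real
# installation and quick-start guides for the actual commands.

def export_onnx(hf_checkpoint: str) -> str:
    # Stage 1 (x86 host, Python): HuggingFace checkpoint -> ONNX graph
    return hf_checkpoint.rstrip("/") + "/model.onnx"

def build_engine(onnx_path: str) -> str:
    # Stage 2 (target device, C++ engine builder): ONNX -> TensorRT engine
    return onnx_path.replace(".onnx", ".engine")

def run_inference(engine_path: str, prompt: str) -> str:
    # Stage 3 (target device, C++ runtime): load engine, generate tokens
    return f"[{engine_path}] completion for: {prompt!r}"

engine = build_engine(export_onnx("checkpoints/my-8b-model"))
print(run_inference(engine, "Hello"))
```

The practical consequence of this design is that the Python tooling is only needed on the export host; the target device needs nothing beyond the ONNX artifact and the C++ binaries.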
| Component | Description |
|---|---|
| Experimental Quantization Package | Creates quantized HuggingFace-style checkpoints. |
| Checkpoint-Based Model Loader | **Recommended.** Reads HuggingFace checkpoints directly and exports ONNX artifacts. Learn More |
| Experimental Python API and Server | Provides a vLLM-style Python API and an OpenAI-compatible server. Learn More |
| Legacy Python Export Pipeline | **Deprecated.** FX-tracing compatibility pipeline for existing workflows. |
| Engine Builder | C++ application that compiles ONNX models into optimized TensorRT engines. Learn More |
| C++ Runtime | C++ runtime that executes TensorRT engines with CUDA graphs, LoRA, and EAGLE support. Learn More |
| Examples | Reference implementations demonstrating LLM, multimodal, and utility use cases. See the Quick Start Guide and the example guides in the User Guide. |
Next Steps#
Ready to get started with TensorRT Edge-LLM? Follow these steps:
1. **Installation Guide** - Set up the Python export pipeline on your x86 host and build the C++ runtime on your edge device.
2. **Quick Start Guide** - Run your first LLM inference in ~15 minutes with step-by-step instructions.
3. **Examples** - Explore advanced workflows, including VLM inference, speculative decoding, ASR, MoE, and TTS.
For questions or issues, visit our TensorRT Edge-LLM GitHub repository.