TensorRT Edge-LLM Documentation#

Welcome to the TensorRT Edge-LLM documentation. This library provides optimized inference capabilities for large language models and vision-language models on edge devices.

Getting Started#

Get up and running with TensorRT Edge-LLM: a platform overview, key features, use cases, supported platforms, and complete installation instructions for the Python and C++ components.

Models#

Learn about supported model families and architectures.

Model Export & Engine Building#

Prepare your models for deployment. Learn how to export HuggingFace models to ONNX with quantization and compile them into optimized TensorRT engines.
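
For orientation, here is a minimal sketch of the generic TensorRT Python path from an ONNX file to a serialized engine. TensorRT Edge-LLM provides its own export and build tooling (covered in this section), so treat this as background only: the file names and the FP16 choice are placeholders, and the network-creation flag assumes a TensorRT 8.x-style API.

```python
# Minimal sketch: build a serialized TensorRT engine from an ONNX file.
# Paths and precision flags are placeholders; Edge-LLM's own build
# tooling handles this for you.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit-batch network, as required by the ONNX parser (TensorRT 8.x-style flag).
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("ONNX parse failed")

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)  # optional reduced precision

engine_bytes = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:  # placeholder path
    f.write(engine_bytes)
```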

Chat Template Configuration#

Learn how to create and customize chat templates that format conversational messages for your models.
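
As background, the sketch below shows the HuggingFace-style Jinja chat template mechanism, a common way of flattening role-tagged messages into a single prompt string. The model name and messages are placeholders; the exact template format Edge-LLM consumes is documented in this section.

```python
# Minimal sketch: apply a HuggingFace chat template to a conversation.
from transformers import AutoTokenizer

# Placeholder model; any tokenizer that ships a chat template works.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize TensorRT in one sentence."},
]
# Render the messages into the single prompt string the model expects.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```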

C++ Runtime#

Explore the C++ inference runtime, including its architecture, the standard runtime for text and multimodal inference, EAGLE speculative decoding, CUDA graphs, LoRA, and batch processing.

Examples#

Reference implementations demonstrating LLM, multimodal, and utility use cases.

Customization#

Learn how to customize and extend TensorRT Edge-LLM for your specific needs.

TensorRT Plugins#

Learn how TensorRT plugins are used with TensorRT Edge-LLM and how to customize them further.
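
As one illustration of the general mechanism, the sketch below loads a custom plugin shared library with `ctypes` so its plugin creators self-register before an engine is deserialized. The library and engine paths are placeholders; Edge-LLM's own plugin workflow is described in this section.

```python
# Minimal sketch: make plugin ops available before engine deserialization.
import ctypes
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
# Register TensorRT's built-in plugins.
trt.init_libnvinfer_plugins(logger, "")
# Load a custom plugin library so its creators self-register
# (libmy_plugins.so is a placeholder name).
ctypes.CDLL("libmy_plugins.so", mode=ctypes.RTLD_GLOBAL)

# Engines whose networks reference those plugin ops can now be
# deserialized and executed as usual.
runtime = trt.Runtime(logger)
with open("model.engine", "rb") as f:  # placeholder path
    engine = runtime.deserialize_cuda_engine(f.read())
```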

APIs#

API documentation for Python and C++ components.


Need help? Visit our GitHub repository for issues and discussions.
