TensorRT Edge-LLM Documentation#
Welcome to the TensorRT Edge-LLM documentation. This library provides optimized inference for large language models (LLMs) and vision-language models (VLMs) on edge devices.
Getting Started#
Get up and running with TensorRT Edge-LLM: a platform overview, key features, use cases, supported platforms, and complete installation instructions for the Python and C++ components.
Models#
Learn about supported model families and architectures.
Model Export & Engine Building#
Prepare your models for deployment: learn how to export HuggingFace models to ONNX with quantization and compile them into optimized TensorRT engines.
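The export itself is driven from Python tooling; the compile step maps onto TensorRT's standard C++ builder API. As a rough sketch of that step (plain TensorRT, not Edge-LLM's own build tooling; the model path, workspace size, and output path are placeholders):

```cpp
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <fstream>
#include <iostream>
#include <memory>

// Minimal logger required by the TensorRT builder and ONNX parser.
class Logger : public nvinfer1::ILogger {
    void log(Severity severity, const char* msg) noexcept override {
        if (severity <= Severity::kWARNING) std::cerr << msg << "\n";
    }
};

int main() {
    Logger logger;
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(
        nvinfer1::createInferBuilder(logger));
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(
        builder->createNetworkV2(0));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(
        nvonnxparser::createParser(*network, logger));

    // "model.onnx" is a placeholder for the exported, quantized model.
    if (!parser->parseFromFile(
            "model.onnx",
            static_cast<int>(nvinfer1::ILogger::Severity::kWARNING))) {
        return 1;
    }

    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(
        builder->createBuilderConfig());
    // 1 GiB of scratch space; tune this for the target edge device.
    config->setMemoryPoolLimit(nvinfer1::MemoryPoolType::kWORKSPACE, 1ULL << 30);

    // Compile and serialize the engine so the runtime can load it later.
    auto serialized = std::unique_ptr<nvinfer1::IHostMemory>(
        builder->buildSerializedNetwork(*network, *config));
    std::ofstream out("model.engine", std::ios::binary);
    out.write(static_cast<const char*>(serialized->data()), serialized->size());
    return 0;
}
```

Serializing the engine to disk decouples the expensive build step from deployment: at inference time the runtime only has to deserialize the blob.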
Model Export & Engine Building
Chat Template Configuration#
Learn how to create and customize chat templates that format conversational messages for your models.
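As a rough illustration of what a chat template does, the sketch below renders a conversation in the widely used ChatML convention; the actual template syntax and special tokens depend on the model, so treat this as an example rather than a drop-in configuration.

```cpp
#include <string>
#include <vector>

struct Message {
    std::string role;     // "system", "user", or "assistant"
    std::string content;
};

// Render a conversation in the ChatML convention, appending the generation
// prompt so the model continues in the assistant role.
std::string applyChatMlTemplate(const std::vector<Message>& messages) {
    std::string prompt;
    for (const auto& m : messages) {
        prompt += "<|im_start|>" + m.role + "\n" + m.content + "<|im_end|>\n";
    }
    prompt += "<|im_start|>assistant\n";
    return prompt;
}
```

A correct template reproduces the exact token sequence the model saw during instruction tuning, which is why a mismatched template degrades output quality.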
C++ Runtime#
Explore the C++ inference runtime and its capabilities, including runtime architecture, standard runtime for text and multimodal inference, EAGLE speculative decoding, CUDA graphs, LoRA, and batch processing.
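The runtime pages document Edge-LLM's own session API; underneath, execution ultimately comes down to the standard TensorRT pattern of deserializing a prebuilt engine and enqueuing work on a CUDA stream. A minimal sketch of that underlying pattern (plain TensorRT, not the Edge-LLM runtime API; the tensor names and engine path are placeholders):

```cpp
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <fstream>
#include <iterator>
#include <vector>

// One inference pass over a prebuilt engine. dInput/dOutput are device
// buffers already sized for the bound tensors; "input_ids" and "logits"
// are placeholder tensor names.
void runOnce(nvinfer1::ILogger& logger, void* dInput, void* dOutput) {
    std::ifstream in("model.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(in)),
                           std::istreambuf_iterator<char>());

    auto* runtime = nvinfer1::createInferRuntime(logger);
    auto* engine  = runtime->deserializeCudaEngine(blob.data(), blob.size());
    auto* context = engine->createExecutionContext();

    context->setTensorAddress("input_ids", dInput);
    context->setTensorAddress("logits", dOutput);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    context->enqueueV3(stream);       // asynchronous launch
    cudaStreamSynchronize(stream);    // block until the pass completes

    delete context;
    delete engine;
    delete runtime;
}
```

An LLM runtime wraps a pass like this in a decode loop with KV caching, sampling, and optionally speculative decoding, which is what the features listed above build on.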
Examples#
Reference implementations demonstrating LLM, multimodal, and utility use cases.
Customization#
Learn how to customize and extend TensorRT Edge-LLM for your specific needs.
TensorRT Plugins#
Learn how TensorRT plugins are used with TensorRT Edge-LLM and how to customize them further.
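Plugins supply custom layers that TensorRT cannot express natively, and they must be registered with the global plugin registry before any engine that uses them is built or deserialized. A minimal sketch of the registration step, assuming a hypothetical custom creator class (`MyGeluPluginCreator`):

```cpp
#include <NvInfer.h>
#include <NvInferPlugin.h>  // initLibNvInferPlugins

// A custom creator implemented against TensorRT's plugin interfaces can be
// registered statically at load time (class definition not shown; the name
// is hypothetical):
//
//     REGISTER_TENSORRT_PLUGIN(MyGeluPluginCreator);

// TensorRT's bundled plugins must also be registered once, before any
// engine that references them is built or deserialized.
void registerPlugins(nvinfer1::ILogger& logger) {
    initLibNvInferPlugins(&logger, /*libNamespace=*/"");
}
```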
APIs#
API documentation for Python and C++ components.
Need help? Visit our GitHub repository for issues and discussions.