TensorRT Edge-LLM

Table of Contents

Getting Started

  • Overview
  • Supported Models
  • Installation
  • Quick Start Guide
  • Limitations and Known Issues

Examples

  • VLM (Vision-Language Model) Inference
  • Speculative Decoding
  • Phi-4-Multimodal
  • ASR (Automatic Speech Recognition)
  • MoE (Mixture of Experts)
  • TTS (Text-to-Speech)

Features

  • LoRA (Low-Rank Adaptation)
  • Vocabulary Reduction
  • FP8 KV Cache
  • System Prompt Cache

Input & Chat Format

  • Input JSON Format
  • Chat Template Format

Software Design

  • Python Export Pipeline
  • Engine Builder
  • C++ Runtime Overview
  • LLM Inference Runtime
  • LLM Inference SpecDecode Runtime

Customization

  • Customization Guide
  • TensorRT Plugins Guide

Testing

  • Code Coverage with SonarQube

APIs

  • Python API Reference
  • C++ API Reference
    • Builder Module
      • Audio Builder
      • Builder Utils
      • LLM Builder
      • Visual Builder
    • Common Module
      • Binding Names
      • Check Macros
      • CUDA Macros
      • CUDA Utils
      • File Utils
      • Hash Utils
      • Input Limits
      • Logger
      • Math Utils
      • MMAP Reader
      • Safetensors Utils
      • String Utils
      • Tensor
      • TRT Utils
      • Version
    • Kernels Module
      • Apply Rope Write KV
      • Batch Evict Kernels
      • Causal Conv1d
      • Common
      • Context FMHA Runner
      • Conversion
      • Cute Dsl FMHA Runner
      • Decoder XQA Runner
      • Dequant
      • Dequantize
      • EAGLE Accept Kernels
      • EAGLE Util Kernels
      • Embedding Kernels
      • FMHA Params V2
      • Image Util Kernels
      • Initialize Cos Sin Cache
      • Int4 Groupwise GEMM
      • Kernel
      • Kernel Selector
      • KV Cache Utils Kernels
      • Marlin
      • Marlin Dtypes
      • Marlin Mma
      • Marlin Template
      • Moe Activation Kernels
      • Moe Align Sum Kernels
      • Moe Marlin
      • Moe Marlin Indices Kernels
      • Moe Topk Softmax Kernels
      • Selective State Update
      • Talker Mlp Kernels
      • Util Kernels
      • Vectorized Types
    • Multimodal Module
      • Audio Runner
      • Audio Utils
      • Code2 Wav Runner
      • Image Utils
      • Intern ViT Runner
      • Model Types
      • Multimodal Runner
      • Phi4mm ViT Runner
      • Qwen ViT Runner
    • Plugins Module
      • Attention Plugin
      • Causal Conv1d Plugin
      • Int4 Groupwise GEMM Plugin
      • Int4 Moe Plugin
      • Mamba Plugin
      • Plugin Utils
      • VIT Attention Plugin
    • Profiling Module
      • Layer Profiler
      • Metrics
      • Nvtx Wrapper
      • Timer
    • Runtime Module
      • Audio Utils
      • EAGLE Draft Engine Runner
      • Image Utils
      • Linear KV Cache
      • LLM Engine Runner
      • LLM Inference Runtime
      • LLM Inference Spec Decode Runtime
      • LLM Runtime Utils
      • Qwen3 Omni Tts Runtime
    • Sampler Module
      • Sampling
    • Tokenizer Module
      • Pre Tokenizer
      • Token Encoder
      • Tokenizer
      • Tokenizer Utils
      • Unicode Data

Quick Links

  • Releases
  • GitHub
  • Roadmap

Runtime Module

API documentation for the C++ runtime module.

  • Audio Utils
  • EAGLE Draft Engine Runner
  • Image Utils
  • Linear KV Cache
  • LLM Engine Runner
  • LLM Inference Runtime
  • LLM Inference Spec Decode Runtime
  • LLM Runtime Utils
  • Qwen3 Omni Tts Runtime



Copyright © 2025, NVIDIA.

Last updated on March 14, 2026.

This page was generated from TensorRT-Edge-LLM commit d71c009.