TensorRT Edge-LLM


C++ API Reference

This section provides documentation for the TensorRT Edge-LLM C++ API.

- Action Module
- Builder Module
- Common Module
  - Binding Names
  - Check Macros
  - CUDA Macros
  - CUDA Utils
  - File Utils
  - Hash Utils
  - Input Limits
  - Logger
  - Math Utils
  - MMAP Reader
  - Rope Utils
  - Safetensors Utils
  - String Utils
  - Tensor
  - TRT Utils
  - Utf8
  - Version
- Kernels Module
- Multimodal Module
- Plugins Module
- Profiling Module
- Runtime Module
- Sampler Module
  - Sampling
- Tokenizer Module


Copyright © 2025, Nvidia.

Last updated on July 22, 2026.

This page is generated by TensorRT-Edge-LLM commit f98267f.