Welcome to TensorRT LLM’s Documentation!
Getting Started
Deployment Guide
- LLM Examples
- Online Serving Examples
- Curl Chat Client
- Curl Chat Client for Multimodal
- Curl Completion Client
- DeepSeek R1 Reasoning Parser
- GenAI-Perf Client
- GenAI-Perf Client for Multimodal
- OpenAI Chat Client
- OpenAI Chat Client for Multimodal
- OpenAI Completion Client
- OpenAI Completion Client for LoRA
- OpenAI Completion Client with JSON Schema
- Dynamo K8s Example
- Model Recipes
- Quick Start Recipe for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for Llama3.3 70B on TensorRT LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for Llama4 Scout 17B on TensorRT LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for GPT-OSS on TensorRT LLM - Blackwell Hardware
- Quick Start Recipe for Qwen3 Next on TensorRT LLM - Blackwell & Hopper Hardware
Models
CLI Reference
API Reference
- LLM API Introduction
- API Reference
- LLM
- MultimodalEncoder
- CompletionOutput
- RequestOutput
- GuidedDecodingParams
- SamplingParams
- DisaggregatedParams
- KvCacheConfig
- KvCacheRetentionConfig
- CudaGraphConfig
- MoeConfig
- LookaheadDecodingConfig
- MedusaDecodingConfig
- EagleDecodingConfig
- MTPDecodingConfig
- SchedulerConfig
- CapacitySchedulerPolicy
- BuildConfig
- QuantConfig
- QuantAlgo
- CalibConfig
- BuildCacheConfig
- RequestError
- MpiCommSession
- ExtendedRuntimePerfKnobConfig
- BatchingType
- ContextChunkingPolicy
- DynamicBatchConfig
- CacheTransceiverConfig
- NGramDecodingConfig
- UserProvidedDecodingConfig
- TorchCompileConfig
- DraftTargetDecodingConfig
- LlmArgs
- TorchLlmArgs
- TrtLlmArgs
- AutoDecodingConfig
- AttentionDpConfig
- LoRARequest
- SaveHiddenStatesDecodingConfig
- RocketSparseAttentionConfig
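For orientation, here is a minimal sketch of how a few of the classes listed above (LLM, SamplingParams, RequestOutput) fit together. It assumes a working `tensorrt_llm` installation and uses an illustrative model name; see the LLM API Introduction for authoritative usage.

```python
from tensorrt_llm import LLM, SamplingParams

# Sketch only: the model identifier and parameter values are illustrative.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# generate() returns one RequestOutput per prompt; each holds generated completions.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```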
Features
- Feature Combination Matrix
- Multi-Head, Multi-Query, and Group-Query Attention
- Disaggregated Serving (Beta)
- KV Cache System
- Long Sequences
- LoRA (Low-Rank Adaptation)
- Multimodal Support in TensorRT LLM
- Overlap Scheduler
- Paged Attention, IFB, and Request Scheduling
- Parallelism in TensorRT LLM
- Quantization
- Sampling
- Speculative Decoding
- Checkpoint Loading
- AutoDeploy (Prototype)
- Ray Orchestrator (Prototype)
Developer Guide
Blogs
- ADP Balance Strategy
- Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT LLM)
- Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly
- Inference Time Compute Implementation in TensorRT LLM
- Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)
- Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
- DeepSeek R1 MTP Implementation and Optimization
- Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
- Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)
- Disaggregated Serving in TensorRT LLM
- How to launch Llama4 Maverick + Eagle3 TensorRT LLM server
- N-Gram Speculative Decoding in TensorRT LLM
- Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)
- Running a High Performance GPT-OSS-120B Inference Server with TensorRT LLM
- How to get best performance on DeepSeek-R1 in TensorRT LLM
- H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM
- New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
- H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token
Quick Links