Welcome to TensorRT LLM’s Documentation!
Getting Started
Deployment Guide
- LLM Examples
- Online Serving Examples
- Curl Chat Client
- Curl Chat Client for Multimodal
- Curl Completion Client
- DeepSeek R1 Reasoning Parser
- GenAI-Perf Client
- GenAI-Perf Client for Multimodal
- OpenAI Chat Client
- OpenAI Chat Client for Multimodal
- OpenAI Completion Client
- OpenAI Completion Client for LoRA
- OpenAI Completion Client with JSON Schema
- Dynamo K8s Example
- Model Recipes
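The Online Serving Examples above target an OpenAI-compatible endpoint such as the one exposed by `trtllm-serve`. A minimal chat-request sketch, assuming a server is already running at `http://localhost:8000` and that the model ID used below is a placeholder for whatever the server is serving:

```python
# Minimal chat request against an OpenAI-compatible TensorRT LLM endpoint.
# Assumes a server started elsewhere (e.g. with `trtllm-serve`) is listening
# on localhost:8000; the model ID below is only a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Where is New York?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```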
Models
CLI Reference
API Reference
- LLM API Introduction
- API Reference
- LLM
- MultimodalEncoder
- CompletionOutput
- RequestOutput
- GuidedDecodingParams
- SamplingParams
- DisaggregatedParams
- KvCacheConfig
- KvCacheRetentionConfig
- CudaGraphConfig
- MoeConfig
- LookaheadDecodingConfig
- MedusaDecodingConfig
- EagleDecodingConfig
- MTPDecodingConfig
- SchedulerConfig
- CapacitySchedulerPolicy
- BuildConfig
- QuantConfig
- QuantAlgo
- CalibConfig
- BuildCacheConfig
- RequestError
- MpiCommSession
- ExtendedRuntimePerfKnobConfig
- BatchingType
- ContextChunkingPolicy
- DynamicBatchConfig
- CacheTransceiverConfig
- NGramDecodingConfig
- UserProvidedDecodingConfig
- TorchCompileConfig
- DraftTargetDecodingConfig
- LlmArgs
- TorchLlmArgs
- TrtLlmArgs
- AutoDecodingConfig
- AttentionDpConfig
- LoRARequest
- SaveHiddenStatesDecodingConfig
- RocketSparseAttentionConfig
- DeepSeekSparseAttentionConfig
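Several of the classes listed above (`LLM`, `SamplingParams`, `RequestOutput`) form the core of the LLM API. A minimal offline-generation sketch, assuming the placeholder model ID below can be fetched from the Hugging Face Hub:

```python
# Minimal offline generation with the LLM API; the model ID is a placeholder.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() returns one RequestOutput per prompt.
for output in llm.generate(["Hello, my name is"], sampling_params):
    print(output.outputs[0].text)
```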
Features
- Feature Combination Matrix
- Multi-Head, Multi-Query, and Group-Query Attention
- Disaggregated Serving
- KV Cache System
- Long Sequences
- LoRA (Low-Rank Adaptation)
- Multimodal Support in TensorRT LLM
- Overlap Scheduler
- Paged Attention, IFB, and Request Scheduling
- Parallelism in TensorRT LLM
- Quantization
- Sampling
- Additional Outputs
- Speculative Decoding
- Checkpoint Loading
- AutoDeploy (Prototype)
- Ray Orchestrator (Prototype)
- Torch Compile & Piecewise CUDA Graph
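Many of the features above are configured through the classes listed in the API Reference. As one hedged example, a sketch of tuning the KV Cache System via `KvCacheConfig`; the import path and field names here are assumptions for illustration, not a definitive configuration:

```python
# Sketch: constraining KV-cache memory and enabling block reuse via KvCacheConfig.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig  # assumed import path

kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.8,  # assumed field: fraction of free GPU memory given to the KV cache
    enable_block_reuse=True,       # assumed field: reuse cached blocks across requests sharing a prefix
)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", kv_cache_config=kv_cache_config)
```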
Developer Guide
Blogs
- ADP Balance Strategy
- Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT LLM)
- Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly
- Inference Time Compute Implementation in TensorRT LLM
- Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)
- Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
- DeepSeek R1 MTP Implementation and Optimization
- Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
- Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)
- Disaggregated Serving in TensorRT LLM
- How to launch Llama4 Maverick + Eagle3 TensorRT LLM server
- N-Gram Speculative Decoding in TensorRT LLM
- Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)
- Running a High Performance GPT-OSS-120B Inference Server with TensorRT LLM
- How to get best performance on DeepSeek-R1 in TensorRT LLM
- H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM
- New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
- H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token
Quick Links