Welcome to TensorRT LLM’s Documentation!

Getting Started

Models

CLI Reference

API Reference

Features

Feature Combination Matrix
Multi-Head, Multi-Query, and Group-Query Attention
- Attention Backends
- Implement a New Attention Backend
- The Features of the TrtllmAttention Backend
Disaggregated Serving
- Motivation
- KV Cache Exchange
- Usage
- Environment Variables
- Troubleshooting and FAQ
Embeddings (Encoder-Only Models)
- Quick start
- Request fields
- Dynamic batching
- Error handling
- Output semantics and scope
- Scaling out across GPUs
- Relationship to llm.encode()
KV Cache System
- The Basics
- Reuse Across Requests
- Limited Attention Window Size
- MQA / GQA
- Controlling KV Cache Behavior
Long Sequences
- Chunked Context
- Chunked attention
- Sliding Window Attention
LoRA (Low-Rank Adaptation)
- Table of Contents
- Background
- Basic Usage
- Advanced Usage
- TRTLLM serve with LoRA
- TRTLLM bench with LoRA
Multimodal Support in TensorRT LLM
- Background
- Optimizations
- Model Support Matrix
- Optional dependencies
- Examples
Overlap Scheduler
- How It Works
- Tradeoff
- Usage
- References
Paged Attention, IFB, and Request Scheduling
- In-flight Batching
- Chunked Context (a.k.a. Chunked Prefill)
- KV Cache
- The schedulers
- Revisiting Paged Context Attention and Context Chunking
Parallelism in TensorRT LLM
- Overview of Parallelism Strategies
- Module-level Parallelism Guide
- Wide Expert Parallelism (Wide-EP)
Quantization
- Quantization in TensorRT LLM
- Usage
- Model Support Matrix
- Hardware Support Matrix
- Quick Links
Sampling
- General usage
- Beam search
- Logits processor
Additional Outputs
- Options
Post-Processing Hook
- Enabling the hook
- The hook interface
- Usage examples
- Per-request state
- Supported endpoints and limitations
- Tests
Guided Decoding
- Online API: trtllm-serve
- Offline API: LLM API
Speculative Decoding
- Quick Start
- Suffix Automaton (SA) Enhancement
- Usage with trtllm-bench and trtllm-serve
Checkpoint Loading
- Table of Contents
- Overview
- Core Components
- Built-in Checkpoint Formats
- Using Checkpoint Loaders
- Creating Custom Checkpoint Loaders
ModelExpress (MX) Checkpoint Loading
- Current Support Scope
- Installation
- Deploy the MX Service
- Configure TensorRT LLM
- Configuration
- Notes and Limitations
AutoDeploy (Beta)
- Seamless Model Deployment from PyTorch to TensorRT LLM
- Key Features
- Get Started
- Support Matrix
- API Reference
- Advanced Usage
- Roadmap
AutoDeploy Transforms
- Core Transform APIs
- Factory Stage
- Export Stage
- Post-Export Stage
- Pattern Matching Stage
- Sharding Stage
- Weight Loading Stage
- Post-Load Fusion Stage
- Cache Initialization Stage
- Visualization Stage
- Compilation Stage
- Additional Registered Transforms
Ray Orchestrator (Prototype)
- Motivation
- Basic Usage
- Features
- Roadmap
- Architecture
Torch Compile & Piecewise CUDA Graph
- Table of Contents
- Usage
- Tips for Piecewise CUDA Graph
- Known Issue
- Development Guide
Helix Parallelism
- How Helix Works
- When to Use Helix
- Supported Models
- Configuration
- Testing Helix with TensorRT LLM
KV Cache Connector
- Use Cases
- Architecture
- Example Implementation
Sparse Attention
- Overview
- RocketKV
- DeepSeek Sparse Attention (DSA)
- Skip Softmax Attention
- Algorithm Comparison
- Further Reading

Developer Guide

Blogs

Quick Links

Migration

Migration Guide: TensorRT Backend Removed

Indices and tables