Welcome to TensorRT-LLM’s Documentation!
Getting Started
- Overview
- Quick Start Guide
- Key Features
- PyTorch Backend
- Release Notes
- TensorRT-LLM Release 0.21.0
- TensorRT-LLM Release 0.20.0
- TensorRT-LLM Release 0.19.0
- TensorRT-LLM Release 0.18.2
- TensorRT-LLM Release 0.18.1
- TensorRT-LLM Release 0.18.0
- TensorRT-LLM Release 0.17.0
- TensorRT-LLM Release 0.16.0
- TensorRT-LLM Release 0.15.0
- TensorRT-LLM Release 0.14.0
- TensorRT-LLM Release 0.13.0
- TensorRT-LLM Release 0.12.0
- TensorRT-LLM Release 0.11.0
- TensorRT-LLM Release 0.10.0
- TensorRT-LLM Release 0.9.0
- TensorRT-LLM Release 0.8.0
- TensorRT-LLM Release 0.7.1
Installation
Deployment Guide
- Quick Start Recipe for Llama4 Scout 17B on TensorRT-LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for DeepSeek R1 on TensorRT-LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for Llama3.3 70B on TensorRT-LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for GPT-OSS on TensorRT-LLM - Blackwell Hardware
Command-Line Reference
Architecture
Advanced
- Multi-Head, Multi-Query, and Group-Query Attention
- C++ GPT Runtime
- Executor API
- Graph Rewriting Module
- Run gpt-2b + LoRA using Executor / cpp runtime
- Expert Parallelism in TensorRT-LLM
- KV Cache Management: Pools, Blocks, and Events
- KV Cache Reuse
- Speculative Sampling
- Disaggregated-Service (Prototype)
Reference
Blogs
- H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
- H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
- Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
- Speed up inference with SOTA quantization techniques in TRT-LLM
- New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
- ADP Balance Strategy
- Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
- DeepSeek R1 MTP Implementation and Optimization
- Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
- Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP)
- Disaggregated Serving in TensorRT-LLM
- How to launch Llama4 Maverick + Eagle3 TensorRT-LLM server
- N-Gram Speculative Decoding in TensorRT-LLM
- Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization)
- Running a High Performance GPT-OSS-120B Inference Server with TensorRT-LLM