Welcome to TensorRT-LLM’s Documentation!
Getting Started
- Overview
- Quick Start Guide
- Key Features
- PyTorch Backend
- Release Notes
  - TensorRT-LLM Release 0.19.0
  - TensorRT-LLM Release 0.18.2
  - TensorRT-LLM Release 0.18.1
  - TensorRT-LLM Release 0.18.0
  - TensorRT-LLM Release 0.17.0
  - TensorRT-LLM Release 0.16.0
  - TensorRT-LLM Release 0.15.0
  - TensorRT-LLM Release 0.14.0
  - TensorRT-LLM Release 0.13.0
  - TensorRT-LLM Release 0.12.0
  - TensorRT-LLM Release 0.11.0
  - TensorRT-LLM Release 0.10.0
  - TensorRT-LLM Release 0.9.0
  - TensorRT-LLM Release 0.8.0
  - TensorRT-LLM Release 0.7.1
Installation
Architecture
Advanced
- Multi-Head, Multi-Query, and Group-Query Attention
- C++ GPT Runtime
- Executor API
- Graph Rewriting Module
- Run gpt-2b + LoRA using Executor / cpp runtime
- Expert Parallelism in TensorRT-LLM
- KV Cache Management: Pools, Blocks, and Events
- KV cache reuse
- Speculative Sampling
- Disaggregated-Service (experimental)