Welcome to TensorRT-LLM’s Documentation!
Getting Started
- Overview
- Quick Start Guide
- Key Features
- PyTorch Backend
- Release Notes
- TensorRT-LLM Release 0.21.0
- TensorRT-LLM Release 0.20.0
- TensorRT-LLM Release 0.19.0
- TensorRT-LLM Release 0.18.2
- TensorRT-LLM Release 0.18.1
- TensorRT-LLM Release 0.18.0
- TensorRT-LLM Release 0.17.0
- TensorRT-LLM Release 0.16.0
- TensorRT-LLM Release 0.15.0
- TensorRT-LLM Release 0.14.0
- TensorRT-LLM Release 0.13.0
- TensorRT-LLM Release 0.12.0
- TensorRT-LLM Release 0.11.0
- TensorRT-LLM Release 0.10.0
- TensorRT-LLM Release 0.9.0
- TensorRT-LLM Release 0.8.0
- TensorRT-LLM Release 0.7.1
Installation
Deployment Guide
- Quick Start Recipe for Llama4 Scout 17B on TensorRT-LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for DeepSeek R1 on TensorRT-LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for Llama3.3 70B on TensorRT-LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for GPT-OSS on TensorRT-LLM - Blackwell Hardware
Command-Line Reference
Architecture
Advanced
- Multi-Head, Multi-Query, and Group-Query Attention
- C++ GPT Runtime
- Executor API
- Graph Rewriting Module
- Run gpt-2b + LoRA using Executor / cpp runtime
- Expert Parallelism in TensorRT-LLM
- KV Cache Management: Pools, Blocks, and Events
- KV Cache Reuse
- Speculative Sampling
- Disaggregated-Service (Prototype)
Reference
Blogs
- H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
- H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
- Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
- Speed up inference with SOTA quantization techniques in TRT-LLM
- New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
- ADP Balance Strategy
- Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
- DeepSeek R1 MTP Implementation and Optimization
- Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
- Scaling Expert Parallelism in TensorRT-LLM (Part 1: Design and Implementation of Large-scale EP)
- Disaggregated Serving in TensorRT-LLM
- How to launch Llama4 Maverick + Eagle3 TensorRT-LLM server
- N-Gram Speculative Decoding in TensorRT-LLM
- Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization)
- Running a High Performance GPT-OSS-120B Inference Server with TensorRT-LLM