Model Recipes

Quick Start Recipe for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware
Quick Start Recipe for Llama3.3 70B on TensorRT LLM - Blackwell & Hopper Hardware
Quick Start Recipe for Llama4 Scout 17B on TensorRT LLM - Blackwell & Hopper Hardware
Quick Start Recipe for GPT-OSS on TensorRT LLM - Blackwell Hardware
Quick Start Recipe for Qwen3 Next on TensorRT LLM - Blackwell & Hopper Hardware

Copyright © 2025, NVIDIA.

Last updated on October 19, 2025.

This page is generated by TensorRT-LLM commit 796891b.