# LLM Examples

## Basics
- Generate text
- Generate text asynchronously
- Generate text in streaming
- Distributed LLM Generation

## Customization
- Generate text with guided decoding
- Control generated text using logits processor
- Generate text with multiple LoRA adapters
- Speculative Decoding
- KV Cache Connector
- KV Cache Offloading
- Runtime Configuration Examples
- Sampling Techniques Showcase

## Slurm
- Run LLM-API with pytorch backend on Slurm
- Run trtllm-bench with pytorch backend on Slurm
- Run trtllm-serve with pytorch backend on Slurm