TensorRT-LLM

Table of Contents

Getting Started

  • Overview
  • Quick Start Guide
  • Installation
    • Pre-built release container images on NGC
    • Installing on Linux via pip
    • Building from Source Code on Linux

Deployment Guide

  • LLM Examples
    • Generate text
    • Generate text asynchronously
    • Generate text in streaming mode
    • Distributed LLM Generation
    • Generate text with guided decoding
    • Control generated text using logits processor
    • Generate text with multiple LoRA adapters
    • Speculative Decoding
    • KV Cache Connector
    • Runtime Configuration Examples
    • Sampling Techniques Showcase
    • Run LLM-API with PyTorch backend on Slurm
    • Run trtllm-bench with PyTorch backend on Slurm
    • Run trtllm-serve with PyTorch backend on Slurm
  • Online Serving Examples
    • Curl Chat Client
    • Curl Chat Client for Multimodal
    • Curl Completion Client
    • DeepSeek R1 Reasoning Parser
    • GenAI-Perf Client
    • GenAI-Perf Client for Multimodal
    • OpenAI Chat Client
    • OpenAI Chat Client for Multimodal
    • OpenAI Completion Client
    • OpenAI Completion Client for LoRA
    • OpenAI Completion Client with JSON Schema
  • Dynamo K8s Example
  • Model Recipes
    • Quick Start Recipe for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware
    • Quick Start Recipe for Llama3.3 70B on TensorRT LLM - Blackwell & Hopper Hardware
    • Quick Start Recipe for Llama4 Scout 17B on TensorRT LLM - Blackwell & Hopper Hardware
    • Quick Start Recipe for GPT-OSS on TensorRT LLM - Blackwell Hardware

Models

  • Supported Models
  • Adding a New Model

CLI Reference

  • trtllm-bench
  • trtllm-eval
  • trtllm-serve
    • trtllm-serve
    • Run benchmarking with trtllm-serve

API Reference

  • LLM API Introduction
  • API Reference

Features

  • Feature Combination Matrix
  • Multi-Head, Multi-Query, and Group-Query Attention
  • Disaggregated Serving (Beta)
  • KV Cache System
  • Long Sequences
  • LoRA (Low-Rank Adaptation)
  • Multimodal Support in TensorRT LLM
  • Overlap Scheduler
  • Paged Attention, IFB, and Request Scheduling
  • Parallelism in TensorRT LLM
  • Quantization
  • Sampling
  • Speculative Decoding
  • Checkpoint Loading
  • AutoDeploy (Prototype)

Developer Guide

  • Architecture Overview
  • Performance Analysis
  • TensorRT LLM Benchmarking
  • Continuous Integration Overview
  • Using Dev Containers

Blogs

  • ADP Balance Strategy
  • Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT LLM)
  • Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
  • DeepSeek R1 MTP Implementation and Optimization
  • Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
  • Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)
  • Disaggregated Serving in TensorRT LLM
  • How to launch Llama4 Maverick + Eagle3 TensorRT LLM server
  • N-Gram Speculative Decoding in TensorRT LLM
  • Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)
  • Running a High Performance GPT-OSS-120B Inference Server with TensorRT LLM
  • How to get best performance on DeepSeek-R1 in TensorRT LLM
  • H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM
  • New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
  • H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token

Quick Links

  • Releases
  • GitHub Code
  • Roadmap

Use TensorRT Engine

  • LLM API with TensorRT Engine
  • Model Recipes

Model Recipes

  • Quick Start Recipe for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware
  • Quick Start Recipe for Llama3.3 70B on TensorRT LLM - Blackwell & Hopper Hardware
  • Quick Start Recipe for Llama4 Scout 17B on TensorRT LLM - Blackwell & Hopper Hardware
  • Quick Start Recipe for GPT-OSS on TensorRT LLM - Blackwell Hardware


Copyright © 2025, NVIDIA.

Last updated on September 09, 2025.

This page was generated from TensorRT-LLM commit 62b564a.