Dynamo K8s Example

  1. Install Dynamo Cloud

Please follow this guide to install Dynamo Cloud for your Kubernetes cluster.

  2. Deploy the TRT-LLM Deployment

Dynamo uses custom resource definitions (CRDs) to manage the lifecycle of deployments. You can use the DynamoDeploymentGraph YAML files to create aggregated and disaggregated TRT-LLM deployments.
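
The exact structure of these manifests is defined by the Dynamo CRDs rather than by this page. As a minimal sketch only (the apiVersion, kind, and field names below are assumptions inferred from the text, not confirmed here), an aggregated TRT-LLM deployment might be described like this:

    # Hypothetical sketch of a Dynamo custom resource for a TRT-LLM deployment.
    # The apiVersion, kind, and field names are assumptions; consult the Dynamo
    # documentation for the actual DynamoDeploymentGraph schema.
    apiVersion: nvidia.com/v1alpha1
    kind: DynamoDeploymentGraph
    metadata:
      name: trtllm-aggregated
    spec:
      services:
        Frontend:
          replicas: 1
        TRTLLMWorker:
          replicas: 1
          resources:
            limits:
              gpu: "1"

Once written, such a manifest is applied like any other Kubernetes custom resource, for example with kubectl apply -f trtllm-aggregated.yaml, after which the Dynamo operator reconciles the described deployment.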

Please see Deploying Dynamo Inference Graphs to Kubernetes using the Dynamo Cloud Platform for more details.
