Welcome to TensorRT LLM’s Documentation!
Getting Started
Deployment Guide
- LLM Examples
- Online Serving Examples
- Curl Chat Client
- Curl Chat Client for Multimodal
- Curl Completion Client
- DeepSeek R1 Reasoning Parser
- GenAI-Perf Client
- GenAI-Perf Client for Multimodal
- OpenAI Chat Client
- OpenAI Chat Client for Multimodal
- OpenAI Completion Client
- OpenAI Completion Client for LoRA
- OpenAI Completion Client with JSON Schema
- Dynamo K8s Example
- Model Recipes
- Quick Start Recipe for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for Llama 3.3 70B on TensorRT LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for Llama 4 Scout 17B on TensorRT LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for GPT-OSS on TensorRT LLM - Blackwell Hardware
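
The online serving examples listed above all target the OpenAI-compatible endpoint that `trtllm-serve` exposes. For orientation, a chat request might look like the minimal sketch below; it assumes a server already running locally, and the model id, port, and API key are placeholders, not fixed values.

```python
# Minimal sketch: chat request against a local trtllm-serve endpoint.
# Assumes `pip install openai` and a server started elsewhere with
# `trtllm-serve <model>`; model id and port below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # OpenAI-compatible endpoint
    api_key="dummy",  # the local server does not validate the key
)

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
    messages=[{"role": "user", "content": "Where is New York?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

The curl clients in the list issue the same request as raw HTTP POSTs to `/v1/chat/completions`; the Python client above is just a typed wrapper over that endpoint.
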
Models
CLI Reference
API Reference
- LLM API Introduction
- API Reference
LLM
MultimodalEncoder
CompletionOutput
RequestOutput
GuidedDecodingParams
SamplingParams
DisaggregatedParams
KvCacheConfig
KvCacheRetentionConfig
CudaGraphConfig
MoeConfig
LookaheadDecodingConfig
MedusaDecodingConfig
EagleDecodingConfig
MTPDecodingConfig
SchedulerConfig
CapacitySchedulerPolicy
BuildConfig
QuantConfig
QuantAlgo
CalibConfig
BuildCacheConfig
RequestError
MpiCommSession
ExtendedRuntimePerfKnobConfig
BatchingType
ContextChunkingPolicy
DynamicBatchConfig
CacheTransceiverConfig
NGramDecodingConfig
UserProvidedDecodingConfig
TorchCompileConfig
DraftTargetDecodingConfig
LlmArgs
TorchLlmArgs
TrtLlmArgs
AutoDecodingConfig
AttentionDpConfig
LoRARequest
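
The classes listed above are the building blocks of the LLM API. As a quick orientation before the full reference pages, an offline generation loop might look like the following sketch; the model id is a placeholder, and defaults may vary between releases.

```python
# Minimal sketch of the LLM API: offline generation with sampling params.
# The model id is a placeholder; a supported Hugging Face id or a local
# checkpoint path is expected here.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for output in llm.generate(["What is the capital of France?"], params):
    # Each result is a RequestOutput; generated text lives in .outputs.
    print(output.outputs[0].text)
```

Most of the remaining classes (KvCacheConfig, QuantConfig, the various decoding configs) are optional knobs passed when constructing the LLM; the reference pages below document each one.
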
Features
- Feature Combination Matrix
- Multi-Head, Multi-Query, and Group-Query Attention
- Disaggregated Serving (Beta)
- KV Cache System
- Long Sequences
- LoRA (Low-Rank Adaptation)
- Multimodal Support in TensorRT LLM
- Overlap Scheduler
- Paged Attention, In-Flight Batching (IFB), and Request Scheduling
- Parallelism in TensorRT LLM
- Quantization
- Sampling
- Speculative Decoding
- Checkpoint Loading
- AutoDeploy (Prototype)
Developer Guide
Blogs
- ADP Balance Strategy
- Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT LLM)
- Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
- DeepSeek R1 MTP Implementation and Optimization
- Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
- Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)
- Disaggregated Serving in TensorRT LLM
- How to launch a Llama 4 Maverick + Eagle3 TensorRT LLM server
- N-Gram Speculative Decoding in TensorRT LLM
- Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)
- Running a High Performance GPT-OSS-120B Inference Server with TensorRT LLM
- How to get the best performance on DeepSeek-R1 in TensorRT LLM
- H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM
- New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
- H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token
Quick Links