Welcome to TensorRT LLM's Documentation!
Getting Started
Deployment Guide
- LLM Examples
 - Online Serving Examples
  - Curl Chat Client
  - Curl Chat Client for Multimodal
  - Curl Completion Client
  - DeepSeek R1 Reasoning Parser
  - GenAI-Perf Client
  - GenAI-Perf Client for Multimodal
  - OpenAI Chat Client
  - OpenAI Chat Client for Multimodal
  - OpenAI Completion Client
  - OpenAI Completion Client for LoRA
  - OpenAI Completion Client with JSON Schema
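For orientation, the snippet below is a minimal, hedged sketch of what an OpenAI-style chat client from the examples above typically looks like. The base URL, API key, and model name are placeholders: they assume a TensorRT LLM OpenAI-compatible server is already running locally, and should be replaced with whatever your deployment exposes.

```python
# Hedged sketch of an OpenAI-style chat client, in the spirit of the
# serving examples listed above. Assumes an OpenAI-compatible TensorRT LLM
# server is reachable at the placeholder base_url below.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # placeholder endpoint
    api_key="not-used",                   # placeholder; local servers often ignore it
)

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model name
    messages=[{"role": "user", "content": "Where is New York?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```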
 
 - Dynamo K8s Example
 - Model Recipes
 
Models
CLI Reference
API Reference
- LLM API Introduction
 - API Reference
  Classes: LLM, CompletionOutput, RequestOutput, GuidedDecodingParams, SamplingParams, DisaggregatedParams, KvCacheConfig, KvCacheRetentionConfig, CudaGraphConfig, MoeConfig, LookaheadDecodingConfig, MedusaDecodingConfig, EagleDecodingConfig, MTPDecodingConfig, SchedulerConfig, CapacitySchedulerPolicy, BuildConfig, QuantConfig, QuantAlgo, CalibConfig, BuildCacheConfig, RequestError, MpiCommSession, ExtendedRuntimePerfKnobConfig, BatchingType, ContextChunkingPolicy, DynamicBatchConfig, CacheTransceiverConfig, NGramDecodingConfig, UserProvidedDecodingConfig, TorchCompileConfig, DraftTargetDecodingConfig, LlmArgs, TorchLlmArgs, TrtLlmArgs, AutoDecodingConfig, AttentionDpConfig, LoRARequest
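The class list above is the surface of the LLM API. As a rough, hedged sketch, the snippet below shows how two of those classes (LLM and SamplingParams) fit together; the model ID is only an example, and the full parameter sets are documented in the API Reference pages.

```python
# Minimal sketch of the LLM API, assuming the tensorrt_llm package is installed;
# the model ID is an example placeholder, not a recommendation.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# generate() returns RequestOutput objects, each holding CompletionOutput entries.
for output in llm.generate(["Hello, my name is"], sampling_params):
    print(output.outputs[0].text)
```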
 
Features
- Feature Combination Matrix
 - Multi-Head, Multi-Query, and Group-Query Attention
 - Disaggregated Serving (Beta)
 - KV Cache System
 - Long Sequences
 - LoRA (Low-Rank Adaptation)
 - Multimodal Support in TensorRT LLM
 - Overlap Scheduler
 - Paged Attention, IFB, and Request Scheduling
 - Parallelism in TensorRT LLM
 - Quantization
 - Sampling
 - Speculative Decoding
 - Checkpoint Loading
 - AutoDeploy (Prototype)
 
Developer Guide
Blogs
- Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
 - DeepSeek R1 MTP Implementation and Optimization
 - Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
 - Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)
 - Disaggregated Serving in TensorRT LLM
 - How to launch Llama4 Maverick + Eagle3 TensorRT LLM server
 - N-Gram Speculative Decoding in TensorRT LLM
 - Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)
 - Running a High Performance GPT-OSS-120B Inference Server with TensorRT LLM
 - How to get the best performance on DeepSeek-R1 in TensorRT LLM
 - H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM
 - New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
 - H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token
 
Quick Links