Welcome to TensorRT LLM’s Documentation!
Getting Started
Deployment Guide
- LLM Examples
- Online Serving Examples
- Curl Chat Client
- Curl Chat Client for Multimodal
- Curl Completion Client
- DeepSeek R1 Reasoning Parser
- GenAI-Perf Client
- GenAI-Perf Client for Multimodal
- OpenAI Chat Client
- OpenAI Chat Client for Multimodal
- OpenAI Completion Client
- OpenAI Completion Client for LoRA
- OpenAI Completion Client with JSON Schema
- Dynamo K8s Example
- Model Recipes
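The online serving examples above all talk to the OpenAI-compatible endpoint exposed by a running TensorRT LLM server (for example, one started with `trtllm-serve`). As a rough illustration only, assuming such a server is already listening on `http://localhost:8000` and the `openai` Python package is installed, a chat client can be sketched as follows (the model name is a placeholder for whatever model the server was launched with):

```python
# Minimal sketch of an OpenAI-compatible chat client.
# Assumes a TensorRT LLM server is already running on localhost:8000;
# the model name below is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",  # a local server typically ignores the API key
)

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model name
    messages=[{"role": "user", "content": "Where is New York?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```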
Models
CLI Reference
API Reference
- LLM API Introduction
- API Reference
- LLM
- CompletionOutput
- RequestOutput
- GuidedDecodingParams
- SamplingParams
- DisaggregatedParams
- KvCacheConfig
- KvCacheRetentionConfig
- CudaGraphConfig
- MoeConfig
- LookaheadDecodingConfig
- MedusaDecodingConfig
- EagleDecodingConfig
- MTPDecodingConfig
- SchedulerConfig
- CapacitySchedulerPolicy
- BuildConfig
- QuantConfig
- QuantAlgo
- CalibConfig
- BuildCacheConfig
- RequestError
- MpiCommSession
- ExtendedRuntimePerfKnobConfig
- BatchingType
- ContextChunkingPolicy
- DynamicBatchConfig
- CacheTransceiverConfig
- NGramDecodingConfig
- UserProvidedDecodingConfig
- TorchCompileConfig
- DraftTargetDecodingConfig
- LlmArgs
- TorchLlmArgs
- TrtLlmArgs
- AutoDecodingConfig
- AttentionDpConfig
- LoRARequest
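For orientation, the classes listed above are the building blocks of the Python LLM API. A minimal offline-generation sketch, assuming `tensorrt_llm` is installed and the model checkpoint can be fetched from the Hugging Face Hub, might look like this (the model name is only an example):

```python
# Minimal sketch of the LLM API: load a model, set sampling parameters,
# and generate completions for a few prompts.
from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # example model
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

    prompts = ["Hello, my name is", "The capital of France is"]
    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")

if __name__ == "__main__":
    main()
```

The LLM API Introduction linked above walks through this workflow in detail, and the per-class pages document the configuration objects (for example `KvCacheConfig` and `SchedulerConfig`) that can be passed when constructing the `LLM`.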
Features
- Feature Combination Matrix
- Multi-Head, Multi-Query, and Group-Query Attention
- Disaggregated Serving (Beta)
- KV Cache System
- Long Sequences
- LoRA (Low-Rank Adaptation)
- Multimodal Support in TensorRT LLM
- Overlap Scheduler
- Paged Attention, IFB, and Request Scheduling
- Parallelism in TensorRT LLM
- Quantization
- Sampling
- Speculative Decoding
- Checkpoint Loading
- AutoDeploy (Prototype)
Developer Guide
Blogs
- Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
- DeepSeek R1 MTP Implementation and Optimization
- Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
- Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)
- Disaggregated Serving in TensorRT LLM
- How to launch Llama4 Maverick + Eagle3 TensorRT LLM server
- N-Gram Speculative Decoding in TensorRT LLM
- Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)
- Running a High Performance GPT-OSS-120B Inference Server with TensorRT LLM
- How to get best performance on DeepSeek-R1 in TensorRT LLM
- H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM
- New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
- H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token
Quick Links