Welcome to TensorRT LLM’s Documentation!
Getting Started
Deployment Guide
- LLM Examples
- Online Serving Examples
- Curl Chat Client
- Curl Chat Client for Multimodal
- Curl Completion Client
- DeepSeek R1 Reasoning Parser
- GenAI-Perf Client
- GenAI-Perf Client for Multimodal
- OpenAI Chat Client
- OpenAI Chat Client for Multimodal
- OpenAI Completion Client
- OpenAI Completion Client for LoRA
- OpenAI Completion Client with JSON Schema
- Dynamo K8s Example
- Model Recipes
- Quick Start Recipe for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for Llama3.3 70B on TensorRT LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for Llama4 Scout 17B on TensorRT LLM - Blackwell & Hopper Hardware
- Quick Start Recipe for GPT-OSS on TensorRT LLM - Blackwell Hardware
- Quick Start Recipe for Qwen3 Next on TensorRT LLM - Blackwell & Hopper Hardware
Models
CLI Reference
API Reference
- LLM API Introduction
- API Reference
- LLM
- MultimodalEncoder
- CompletionOutput
- RequestOutput
- GuidedDecodingParams
- SamplingParams
- DisaggregatedParams
- KvCacheConfig
- KvCacheRetentionConfig
- CudaGraphConfig
- MoeConfig
- LookaheadDecodingConfig
- MedusaDecodingConfig
- EagleDecodingConfig
- MTPDecodingConfig
- SchedulerConfig
- CapacitySchedulerPolicy
- BuildConfig
- QuantConfig
- QuantAlgo
- CalibConfig
- BuildCacheConfig
- RequestError
- MpiCommSession
- ExtendedRuntimePerfKnobConfig
- BatchingType
- ContextChunkingPolicy
- DynamicBatchConfig
- CacheTransceiverConfig
- NGramDecodingConfig
- UserProvidedDecodingConfig
- TorchCompileConfig
- DraftTargetDecodingConfig
- LlmArgs
- TorchLlmArgs
- TrtLlmArgs
- AutoDecodingConfig
- AttentionDpConfig
- LoRARequest
- SaveHiddenStatesDecodingConfig
- RocketSparseAttentionConfig
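For orientation, here is a minimal sketch of how a few of the classes listed above (LLM, SamplingParams, RequestOutput) fit together. It assumes a working `tensorrt_llm` installation and uses an illustrative model name; see the LLM API Introduction for authoritative usage.

```python
from tensorrt_llm import LLM, SamplingParams

# Sketch only: the model identifier and parameter values are illustrative.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

# generate() returns one RequestOutput per prompt; each holds generated completions.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```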
Features
- Feature Combination Matrix
- Multi-Head, Multi-Query, and Group-Query Attention
- Disaggregated Serving (Beta)
- KV Cache System
- Long Sequences
- LoRA (Low-Rank Adaptation)
- Multimodal Support in TensorRT LLM
- Overlap Scheduler
- Paged Attention, IFB, and Request Scheduling
- Parallelism in TensorRT LLM
- Quantization
- Sampling
- Speculative Decoding
- Checkpoint Loading
- AutoDeploy (Prototype)
- Ray Orchestrator (Prototype)
Developer Guide
Blogs
- ADP Balance Strategy
- Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT LLM)
- Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly
- Inference Time Compute Implementation in TensorRT LLM
- Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)
- Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
- DeepSeek R1 MTP Implementation and Optimization
- Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
- Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)
- Disaggregated Serving in TensorRT LLM
- How to launch Llama4 Maverick + Eagle3 TensorRT LLM server
- N-Gram Speculative Decoding in TensorRT LLM
- Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)
- Running a High Performance GPT-OSS-120B Inference Server with TensorRT LLM
- How to get best performance on DeepSeek-R1 in TensorRT LLM
- H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM
- New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
- H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token
Quick Links