Welcome to TensorRT LLM’s Documentation!
Getting Started
Deployment Guide
- LLM Examples
- Online Serving Examples
- Curl Chat Client
- Curl Chat Client for Multimodal
- Curl Completion Client
- DeepSeek R1 Reasoning Parser
- GenAI-Perf Client
- GenAI-Perf Client for Multimodal
- OpenAI Chat Client
- OpenAI Chat Client for Multimodal
- OpenAI Completion Client
- OpenAI Completion Client for LoRA
- OpenAI Completion Client with JSON Schema
- Dynamo K8s Example
- Model Recipes
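The Online Serving Examples above target an OpenAI-compatible endpoint such as the one exposed by `trtllm-serve`. A minimal chat-request sketch, assuming a server is already running at `http://localhost:8000` and that the model ID used below is a placeholder for whatever the server is serving:

```python
# Minimal chat request against an OpenAI-compatible TensorRT LLM endpoint.
# Assumes a server started elsewhere (e.g. with `trtllm-serve`) is listening
# on localhost:8000; the model ID below is only a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role": "user", "content": "Where is New York?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```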
Models
CLI Reference
API Reference
- LLM API Introduction
- API Reference
- LLM
- MultimodalEncoder
- CompletionOutput
- RequestOutput
- GuidedDecodingParams
- SamplingParams
- DisaggregatedParams
- KvCacheConfig
- KvCacheRetentionConfig
- CudaGraphConfig
- MoeConfig
- LookaheadDecodingConfig
- MedusaDecodingConfig
- EagleDecodingConfig
- MTPDecodingConfig
- SchedulerConfig
- CapacitySchedulerPolicy
- BuildConfig
- QuantConfig
- QuantAlgo
- CalibConfig
- BuildCacheConfig
- RequestError
- MpiCommSession
- ExtendedRuntimePerfKnobConfig
- BatchingType
- ContextChunkingPolicy
- DynamicBatchConfig
- CacheTransceiverConfig
- NGramDecodingConfig
- UserProvidedDecodingConfig
- TorchCompileConfig
- DraftTargetDecodingConfig
- LlmArgs
- TorchLlmArgs
- TrtLlmArgs
- AutoDecodingConfig
- AttentionDpConfig
- LoRARequest
- SaveHiddenStatesDecodingConfig
- RocketSparseAttentionConfig
- DeepSeekSparseAttentionConfig
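Several of the classes listed above (`LLM`, `SamplingParams`, `RequestOutput`) form the core of the LLM API. A minimal offline-generation sketch, assuming the placeholder model ID below can be fetched from the Hugging Face Hub:

```python
# Minimal offline generation with the LLM API; the model ID is a placeholder.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() returns one RequestOutput per prompt.
for output in llm.generate(["Hello, my name is"], sampling_params):
    print(output.outputs[0].text)
```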
Features
- Feature Combination Matrix
- Multi-Head, Multi-Query, and Group-Query Attention
- Disaggregated Serving
- KV Cache System
- Long Sequences
- LoRA (Low-Rank Adaptation)
- Multimodal Support in TensorRT LLM
- Overlap Scheduler
- Paged Attention, IFB, and Request Scheduling
- Parallelism in TensorRT LLM
- Quantization
- Sampling
- Additional Outputs
- Speculative Decoding
- Checkpoint Loading
- AutoDeploy (Prototype)
- Ray Orchestrator (Prototype)
- Torch Compile & Piecewise CUDA Graph
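Many of the features above are configured through the classes listed in the API Reference. As one hedged example, a sketch of tuning the KV Cache System via `KvCacheConfig`; the import path and field names here are assumptions for illustration, not a definitive configuration:

```python
# Sketch: constraining KV-cache memory and enabling block reuse via KvCacheConfig.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig  # assumed import path

kv_cache_config = KvCacheConfig(
    free_gpu_memory_fraction=0.8,  # assumed field: fraction of free GPU memory given to the KV cache
    enable_block_reuse=True,       # assumed field: reuse cached blocks across requests sharing a prefix
)
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0", kv_cache_config=kv_cache_config)
```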
Developer Guide
Blogs
- ADP Balance Strategy
- Running GPT-OSS-120B with Eagle3 Speculative Decoding on GB200/B200 (TensorRT LLM)
- Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly
- Inference Time Compute Implementation in TensorRT LLM
- Scaling Expert Parallelism in TensorRT LLM (Part 3: Pushing the Performance Boundary)
- Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
- DeepSeek R1 MTP Implementation and Optimization
- Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
- Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)
- Disaggregated Serving in TensorRT LLM
- How to launch Llama4 Maverick + Eagle3 TensorRT LLM server
- N-Gram Speculative Decoding in TensorRT LLM
- Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)
- Running a High Performance GPT-OSS-120B Inference Server with TensorRT LLM
- How to get best performance on DeepSeek-R1 in TensorRT LLM
- H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM
- New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
- H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token
Quick Links