Overview#

About TensorRT LLM#

TensorRT LLM is NVIDIA’s comprehensive open-source library for accelerating and optimizing inference performance of the latest large language models (LLMs) on NVIDIA GPUs.

Key Capabilities#

🔥 Architected on Pytorch#

TensorRT LLM provides a high-level Python LLM API that supports a wide range of inference setups - from single-GPU to multi-GPU or multi-node deployments. It includes built-in support for various parallelism strategies and advanced features. The LLM API integrates seamlessly with the broader inference ecosystem, including NVIDIA Dynamo and the Triton Inference Server.

TensorRT LLM is designed to be modular and easy to modify. Its PyTorch-native architecture allows developers to experiment with the runtime or extend functionality. Several popular models are also pre-defined and can be customized using native PyTorch code, making it easy to adapt the system to specific needs.

⚡ State-of-the-Art Performance#

TensorRT LLM delivers breakthrough performance on the latest NVIDIA GPUs:

DeepSeek R1: World-record inference performance on Blackwell GPUs
Llama 4 Maverick: Breaks the 1,000 TPS/User Barrier on B200 GPUs

🎯 Comprehensive Model Support#

TensorRT LLM supports the latest and most popular LLM and DiT architectures. See complete list.

Language Models: GPT-OSS, Deepseek-R1/V3, Llama 3/4, Qwen2/3, Gemma 3, Phi 4…
Multi-modal Models: LLaVA-NeXT, Qwen2-VL, VILA, Llama 3.2 Vision…
Visual Generation Models: FLUX, Wan2.1/2.2 for image and video generation.

TensorRT LLM strives to support the most popular models on Day 0.

FP4 Support#

NVIDIA B200 GPUs, when used with TensorRT LLM, enable seamless loading of model weights in the new FP4 format, allowing you to automatically leverage optimized FP4 kernels for efficient and accurate low-precision inference.

FP8 Support#

On NVIDIA H100 and later GPUs, TensorRT LLM supports FP8 quantization, which can double performance and halve memory consumption compared to 16-bit floating point, with minimal impact on model accuracy.

🚀 Advanced Optimization & Production Features#

In-Flight Batching & Paged Attention: In-flight batching eliminates wait times by dynamically managing request execution, processing context and generation phases together for maximum GPU utilization and reduced latency.
Multi-GPU Multi-Node Inference: Seamless distributed inference with tensor, pipeline, and expert parallelism across multiple GPUs and nodes through the Model Definition API.
Advanced Quantization:
- FP4 Quantization: Native support on NVIDIA B200 GPUs with optimized FP4 kernels
- FP8 Quantization: Automatic conversion on NVIDIA H100 GPUs leveraging Hopper architecture
Speculative Decoding: Multiple algorithms including EAGLE, MTP and NGram
KV Cache Management: Paged KV cache with intelligent block reuse and memory optimization
Chunked Prefill: Efficient handling of long sequences by splitting context into manageable chunks
LoRA Support: Multi-adapter support with HuggingFace and NeMo formats, efficient fine-tuning and adaptation
Checkpoint Loading: Flexible model loading from various formats (HuggingFace, NeMo, custom)
Guided Decoding: Advanced sampling with stop words, bad words, and custom constraints
Disaggregated Serving (Beta): Separate context and generation phases across different GPUs for optimal resource utilization

What Can You Do With TensorRT LLM?#

Whether you’re building the next generation of AI applications, optimizing existing LLM deployments, or exploring the frontiers of large language model technology, TensorRT LLM provides the tools, performance, and flexibility you need to succeed in the era of generative AI. To get started, refer to the Quick Start Guide.