Welcome to TensorRT LLM’s Documentation!
Getting Started
Deployment Guide
- LLM Examples
- Online Serving Examples
- Curl Chat Client
- Curl Chat Client for Multimodal
- Curl Completion Client
- DeepSeek R1 Reasoning Parser
- GenAI-Perf Client
- GenAI-Perf Client for Multimodal
- OpenAI Chat Client
- OpenAI Chat Client for Multimodal
- OpenAI Completion Client
- OpenAI Completion Client for LoRA
- OpenAI Completion Client with JSON Schema
- Dynamo K8s Example
- Model Recipes
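The online serving examples above all talk to the OpenAI-compatible endpoint exposed by a running TensorRT LLM server (for example, one started with `trtllm-serve`). As a rough illustration only, assuming such a server is already listening on `http://localhost:8000` and the `openai` Python package is installed, a chat client can be sketched as follows (the model name is a placeholder for whatever model the server was launched with):

```python
# Minimal sketch of an OpenAI-compatible chat client.
# Assumes a TensorRT LLM server is already running on localhost:8000;
# the model name below is a placeholder.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-used",  # a local server typically ignores the API key
)

response = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model name
    messages=[{"role": "user", "content": "Where is New York?"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```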
Models
CLI Reference
API Reference
- LLM API Introduction
- API Reference
- LLM
- CompletionOutput
- RequestOutput
- GuidedDecodingParams
- SamplingParams
- DisaggregatedParams
- KvCacheConfig
- KvCacheRetentionConfig
- CudaGraphConfig
- MoeConfig
- LookaheadDecodingConfig
- MedusaDecodingConfig
- EagleDecodingConfig
- MTPDecodingConfig
- SchedulerConfig
- CapacitySchedulerPolicy
- BuildConfig
- QuantConfig
- QuantAlgo
- CalibConfig
- BuildCacheConfig
- RequestError
- MpiCommSession
- ExtendedRuntimePerfKnobConfig
- BatchingType
- ContextChunkingPolicy
- DynamicBatchConfig
- CacheTransceiverConfig
- NGramDecodingConfig
- UserProvidedDecodingConfig
- TorchCompileConfig
- DraftTargetDecodingConfig
- LlmArgs
- TorchLlmArgs
- TrtLlmArgs
- AutoDecodingConfig
- AttentionDpConfig
- LoRARequest
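For orientation, the classes listed above are the building blocks of the Python LLM API. A minimal offline-generation sketch, assuming `tensorrt_llm` is installed and the model checkpoint can be fetched from the Hugging Face Hub, might look like this (the model name is only an example):

```python
# Minimal sketch of the LLM API: load a model, set sampling parameters,
# and generate completions for a few prompts.
from tensorrt_llm import LLM, SamplingParams

def main():
    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # example model
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

    prompts = ["Hello, my name is", "The capital of France is"]
    for output in llm.generate(prompts, sampling_params):
        print(f"Prompt: {output.prompt!r} -> {output.outputs[0].text!r}")

if __name__ == "__main__":
    main()
```

The LLM API Introduction linked above walks through this workflow in detail, and the per-class pages document the configuration objects (for example `KvCacheConfig` and `SchedulerConfig`) that can be passed when constructing the `LLM`.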
Features
- Feature Combination Matrix
- Multi-Head, Multi-Query, and Group-Query Attention
- Disaggregated Serving (Beta)
- KV Cache System
- Long Sequences
- LoRA (Low-Rank Adaptation)
- Multimodal Support in TensorRT LLM
- Overlap Scheduler
- Paged Attention, IFB, and Request Scheduling
- Parallelism in TensorRT LLM
- Quantization
- Sampling
- Speculative Decoding
- Checkpoint Loading
- AutoDeploy (Prototype)
Developer Guide
Blogs
- Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs
- DeepSeek R1 MTP Implementation and Optimization
- Optimizing DeepSeek R1 Throughput on NVIDIA Blackwell GPUs: A Deep Dive for Developers
- Scaling Expert Parallelism in TensorRT LLM (Part 1: Design and Implementation of Large-scale EP)
- Disaggregated Serving in TensorRT LLM
- How to launch Llama4 Maverick + Eagle3 TensorRT LLM server
- N-Gram Speculative Decoding in TensorRT LLM
- Scaling Expert Parallelism in TensorRT LLM (Part 2: Performance Status and Optimization)
- Running a High Performance GPT-OSS-120B Inference Server with TensorRT LLM
- How to get best performance on DeepSeek-R1 in TensorRT LLM
- H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT LLM
- New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget
- H100 has 4.6x A100 Performance in TensorRT LLM, achieving 10,000 tok/s at 100ms to first token
Quick Links