TensorRT-LLM

Table of Contents

Getting Started

  • Overview
  • Quick Start Guide
  • Key Features
  • PyTorch Backend
  • Release Notes

Installation

  • Installing on Linux
  • Building from Source Code on Linux
  • Installing on Grace Hopper

LLM API

  • API Introduction
  • API Reference

LLM API Examples

  • LLM Examples Introduction
    • Generate text with guided decoding
    • Control generated text using logits post processor
    • Generate text
    • Generate Text Asynchronously
    • Generate Text in Streaming
    • Generate text with customization
    • Distributed LLM Generation
    • Generate Text Using Medusa Decoding
    • Generation with Quantization
    • Generate Text Using Lookahead Decoding
    • Generate text with multiple LoRA adapters
    • Automatic Parallelism with LLM
  • Common Customizations
  • Examples
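
For orientation, the following is a minimal sketch of the high-level LLM API used throughout the examples above, modeled on the Quick Start Guide; the model name and sampling settings are illustrative placeholders, not requirements.

    from tensorrt_llm import LLM, SamplingParams

    def main():
        # Illustrative checkpoint; any supported model can be passed instead.
        llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

        prompts = ["Hello, my name is", "The capital of France is"]
        sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

        # generate() returns one result per prompt.
        for output in llm.generate(prompts, sampling_params):
            print(f"Prompt: {output.prompt!r}")
            print(f"Generated: {output.outputs[0].text!r}")

    if __name__ == "__main__":
        main()

See the LLM Examples Introduction and API Reference for the asynchronous, streaming, and distributed variants listed above.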

Model Definition API

  • Layers
  • Functionals
  • Models
  • Plugin
  • Quantization
  • Runtime

C++ API

  • Executor
  • Runtime

Command-Line Reference

  • trtllm-build
  • trtllm-serve

Architecture

  • TensorRT-LLM Architecture
  • Model Definition
  • TensorRT-LLM Checkpoint
  • TensorRT-LLM Build Workflow
  • Adding a Model

Advanced

  • Multi-Head, Multi-Query, and Group-Query Attention
  • C++ GPT Runtime
  • Executor API
  • Graph Rewriting Module
  • Inference Request
  • Run gpt-2b + LoRA using GptManager / cpp runtime
  • Expert Parallelism in TensorRT-LLM
  • KV cache reuse
  • Speculative Sampling
  • Disaggregated-Service (experimental)

Performance

  • Overview
  • Benchmarking
  • Performance Tuning Guide
    • Benchmarking Default Performance
    • Useful Build-Time Flags
    • Tuning Max Batch Size and Max Num Tokens
    • Deciding Model Sharding Strategy
    • FP8 Quantization
    • Useful Runtime Options
  • Performance Analysis

Reference

  • Troubleshooting
  • Support Matrix
  • Numerical Precision
  • Memory Usage of TensorRT-LLM

Blogs

  • H100 has 4.6x A100 Performance in TensorRT-LLM, achieving 10,000 tok/s at 100ms to first token
  • H200 achieves nearly 12,000 tokens/sec on Llama2-13B with TensorRT-LLM
  • Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100
  • Speed up inference with SOTA quantization techniques in TRT-LLM
  • New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget

Python Module Index

t
  • tensorrt_llm
    • tensorrt_llm.functional
    • tensorrt_llm.layers.activation
    • tensorrt_llm.layers.attention
    • tensorrt_llm.layers.cast
    • tensorrt_llm.layers.conv
    • tensorrt_llm.layers.embedding
    • tensorrt_llm.layers.linear
    • tensorrt_llm.layers.mlp
    • tensorrt_llm.layers.normalization
    • tensorrt_llm.layers.pooling
    • tensorrt_llm.models
    • tensorrt_llm.plugin
    • tensorrt_llm.quantization
    • tensorrt_llm.runtime
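
The index above maps directly onto importable submodules. As a minimal, import-only sketch of the package layout (assuming a working installation):

    # Top-level package
    import tensorrt_llm

    # Submodules listed in the index above
    from tensorrt_llm import functional, models, plugin, quantization, runtime
    from tensorrt_llm.layers import attention, embedding, linear, normalization

    print(tensorrt_llm.__version__)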

Copyright © 2025, NVIDIA.