Model Recipes#
Quick Start for Popular Models#
The table below lists trtllm-serve commands for deploying popular models, including DeepSeek-R1, gpt-oss, Llama 4, Qwen3, and more.
We maintain LLM API configuration files for these models containing recommended performance settings in two locations:
- Curated Examples: examples/configs/curated - Hand-picked configurations for common scenarios.
- Comprehensive Database: examples/configs/database - A more comprehensive set of known-good configurations for various GPUs and traffic patterns.
The TensorRT LLM Docker container makes these config files available at /app/tensorrt_llm/examples/configs/curated and /app/tensorrt_llm/examples/configs/database, respectively. You can reference them through an environment variable:
export TRTLLM_DIR="/app/tensorrt_llm" # path to the TensorRT LLM repo in your local environment
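To see which config files your installation actually ships, a small helper can walk the two directories named above. The sketch below is illustrative only: the function names `list_configs` and `print_serve_command` are hypothetical, the configs are assumed to be YAML files, and `--extra_llm_api_options` is assumed to be the flag your trtllm-serve release uses to load an LLM API config file (check `trtllm-serve --help` to confirm).

```shell
#!/bin/sh
# Root of the TensorRT LLM checkout; defaults to the container path given above.
TRTLLM_DIR="${TRTLLM_DIR:-/app/tensorrt_llm}"

# List every YAML config under one of the two config trees.
# $1 is "curated" or "database". (Hypothetical helper, not part of TensorRT LLM.)
list_configs() {
    find "${TRTLLM_DIR}/examples/configs/$1" -type f \
        \( -name '*.yaml' -o -name '*.yml' \) | sort
}

# Print, without running it, a serve command for a model and a config file.
# $1 is the model name or path, $2 is the config file path.
# Assumption: --extra_llm_api_options loads the LLM API config in your release.
print_serve_command() {
    printf 'trtllm-serve %s --extra_llm_api_options %s\n' "$1" "$2"
}
```

Running `list_configs curated` shows the available files; passing one of them to `print_serve_command` together with your model name yields a command you can inspect before launching it.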
Note
The configs here are optimized for a target ISL/OSL (Input/Output Sequence Length) of 1024/1024. If your traffic pattern is different, refer to the Preconfigured Recipes section below, which covers a larger set of traffic patterns and performance profiles.
This table is intended as a straightforward starting point; for detailed, model-specific instructions, see the deployment guides below.
| Model Name | GPU | Inference Scenario | Config | Command |
|---|---|---|---|---|
| | H100, H200 | Max Throughput | | |
| | B200, GB200 | Max Throughput | | |
| | B200, GB200 | Max Throughput | | |
| | B200, GB200 | Min Latency | | |
| | Any | Max Throughput | | |
| | Any | Min Latency | | |
| | Any | Max Throughput | | |
| Qwen3 family (e.g. Qwen3-30B-A3B) | Any | Max Throughput | | |
| | Any | Max Throughput | | |
| | Any | Max Throughput | | |
Model-Specific Deployment Guides#
The deployment guides below provide more detailed instructions for serving specific models with TensorRT LLM.
- Deployment Guide for DeepSeek R1 on TensorRT LLM - Blackwell & Hopper Hardware
- Deployment Guide for Llama3.3 70B on TensorRT LLM - Blackwell & Hopper Hardware
- Deployment Guide for Llama4 Scout 17B on TensorRT LLM - Blackwell & Hopper Hardware
- Deployment Guide for GPT-OSS on TensorRT-LLM - Blackwell Hardware
- Deployment Guide for Qwen3 on TensorRT LLM - Blackwell & Hopper Hardware
- Deployment Guide for Qwen3 Next on TensorRT LLM - Blackwell & Hopper Hardware
- Deployment Guide for Kimi K2 Thinking on TensorRT LLM - Blackwell
Preconfigured Recipes#
Recipe selector#
Note
Traffic Patterns: The ISL (Input Sequence Length) and OSL (Output Sequence Length) values in each configuration represent the maximum supported values for that config. Requests exceeding these limits may result in errors.
To handle requests with input sequences longer than the configured ISL, add the following to your config file:
enable_chunked_prefill: true
This enables chunked prefill, which processes long input sequences in chunks rather than requiring them to fit within a single prefill operation. Note that enabling chunked prefill does not guarantee optimal performance—these configs are tuned for the specified ISL/OSL.
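If you prefer not to edit a shipped config in place, one option is to write a modified copy. The sketch below is a minimal illustration, assuming the config is a YAML file in which `enable_chunked_prefill` is a top-level key; the function name `enable_chunked_prefill` is hypothetical and not part of TensorRT LLM.

```shell
#!/bin/sh
# Write a copy of a config with chunked prefill enabled; the source is untouched.
# $1 is the source config, $2 is the destination path. (Hypothetical helper.)
enable_chunked_prefill() {
    src="$1"
    dst="$2"
    # Rewrite an existing top-level enable_chunked_prefill entry, whatever its value.
    sed 's/^enable_chunked_prefill:.*/enable_chunked_prefill: true/' "$src" > "$dst"
    # Append the key when the source config did not set it at all.
    if ! grep -q '^enable_chunked_prefill:' "$dst"; then
        # Make sure the file ends with a newline before appending.
        if [ -n "$(tail -c 1 "$dst")" ]; then
            printf '\n' >> "$dst"
        fi
        printf 'enable_chunked_prefill: true\n' >> "$dst"
    fi
}
```

The copy can then be passed to trtllm-serve in place of the original file. Because this is a plain text edit, it does not handle a key nested under another section or written with leading indentation.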
Recipe database#
The tables below list all available preconfigured model scenarios in the TensorRT LLM configuration database. Each row represents a specific combination of model, GPU, and performance profile, with recommended request settings.
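Because the ISL and OSL of a config are upper limits, choosing a row starts with finding a traffic pattern that covers your longest expected request. The sketch below illustrates that check; `pick_pattern` is a hypothetical helper, and returning the first covering pattern is a simplifying assumption rather than a tuning recommendation.

```shell
#!/bin/sh
# Print the first "ISL/OSL" pattern whose limits cover the request.
# $1 is the request input length, $2 the request output length,
# and the remaining arguments are candidate patterns such as 1024/8192.
# Returns non-zero when no candidate covers the request. (Hypothetical helper.)
pick_pattern() {
    req_isl="$1"
    req_osl="$2"
    shift 2
    for pattern in "$@"; do
        isl="${pattern%/*}"   # text before the slash
        osl="${pattern#*/}"   # text after the slash
        if [ "$req_isl" -le "$isl" ] && [ "$req_osl" -le "$osl" ]; then
            printf '%s\n' "$pattern"
            return 0
        fi
    done
    return 1
}
```

For instance, `pick_pattern 900 4000 1024/1024 1024/8192 8192/1024` selects the 1024/8192 pattern, while a request with a 9000-token input matches none of the three and would need chunked prefill or a different config.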
DeepSeek-R1#
| GPU | Performance Profile | ISL / OSL | Concurrency | Config | Command |
|---|---|---|---|---|---|
| 8xB200_NVL | Min Latency | 1024 / 1024 | 4 | | |
| 8xB200_NVL | Low Latency | 1024 / 1024 | 8 | | |
| 8xB200_NVL | Balanced | 1024 / 1024 | 16 | | |
| 8xB200_NVL | High Throughput | 1024 / 1024 | 32 | | |
| 8xB200_NVL | Max Throughput | 1024 / 1024 | 64 | | |
| 8xB200_NVL | Min Latency | 8192 / 1024 | 4 | | |
| 8xB200_NVL | Low Latency | 8192 / 1024 | 8 | | |
| 8xB200_NVL | Balanced | 8192 / 1024 | 16 | | |
| 8xB200_NVL | High Throughput | 8192 / 1024 | 32 | | |
| 8xB200_NVL | Max Throughput | 8192 / 1024 | 64 | | |
| 8xH200_SXM | Min Latency | 1024 / 1024 | 4 | | |
| 8xH200_SXM | Low Latency | 1024 / 1024 | 8 | | |
| 8xH200_SXM | Balanced | 1024 / 1024 | 16 | | |
| 8xH200_SXM | High Throughput | 1024 / 1024 | 32 | | |
| 8xH200_SXM | Max Throughput | 1024 / 1024 | 64 | | |
| 8xH200_SXM | Min Latency | 8192 / 1024 | 4 | | |
| 8xH200_SXM | Low Latency | 8192 / 1024 | 8 | | |
| 8xH200_SXM | Balanced | 8192 / 1024 | 16 | | |
| 8xH200_SXM | High Throughput | 8192 / 1024 | 32 | | |
| 8xH200_SXM | Max Throughput | 8192 / 1024 | 64 | | |
DeepSeek-R1 (NVFP4)#
| GPU | Performance Profile | ISL / OSL | Concurrency | Config | Command |
|---|---|---|---|---|---|
| 4xB200_NVL | Min Latency | 1024 / 1024 | 4 | | |
| 4xB200_NVL | Low Latency | 1024 / 1024 | 8 | | |
| 4xB200_NVL | Low Latency | 1024 / 1024 | 16 | | |
| 4xB200_NVL | Balanced | 1024 / 1024 | 32 | | |
| 4xB200_NVL | High Throughput | 1024 / 1024 | 64 | | |
| 4xB200_NVL | High Throughput | 1024 / 1024 | 128 | | |
| 4xB200_NVL | Max Throughput | 1024 / 1024 | 256 | | |
| 4xB200_NVL | Min Latency | 8192 / 1024 | 4 | | |
| 4xB200_NVL | Low Latency | 8192 / 1024 | 8 | | |
| 4xB200_NVL | Low Latency | 8192 / 1024 | 16 | | |
| 4xB200_NVL | Balanced | 8192 / 1024 | 32 | | |
| 4xB200_NVL | High Throughput | 8192 / 1024 | 64 | | |
| 4xB200_NVL | High Throughput | 8192 / 1024 | 128 | | |
| 4xB200_NVL | Max Throughput | 8192 / 1024 | 256 | | |
| 8xB200_NVL | Min Latency | 1024 / 1024 | 4 | | |
| 8xB200_NVL | Low Latency | 1024 / 1024 | 8 | | |
| 8xB200_NVL | Low Latency | 1024 / 1024 | 16 | | |
| 8xB200_NVL | Balanced | 1024 / 1024 | 32 | | |
| 8xB200_NVL | High Throughput | 1024 / 1024 | 64 | | |
| 8xB200_NVL | High Throughput | 1024 / 1024 | 128 | | |
| 8xB200_NVL | Max Throughput | 1024 / 1024 | 256 | | |
| 8xB200_NVL | Min Latency | 8192 / 1024 | 4 | | |
| 8xB200_NVL | Low Latency | 8192 / 1024 | 8 | | |
| 8xB200_NVL | Low Latency | 8192 / 1024 | 16 | | |
| 8xB200_NVL | Balanced | 8192 / 1024 | 32 | | |
| 8xB200_NVL | High Throughput | 8192 / 1024 | 64 | | |
| 8xB200_NVL | High Throughput | 8192 / 1024 | 128 | | |
| 8xB200_NVL | Max Throughput | 8192 / 1024 | 256 | | |
gpt-oss-120b#
| GPU | Performance Profile | ISL / OSL | Concurrency | Config | Command |
|---|---|---|---|---|---|
| B200_NVL | Min Latency | 1024 / 1024 | 4 | | |
| B200_NVL | Low Latency | 1024 / 1024 | 8 | | |
| B200_NVL | Balanced | 1024 / 1024 | 16 | | |
| B200_NVL | High Throughput | 1024 / 1024 | 32 | | |
| B200_NVL | Max Throughput | 1024 / 1024 | 64 | | |
| B200_NVL | Min Latency | 1024 / 8192 | 4 | | |
| B200_NVL | Low Latency | 1024 / 8192 | 8 | | |
| B200_NVL | Balanced | 1024 / 8192 | 16 | | |
| B200_NVL | High Throughput | 1024 / 8192 | 32 | | |
| B200_NVL | Max Throughput | 1024 / 8192 | 64 | | |
| B200_NVL | Min Latency | 8192 / 1024 | 4 | | |
| B200_NVL | Low Latency | 8192 / 1024 | 8 | | |
| B200_NVL | Balanced | 8192 / 1024 | 16 | | |
| B200_NVL | High Throughput | 8192 / 1024 | 32 | | |
| B200_NVL | Max Throughput | 8192 / 1024 | 64 | | |
| 2xB200_NVL | Min Latency | 1024 / 1024 | 4 | | |
| 2xB200_NVL | Low Latency | 1024 / 1024 | 8 | | |
| 2xB200_NVL | Balanced | 1024 / 1024 | 16 | | |
| 2xB200_NVL | High Throughput | 1024 / 1024 | 32 | | |
| 2xB200_NVL | Max Throughput | 1024 / 1024 | 64 | | |
| 2xB200_NVL | Min Latency | 1024 / 8192 | 4 | | |
| 2xB200_NVL | Low Latency | 1024 / 8192 | 8 | | |
| 2xB200_NVL | Balanced | 1024 / 8192 | 16 | | |
| 2xB200_NVL | High Throughput | 1024 / 8192 | 32 | | |
| 2xB200_NVL | Max Throughput | 1024 / 8192 | 64 | | |
| 2xB200_NVL | Min Latency | 8192 / 1024 | 4 | | |
| 2xB200_NVL | Low Latency | 8192 / 1024 | 8 | | |
| 2xB200_NVL | Balanced | 8192 / 1024 | 16 | | |
| 2xB200_NVL | High Throughput | 8192 / 1024 | 32 | | |
| 2xB200_NVL | Max Throughput | 8192 / 1024 | 64 | | |
| 4xB200_NVL | Min Latency | 1024 / 1024 | 4 | | |
| 4xB200_NVL | Low Latency | 1024 / 1024 | 8 | | |
| 4xB200_NVL | Balanced | 1024 / 1024 | 16 | | |
| 4xB200_NVL | High Throughput | 1024 / 1024 | 32 | | |
| 4xB200_NVL | Max Throughput | 1024 / 1024 | 64 | | |
| 4xB200_NVL | Min Latency | 1024 / 8192 | 4 | | |
| 4xB200_NVL | Low Latency | 1024 / 8192 | 8 | | |
| 4xB200_NVL | Balanced | 1024 / 8192 | 16 | | |
| 4xB200_NVL | High Throughput | 1024 / 8192 | 32 | | |
| 4xB200_NVL | Max Throughput | 1024 / 8192 | 64 | | |
| 4xB200_NVL | Min Latency | 8192 / 1024 | 4 | | |
| 4xB200_NVL | Low Latency | 8192 / 1024 | 8 | | |
| 4xB200_NVL | Balanced | 8192 / 1024 | 16 | | |
| 4xB200_NVL | High Throughput | 8192 / 1024 | 32 | | |
| 4xB200_NVL | Max Throughput | 8192 / 1024 | 64 | | |
| 8xB200_NVL | Min Latency | 1024 / 1024 | 4 | | |
| 8xB200_NVL | Low Latency | 1024 / 1024 | 8 | | |
| 8xB200_NVL | Balanced | 1024 / 1024 | 16 | | |
| 8xB200_NVL | High Throughput | 1024 / 1024 | 32 | | |
| 8xB200_NVL | Max Throughput | 1024 / 1024 | 64 | | |
| 8xB200_NVL | Min Latency | 1024 / 8192 | 4 | | |
| 8xB200_NVL | Low Latency | 1024 / 8192 | 8 | | |
| 8xB200_NVL | Balanced | 1024 / 8192 | 16 | | |
| 8xB200_NVL | High Throughput | 1024 / 8192 | 32 | | |
| 8xB200_NVL | Max Throughput | 1024 / 8192 | 64 | | |
| 8xB200_NVL | Min Latency | 8192 / 1024 | 4 | | |
| 8xB200_NVL | Low Latency | 8192 / 1024 | 8 | | |
| 8xB200_NVL | Balanced | 8192 / 1024 | 16 | | |
| 8xB200_NVL | High Throughput | 8192 / 1024 | 32 | | |
| 8xB200_NVL | Max Throughput | 8192 / 1024 | 64 | | |
| H200_SXM | Min Latency | 1024 / 1024 | 4 | | |
| H200_SXM | Low Latency | 1024 / 1024 | 8 | | |
| H200_SXM | Balanced | 1024 / 1024 | 16 | | |
| H200_SXM | High Throughput | 1024 / 1024 | 32 | | |
| H200_SXM | Max Throughput | 1024 / 1024 | 64 | | |
| H200_SXM | Min Latency | 1024 / 8192 | 4 | | |
| H200_SXM | Low Latency | 1024 / 8192 | 8 | | |
| H200_SXM | Balanced | 1024 / 8192 | 16 | | |
| H200_SXM | High Throughput | 1024 / 8192 | 32 | | |
| H200_SXM | Max Throughput | 1024 / 8192 | 64 | | |
| H200_SXM | Min Latency | 8192 / 1024 | 4 | | |
| H200_SXM | Low Latency | 8192 / 1024 | 8 | | |
| H200_SXM | Balanced | 8192 / 1024 | 16 | | |
| H200_SXM | High Throughput | 8192 / 1024 | 32 | | |
| H200_SXM | Max Throughput | 8192 / 1024 | 64 | | |
| 2xH200_SXM | Min Latency | 1024 / 1024 | 4 | | |
| 2xH200_SXM | Low Latency | 1024 / 1024 | 8 | | |
| 2xH200_SXM | Balanced | 1024 / 1024 | 16 | | |
| 2xH200_SXM | High Throughput | 1024 / 1024 | 32 | | |
| 2xH200_SXM | Max Throughput | 1024 / 1024 | 64 | | |
| 2xH200_SXM | Min Latency | 1024 / 8192 | 4 | | |
| 2xH200_SXM | Low Latency | 1024 / 8192 | 8 | | |
| 2xH200_SXM | Balanced | 1024 / 8192 | 16 | | |
| 2xH200_SXM | High Throughput | 1024 / 8192 | 32 | | |
| 2xH200_SXM | Max Throughput | 1024 / 8192 | 64 | | |
| 2xH200_SXM | Min Latency | 8192 / 1024 | 4 | | |
| 2xH200_SXM | Low Latency | 8192 / 1024 | 8 | | |
| 2xH200_SXM | Balanced | 8192 / 1024 | 16 | | |
| 2xH200_SXM | High Throughput | 8192 / 1024 | 32 | | |
| 2xH200_SXM | Max Throughput | 8192 / 1024 | 64 | | |
| 4xH200_SXM | Min Latency | 1024 / 1024 | 4 | | |
| 4xH200_SXM | Low Latency | 1024 / 1024 | 8 | | |
| 4xH200_SXM | Balanced | 1024 / 1024 | 16 | | |
| 4xH200_SXM | High Throughput | 1024 / 1024 | 32 | | |
| 4xH200_SXM | Max Throughput | 1024 / 1024 | 64 | | |
| 4xH200_SXM | Min Latency | 1024 / 8192 | 4 | | |
| 4xH200_SXM | Low Latency | 1024 / 8192 | 8 | | |
| 4xH200_SXM | Balanced | 1024 / 8192 | 16 | | |
| 4xH200_SXM | High Throughput | 1024 / 8192 | 32 | | |
| 4xH200_SXM | Max Throughput | 1024 / 8192 | 64 | | |
| 4xH200_SXM | Min Latency | 8192 / 1024 | 4 | | |
| 4xH200_SXM | Low Latency | 8192 / 1024 | 8 | | |
| 4xH200_SXM | Balanced | 8192 / 1024 | 16 | | |
| 4xH200_SXM | High Throughput | 8192 / 1024 | 32 | | |
| 4xH200_SXM | Max Throughput | 8192 / 1024 | 64 | | |
| 8xH200_SXM | Min Latency | 1024 / 1024 | 4 | | |
| 8xH200_SXM | Low Latency | 1024 / 1024 | 8 | | |
| 8xH200_SXM | Balanced | 1024 / 1024 | 16 | | |
| 8xH200_SXM | High Throughput | 1024 / 1024 | 32 | | |
| 8xH200_SXM | Max Throughput | 1024 / 1024 | 64 | | |
| 8xH200_SXM | Min Latency | 1024 / 8192 | 4 | | |
| 8xH200_SXM | Low Latency | 1024 / 8192 | 8 | | |
| 8xH200_SXM | Balanced | 1024 / 8192 | 16 | | |
| 8xH200_SXM | High Throughput | 1024 / 8192 | 32 | | |
| 8xH200_SXM | Max Throughput | 1024 / 8192 | 64 | | |
| 8xH200_SXM | Min Latency | 8192 / 1024 | 4 | | |
| 8xH200_SXM | Low Latency | 8192 / 1024 | 8 | | |
| 8xH200_SXM | Balanced | 8192 / 1024 | 16 | | |
| 8xH200_SXM | High Throughput | 8192 / 1024 | 32 | | |
| 8xH200_SXM | Max Throughput | 8192 / 1024 | 64 | | |