TensorRT LLM Benchmarking#
Important
This benchmarking suite is a work in progress. Expect breaking API changes.
TensorRT LLM provides the trtllm-bench CLI, a packaged benchmarking utility that aims to make it easier for users to reproduce our officially published performance overview. trtllm-bench provides the following:
A streamlined way to build tuned engines for benchmarking for a variety of models and platforms.
An entirely Python workflow for benchmarking.
Ability to benchmark various flows and features within TensorRT LLM.
trtllm-bench executes all benchmarks using in-flight batching; for more information, see the in-flight batching section, which describes the concept in further detail.
Before Benchmarking#
For rigorous benchmarking where consistent and reproducible results are critical, proper GPU configuration is essential. These settings help maximize GPU utilization, eliminate performance variability, and ensure optimal conditions for accurate measurements. While not strictly required for normal operation, we recommend applying these configurations when conducting performance comparisons or publishing benchmark results.
Persistence mode#
Ensure persistence mode is enabled to maintain consistent GPU state:
sudo nvidia-smi -pm 1
GPU Clock Management#
Allow the GPU to dynamically adjust its clock speeds based on workload and temperature. While locking clocks at maximum frequency might seem beneficial, it can sometimes lead to thermal throttling and reduced performance. Reset GPU clocks using:
sudo nvidia-smi -rgc
Set power limits#
First query the maximum power limit:
nvidia-smi -q -d POWER
Then configure the GPU to operate at its maximum power limit for consistent performance:
sudo nvidia-smi -pl <max_power_limit>
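If you want to script this step, a hedged sketch that parses the reported maximum and applies it might look like the following (the exact nvidia-smi output format can vary by driver version, so verify the parsed value before applying it):
# Parse the first "Max Power Limit" value (in watts) and apply it as the power limit.
# Double-check the parsed number before running the second command.
max_power=$(nvidia-smi -q -d POWER | grep -m1 "Max Power Limit" | awk '{print $(NF-1)}' | cut -d. -f1)
sudo nvidia-smi -pl "$max_power"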
Boost settings#
A GPU may also support boost levels. First, query the available boost levels:
sudo nvidia-smi boost-slider -l
If supported, enable the boost slider using one of the available levels for maximum performance:
sudo nvidia-smi boost-slider --vboost <max_boost_slider>
Throughput Benchmarking#
Limitations and Caveats#
Validated Networks for Benchmarking#
While trtllm-bench should be able to run any network that TensorRT LLM supports, the networks listed on the Performance Overview page are the ones that have been validated extensively.
Tip
trtllm-bench
can automatically download the model from Hugging Face Model Hub.
Export your token in the HF_TOKEN
environment variable.
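For example, export the token before running trtllm-bench (the value shown is a placeholder for your own Hugging Face access token):
export HF_TOKEN=<your_hf_token>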
Supported Quantization Modes#
trtllm-bench
supports the following quantization modes:
None (no quantization applied)
FP8
NVFP4
For more information about quantization, refer to Numerical Precision and the support matrix of the supported quantization methods for each network.
Tip
Although TensorRT LLM supports more quantization modes than listed above, trtllm-bench currently configures only a smaller subset.
Preparing a Dataset#
The throughput benchmark utilizes a fixed JSON schema to specify requests. The schema is defined as follows:
Key | Required | Type | Description |
---|---|---|---|
task_id | Y | String | Unique identifier for the request. |
prompt | N* | String | Input text for a generation request. |
input_ids | Y* | List[Integer] | List of token IDs that make up the request prompt. |
output_tokens | Y | Integer | Number of generated tokens for this request. |
Tip
* Specifying prompt or input_ids is required. However, you cannot have both prompt and input_ids defined at the same time. If you specify input_ids, the prompt entry is ignored for request generation.
Refer to the following examples of valid entries for the benchmark:
Entries with a human-readable prompt and no logits.
{"task_id": 1, "prompt": "Generate an infinite response to the following: This is the song that never ends, it goes on and on my friend.", "output_tokens": 1000} {"task_id": 2, "prompt": "Generate an infinite response to the following: Na, na, na, na", "output_tokens": 1000}
Entries which contain logits.
{"task_id":0,"input_ids":[863,22056,25603,11943,8932,13195,3132,25032,21747,22213],"output_tokens":128} {"task_id":1,"input_ids":[14480,13598,15585,6591,1252,8259,30990,26778,7063,30065,21764,11023,1418],"output_tokens":128}
Tip
Specify each entry on one line so that the benchmarker can read a line and assume a complete entry. When creating a dataset, be sure that a complete JSON entry is on every line.
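If you build a custom dataset yourself, a minimal Python sketch along these lines writes one complete JSON entry per line (the file name and prompts are placeholders):
import json

# Placeholder requests; each dict follows the schema described above.
requests = [
    {"task_id": 0, "prompt": "Summarize the benefits of in-flight batching.", "output_tokens": 256},
    {"task_id": 1, "prompt": "Write a short poem about GPUs.", "output_tokens": 128},
]

# Write one complete JSON entry per line so the benchmarker can read line by line.
with open("custom_dataset.jsonl", "w") as f:
    for entry in requests:
        f.write(json.dumps(entry) + "\n")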
In order to prepare a synthetic dataset, you can use the provided script in the benchmarks/cpp
directory. For example, to generate a synthetic dataset of 1000 requests with a uniform ISL/OSL of
128/128 for meta-llama/Llama-3.1-8B, run:
python benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-3.1-8B token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 1000 > /tmp/synthetic_128_128.txt
Running with the PyTorch Workflow#
To benchmark the PyTorch backend (tensorrt_llm._torch), use the following command with the dataset generated in the previous steps. The throughput benchmark initializes the backend by tuning against the dataset provided via --dataset (or the other build mode settings described above). Note that CUDA graphs are enabled by default. You can pass additional PyTorch configuration with --extra_llm_api_options followed by the path to a YAML file. For more details, refer to the help text by running the command with --help.
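For illustration, a minimal sketch of a file passed to --extra_llm_api_options might set only the KV cache data type; kv_cache_dtype is described in the quantization section later in this document, and other available keys depend on your TensorRT LLM version:
# extra-llm-api-options.yaml (illustrative sketch)
# Valid values for kv_cache_dtype are "auto" and "fp8".
kv_cache_dtype: "fp8"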
Tip
The command below specifies the --model_path option. The model path is optional and only needed when you want to run a locally stored checkpoint. When using --model_path, the --model option is still required for reporting purposes and for looking up parameters for build heuristics.
trtllm-bench --model meta-llama/Llama-3.1-8B \
--model_path /Ckpt/Path/To/Llama-3.1-8B \
throughput \
--dataset /tmp/synthetic_128_128.txt \
--backend pytorch
# Example output
<snip verbose logging>
===========================================================
= PyTorch backend
===========================================================
Model: meta-llama/Llama-3.1-8B
Model Path: /Ckpt/Path/To/Llama-3.1-8B
TensorRT-LLM Version: 0.17.0
Dtype: bfloat16
KV Cache Dtype: None
Quantization: FP8
===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size: 1
PP Size: 1
Max Runtime Batch Size: 2048
Max Runtime Tokens: 4096
Scheduling Policy: Guaranteed No Evict
KV Memory Percentage: 90.00%
Issue Rate (req/sec): 7.6753E+14
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Number of requests: 3000
Average Input Length (tokens): 128.0000
Average Output Length (tokens): 128.0000
Token Throughput (tokens/sec): 20685.5510
Request Throughput (req/sec): 161.6059
Total Latency (ms): 18563.6825
When streaming is enabled via the --streaming flag, time to first token (TTFT) and inter-token latency (ITL) metrics are also recorded:
trtllm-bench --model meta-llama/Llama-3.1-8B \
--model_path /Ckpt/Path/To/Llama-3.1-8B \
throughput \
--dataset /tmp/synthetic_128_128.txt \
--backend pytorch \
--streaming
Alternatively, users can benchmark the low latency mode:
trtllm-bench --model meta-llama/Llama-3.1-8B \
--model_path /Ckpt/Path/To/Llama-3.1-8B \
latency \
--dataset /tmp/synthetic_128_128.txt \
--backend pytorch
Benchmarking with LoRA Adapters in PyTorch workflow#
The PyTorch workflow supports benchmarking with LoRA (Low-Rank Adaptation) adapters. This requires preparing a dataset with LoRA metadata and configuring the LoRA settings.
Preparing LoRA Dataset
Use prepare_dataset.py
with LoRA-specific options to generate requests with LoRA metadata:
python3 benchmarks/cpp/prepare_dataset.py \
--stdout \
--rand-task-id 0 1 \
--tokenizer /path/to/tokenizer \
--lora-dir /path/to/loras \
token-norm-dist \
--num-requests 100 \
--input-mean 128 \
--output-mean 128 \
--input-stdev 16 \
--output-stdev 24 \
> synthetic_lora_data.json
Key LoRA options:
--lora-dir: Parent directory containing LoRA adapter subdirectories named by their task IDs (e.g., 0/, 1/, etc.)
--rand-task-id: Range of LoRA task IDs to randomly assign to requests
--task-id: Fixed LoRA task ID for all requests (alternative to --rand-task-id)
The generated dataset will include LoRA request metadata. Below is an example of a single such request data entry:
{
"task_id": 0,
"input_ids": [3452, 88226, 102415, ...],
"output_tokens": 152,
"lora_request": {
"lora_name": "lora_0",
"lora_int_id": 0,
"lora_path": "/path/to/loras/0"
}
}
LoRA Configuration
Create an extra-llm-api-options.yaml
file with LoRA configuration:
lora_config:
lora_dir:
- /path/to/loras/0
- /path/to/loras/1
max_lora_rank: 64
lora_target_modules:
- attn_q
- attn_k
- attn_v
trtllm_modules_to_hf_modules:
attn_q: q_proj
attn_k: k_proj
attn_v: v_proj
Running LoRA Benchmark
trtllm-bench --model /path/to/base/model \
throughput \
--dataset synthetic_lora_data.json \
--backend pytorch \
--extra_llm_api_options extra-llm-api-options.yaml
Note
The LoRA directory structure should have task-specific subdirectories named by their task IDs (e.g., loras/0/, loras/1/). Each subdirectory should contain the LoRA adapter files for that specific task.
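For example, a directory layout along these lines satisfies that convention (the adapter file names shown are typical of PEFT-exported adapters and may differ in your setup):
/path/to/loras/
├── 0/
│   ├── adapter_config.json
│   └── adapter_model.safetensors
└── 1/
    ├── adapter_config.json
    └── adapter_model.safetensors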
Running multi-modal models in the PyTorch Workflow#
To benchmark multi-modal models with the PyTorch workflow, you can follow a similar approach as above.
First, prepare the dataset:
python ./benchmarks/cpp/prepare_dataset.py \
--tokenizer Qwen/Qwen2-VL-2B-Instruct \
--stdout \
dataset \
--dataset-name lmms-lab/MMMU \
--dataset-split test \
--dataset-image-key image \
--dataset-prompt-key question \
--num-requests 10 \
--output-len-dist 128,5 > mm_data.jsonl
It will download the media files to the /tmp directory and prepare the dataset with their paths. Note that the prompt fields contain text rather than tokenized IDs, because the prompt and the media (image/video) are processed by a preprocessor for multimodal models.
Sample dataset for multimodal:
{"task_id":0,"prompt":"Brahma Industries sells vinyl replacement windows to home improvement retailers nationwide. The national sales manager believes that if they invest an additional $25,000 in advertising, they would increase sales volume by 10,000 units. <image 1> What is the total contribution margin?","media_paths":["/tmp/tmp9so41y3r.jpg"],"output_tokens":126}
{"task_id":1,"prompt":"Let us compute for the missing amounts under work in process inventory, what is the cost of goods manufactured? <image 1>","media_paths":["/tmp/tmpowsrb_f4.jpg"],"output_tokens":119}
{"task_id":2,"prompt":"Tsuji is reviewing the price of a 3-month Japanese yen/U.S. dollar currency futures contract, using the currency and interest rate data shown below. Because the 3-month Japanese interest rate has just increased to .50%, Itsuji recognizes that an arbitrage opportunity exists nd decides to borrow $1 million U.S. dollars to purchase Japanese yen. Calculate the yen arbitrage profit from Itsuji's strategy, using the following data: <image 1> ","media_paths":["/tmp/tmpxhdvasex.jpg"],"output_tokens":126}
...
Run the benchmark:
trtllm-bench --model Qwen/Qwen2-VL-2B-Instruct \
throughput \
--dataset mm_data.jsonl \
--backend pytorch \
--num_requests 10 \
--max_batch_size 4 \
--modality image
Sample output:
===========================================================
= REQUEST DETAILS
===========================================================
Number of requests: 10
Number of concurrent requests: 5.3019
Average Input Length (tokens): 411.6000
Average Output Length (tokens): 128.7000
===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size: 1
PP Size: 1
EP Size: None
Max Runtime Batch Size: 4
Max Runtime Tokens: 12288
Scheduling Policy: GUARANTEED_NO_EVICT
KV Memory Percentage: 90.00%
Issue Rate (req/sec): 1.4117E+17
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Request Throughput (req/sec): 1.4439
Total Output Throughput (tokens/sec): 185.8351
Per User Output Throughput (tokens/sec/user): 38.1959
Per GPU Output Throughput (tokens/sec/gpu): 185.8351
Total Token Throughput (tokens/sec): 780.1607
Total Latency (ms): 6925.4963
Average request latency (ms): 3671.8441
-- Request Latency Breakdown (ms) -----------------------
[Latency] P50 : 3936.3022
[Latency] P90 : 5514.4701
[Latency] P95 : 5514.4701
[Latency] P99 : 5514.4701
[Latency] MINIMUM: 2397.1047
[Latency] MAXIMUM: 5514.4701
[Latency] AVERAGE: 3671.8441
===========================================================
= DATASET DETAILS
===========================================================
Dataset Path: /workspaces/tensorrt_llm/mm_data.jsonl
Number of Sequences: 10
-- Percentiles statistics ---------------------------------
Input Output Seq. Length
-----------------------------------------------------------
MIN: 167.0000 119.0000 300.0000
MAX: 1059.0000 137.0000 1178.0000
AVG: 411.6000 128.7000 540.3000
P50: 299.0000 128.0000 427.0000
P90: 1059.0000 137.0000 1178.0000
P95: 1059.0000 137.0000 1178.0000
P99: 1059.0000 137.0000 1178.0000
===========================================================
Notes and Limitations:
Only image datasets are supported for now.
--output-len-dist is a required argument for multimodal datasets.
The tokenizer is unused during the prepare step, but it is still a required argument.
Since the images are converted to tokens when the model is run, trtllm-bench uses a large default value for the maximum input sequence length when setting up the execution settings. You can modify this behavior by specifying a different value with the --max_input_len flag that suits your use case, as shown in the example below.
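For example, an illustrative invocation that overrides the default maximum input length (the value 8192 is only a placeholder):
trtllm-bench --model Qwen/Qwen2-VL-2B-Instruct \
throughput \
--dataset mm_data.jsonl \
--backend pytorch \
--modality image \
--max_input_len 8192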
Quantization in the PyTorch Flow#
To run a quantized benchmark with trtllm-bench utilizing the PyTorch flow, you will need to use a pre-quantized checkpoint. For the Llama-3.1 models, TensorRT LLM provides pre-quantized checkpoints on Hugging Face.
To understand more about how to quantize your own checkpoints, refer to the ModelOpt documentation.
trtllm-bench utilizes the hf_quant_config.json file present in the pre-quantized checkpoints above. The configuration file is present in checkpoints quantized with TensorRT Model Optimizer and describes the compute and KV cache quantization that the checkpoint was compiled with. For example, from one of the checkpoints above:
{
"producer": {
"name": "modelopt",
"version": "0.23.0rc1"
},
"quantization": {
"quant_algo": "FP8",
"kv_cache_quant_algo": null
}
}
The checkpoints above are quantized to run with a compute precision of FP8 and default to no KV cache quantization (full FP16 cache). When running trtllm-bench throughput, the benchmark automatically selects a KV cache quantization best suited to the compute precision in the checkpoint if kv_cache_quant_algo is specified as null; otherwise, it is forced to match the specified non-null KV cache quantization. The following are the mappings that trtllm-bench follows when a checkpoint does not specify a KV cache quantization algorithm:
Checkpoint Compute Quant | Checkpoint KV Cache Quant | trtllm-bench KV Cache Dtype | Note |
---|---|---|---|
None | None | auto | In this case, a quantization config doesn't exist. |
FP8 | FP8 | fp8 | Matches the checkpoint |
FP8 | null | fp8 | Set to fp8 |
NVFP4 | null | fp8 | Set to fp8 |
If you would like to force the KV cache quantization, you can specify the following in the YAML file passed via --extra_llm_api_options to force the precision when the checkpoint precision is null:
kv_cache_dtype: "fp8"
Tip
The two valid values for kv_cache_dtype
are auto
and fp8
.
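Putting this together, an illustrative end-to-end sketch of forcing the FP8 KV cache for a benchmark run (the file name kv_cache.yaml is arbitrary):
# Write the override file and pass it to trtllm-bench via --extra_llm_api_options.
cat > kv_cache.yaml <<EOF
kv_cache_dtype: "fp8"
EOF

trtllm-bench --model meta-llama/Llama-3.1-8B \
throughput \
--dataset /tmp/synthetic_128_128.txt \
--backend pytorch \
--extra_llm_api_options kv_cache.yaml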