Benchmarking Default Performance
This section discusses how to build an engine for the model using the LLM-API and benchmark it using `trtllm-bench`.
Disclaimer: While performance numbers shown here are real, they are only for demonstration purposes. Differences in environment, SKU, interconnect, and workload can all significantly affect performance and lead to your results differing from what is shown here.
Before You Begin: TensorRT-LLM LLM-API
TensorRT-LLM’s LLM-API aims to make getting started with TensorRT-LLM quick and easy. For example, the following script instantiates Llama-3.3-70B-Instruct and runs inference on a small set of prompts. For those familiar with TensorRT-LLM’s CLI workflow, the call to `LLM()` handles converting the model checkpoint and building the engine in one line.
```python
# quickstart.py
from tensorrt_llm import LLM, SamplingParams

def main():
    prompts = [
        "Hello, I am",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
    llm = LLM(
        model="meta-llama/Llama-3.3-70B-Instruct",  # HuggingFace model name, no need to download the checkpoint beforehand
        tensor_parallel_size=4
    )
    outputs = llm.generate(prompts, sampling_params)

    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

if __name__ == '__main__':
    main()
```
Troubleshooting Tips and Pitfalls To Avoid
Since we are running on multiple GPUs, MPI is used to spawn a process for each GPU. This imposes the following requirements:

- The entrypoint to the script must be guarded with `if __name__ == '__main__':`. This requirement comes from `mpi4py`.
- Depending on your environment, you may need to wrap the `python` command with `mpirun`. For example, the script above could be launched with `mpirun -n 1 --oversubscribe --allow-run-as-root python quickstart.py`. For running on multiple GPUs on a single node, as this example does, the `mpirun` prefix is usually not required, but if you are getting MPI errors you should add it. Additionally, the `-n 1`, which launches just one process, is intentional: TensorRT-LLM handles spawning the processes for the remaining GPUs.
- If you get a HuggingFace access error when loading the Llama weights, this is likely because the model is gated. Request access on the HuggingFace page for the model, then follow the instructions in HuggingFace's quickstart guide to authenticate in your environment (see the authentication sketch after this list).
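If you prefer to authenticate from Python rather than the CLI, the snippet below is a minimal sketch using the `huggingface_hub` package; the token value is a placeholder for your own access token, and running `huggingface-cli login` in a shell works equally well.

```python
# Authenticate with the Hugging Face Hub before loading gated weights.
# The token string below is a placeholder - substitute your own access token.
from huggingface_hub import login

login(token="<your_hf_access_token>")
```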
Building and Saving the Engine
The following script builds the engine and saves it to disk using `.save()`. Just like the previous example, this script and all subsequent scripts might need to be run via `mpirun`.
```python
from tensorrt_llm import LLM

def main():
    llm = LLM(
        model="/scratch/Llama-3.3-70B-Instruct",
        tensor_parallel_size=4
    )
    llm.save("baseline")

if __name__ == '__main__':
    main()
```
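Once saved, the engine can be reused without rebuilding. The sketch below assumes the saved engine directory (`baseline`) can be passed directly as the `model` argument of `LLM()`, which is how the LLM-API loads prebuilt engines; the prompt and sampling parameters are purely illustrative.

```python
# Reload the previously saved engine instead of rebuilding it.
from tensorrt_llm import LLM, SamplingParams

def main():
    # "baseline" is the directory written by llm.save() above; keep the
    # tensor-parallel size consistent with the value used at build time.
    llm = LLM(model="baseline", tensor_parallel_size=4)
    outputs = llm.generate(
        ["The capital of France is"],
        SamplingParams(temperature=0.8, top_p=0.95),
    )
    print(outputs[0].outputs[0].text)

if __name__ == '__main__':
    main()
```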
Building and Saving Engines via CLI
TensorRT-LLM also has a command line interface for building and saving engines. This workflow consists of two steps:

1. Convert the model checkpoint (HuggingFace, NeMo) to a TensorRT-LLM checkpoint via `convert_checkpoint.py`. Each supported model has a `convert_checkpoint.py` associated with it, located in that model's examples folder; for example, the `convert_checkpoint.py` script for Llama models lives in the Llama examples folder linked below.
2. Build the engine by passing the TensorRT-LLM checkpoint to the `trtllm-build` command. The `trtllm-build` command is installed automatically when the `tensorrt_llm` package is installed.
The README in the examples folder for supported models walks through building engines using this flow for a wide variety of situations. The examples folder for Llama models can be found at https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama.
Benchmarking with trtllm-bench
`trtllm-bench` provides a command line interface for benchmarking the throughput and latency of saved engines.
Prepare Dataset
`trtllm-bench` expects to be passed a dataset of requests to run through the model. This guide creates a dummy dataset of 1000 requests, each with an input and output sequence length of 2048. TensorRT-LLM provides the `prepare_dataset.py` script to produce the dataset. To use it, clone the TensorRT-LLM repository and run the following command:
```bash
python benchmarks/cpp/prepare_dataset.py --stdout \
    --tokenizer /path/to/hf/Llama-3.3-70B-Instruct/ \
    token-norm-dist --input-mean 2048 --output-mean 2048 \
    --input-stdev 0 --output-stdev 0 --num-requests 1000 \
    > synthetic_2048_2048_1000.txt
```
`trtllm-bench` can also take in real data; see the `trtllm-bench` documentation for more details on the required format.
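If you want to sanity-check the generated file before benchmarking, the sketch below assumes the output of `prepare_dataset.py` is newline-delimited JSON with one request per line; the exact field names depend on your TensorRT-LLM version, so the script simply prints whatever keys it finds.

```python
# inspect_dataset.py - quick sanity check of the synthetic dataset.
# Assumes newline-delimited JSON, one request per line.
import json

with open("synthetic_2048_2048_1000.txt") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"Number of requests: {len(records)}")        # expect 1000
print(f"Fields in first request: {sorted(records[0])}")
```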
Running Throughput and Latency Benchmarks
To benchmark the baseline engine built in the previous script, run the following commands. Again, due to the multi-GPU nature of the workload, you may need to prefix the `trtllm-bench` command with `mpirun -n 1 --oversubscribe --allow-run-as-root`.
Throughput
```bash
trtllm-bench \
    --model /path/to/hf/Llama-3.3-70B-Instruct/ \
    throughput \
    --dataset /path/to/dataset/synthetic_2048_2048_1000.txt \
    --engine_dir /path/to/engines/baseline  # replace baseline with the name used in llm.save()
```
This command sends all 1000 requests to the model immediately. Run `trtllm-bench throughput -h` to see a list of options that let you control the request rate and cap the total number of requests if the benchmark is taking too long. For reference, internal testing of the above command took around 20 minutes on four NVLink-connected H100-SXM-80GB GPUs.
Running this command will provide a throughput overview like this:
```
===========================================================
= ENGINE DETAILS
===========================================================
Model: /scratch/Llama-3.3-70B-Instruct/
Engine Directory: /scratch/grid_search_engines/baseline
TensorRT-LLM Version: 0.16.0
Dtype: bfloat16
KV Cache Dtype: None
Quantization: None
Max Sequence Length: 131072
===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size: 4
PP Size: 1
Max Runtime Batch Size: 2048
Max Runtime Tokens: 8192
Scheduling Policy: Guaranteed No Evict
KV Memory Percentage: 90.00%
Issue Rate (req/sec): 7.9353E+13
===========================================================
= PERFORMANCE OVERVIEW
===========================================================
Number of requests: 1000
Average Input Length (tokens): 2048.0000
Average Output Length (tokens): 2048.0000
Token Throughput (tokens/sec): 1585.7480
Request Throughput (req/sec): 0.7743
Total Latency (ms): 1291504.1051
===========================================================
```
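The headline numbers above are internally consistent: the reported token throughput evidently counts generated tokens, so dividing it by the 2048-token output length recovers the request throughput, and 1000 requests divided by that rate recovers the total latency. A quick arithmetic check using the reported values:

```python
# Cross-check the throughput overview using the values reported above.
num_requests = 1000
avg_output_len = 2048.0
token_throughput = 1585.7480                                   # tokens/sec

request_throughput = token_throughput / avg_output_len         # ~0.7743 req/sec
total_latency_ms = num_requests / request_throughput * 1000.0  # ~1,291,504 ms

print(f"Request throughput: {request_throughput:.4f} req/sec")
print(f"Total latency:      {total_latency_ms:.1f} ms")
```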
Latency
```bash
trtllm-bench \
    --model /path/to/hf/Llama-3.3-70B-Instruct/ \
    latency \
    --dataset /path/to/dataset/synthetic_2048_2048_1000.txt \
    --num-requests 100 \
    --warmup 10 \
    --engine_dir /path/to/engines/baseline  # replace baseline with the name used in llm.save()
```
The latency benchmark enforces a batch size of 1 to accurately measure latency, which can significantly increase testing duration. In the example above, the total number of requests is limited to 100 via `--num-requests` to keep the test duration manageable. This example benchmark was designed to produce very stable numbers, but in real scenarios even 100 requests is likely more than you need and can take a long time to complete (in this case study it took about an hour and a half). Reducing the number of requests to 10 would still provide accurate data and enable faster development iterations. In general, adjust the number of requests to your needs. Run `trtllm-bench latency -h` to see the other configurable options.
Running this command will provide a latency overview like this:
```
===========================================================
= ENGINE DETAILS
===========================================================
Model: /scratch/Llama-3.3-70B-Instruct/
Engine Directory: /scratch/grid_search_engines/baseline
TensorRT-LLM Version: 0.16.0
Dtype: bfloat16
KV Cache Dtype: None
Quantization: None
Max Input Length: 1024
Max Sequence Length: 131072
===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size: 4
PP Size: 1
Max Runtime Batch Size: 1
Max Runtime Tokens: 8192
Scheduling Policy: Guaranteed No Evict
KV Memory Percentage: 90.00%
===========================================================
= GENERAL OVERVIEW
===========================================================
Number of requests: 100
Average Input Length (tokens): 2048.0000
Average Output Length (tokens): 2048.0000
Average request latency (ms): 63456.0704
===========================================================
= THROUGHPUT OVERVIEW
===========================================================
Request Throughput (req/sec): 0.0158
Total Token Throughput (tokens/sec): 32.2742
Generation Token Throughput (tokens/sec): 32.3338
===========================================================
= LATENCY OVERVIEW
===========================================================
Total Latency (ms): 6345624.0554
Average time-to-first-token (ms): 147.7502
Average inter-token latency (ms): 30.9274
Acceptance Rate (Speculative): 1.00
===========================================================
= GENERATION LATENCY BREAKDOWN
===========================================================
MIN (ms): 63266.8804
MAX (ms): 63374.7770
AVG (ms): 63308.3201
P90 (ms): 63307.1885
P95 (ms): 63331.7136
P99 (ms): 63374.7770
===========================================================
= ACCEPTANCE BREAKDOWN
===========================================================
MIN: 1.00
MAX: 1.00
AVG: 1.00
P90: 1.00
P95: 1.00
P99: 1.00
===========================================================
```
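As a sanity check on the latency overview, the average request latency should decompose into the time-to-first-token plus one inter-token step for each of the remaining output tokens. A minimal check using the numbers reported above:

```python
# Reconstruct the average request latency from TTFT and inter-token latency.
avg_ttft_ms = 147.7502      # average time-to-first-token
avg_itl_ms = 30.9274        # average inter-token latency
output_len = 2048           # output tokens per request

reconstructed_ms = avg_ttft_ms + (output_len - 1) * avg_itl_ms
print(f"Reconstructed average request latency: {reconstructed_ms:.1f} ms")  # ~63456 ms reported
```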
Results
The baseline engine achieves the following performance for token throughput, request throughput, average time to first token, and average inter-token latency. These metrics will be analyzed throughout the guide.
| Metric | Value |
|---|---|
| Token Throughput (tokens/sec) | 1564.3040 |
| Request Throughput (req/sec) | 0.7638 |
| Average Time To First Token (ms) | 147.6976 |
| Average Inter-Token Latency (ms) | 31.3276 |
The following sections show ways you can improve these metrics using different configuration options.