trtllm-eval#
About#
The `trtllm-eval` command provides developers with a unified entry point for accuracy evaluation. It shares the core evaluation logic with the accuracy test suite of TensorRT LLM.

`trtllm-eval` is built on the offline API, the LLM API. Compared to the online `trtllm-serve`, the offline API provides clearer error messages and simplifies the debugging workflow.
The following tasks are currently supported:
| Dataset | Task | Metric | Default ISL | Default OSL |
|---|---|---|---|---|
| CNN Dailymail | summarization | rouge | 924 | 100 |
| MMLU | QA; multiple choice | accuracy | 4,094 | 2 |
| GSM8K | QA; regex matching | accuracy | 4,096 | 256 |
| GPQA | QA; multiple choice | accuracy | 32,768 | 4,096 |
| JSON mode eval | structured generation | accuracy | 1,024 | 512 |
Note
`trtllm-eval` originates from the TensorRT LLM accuracy test suite and serves as a lightweight utility for verifying and debugging accuracy. At this time, `trtllm-eval` is intended solely for development and is not recommended for production use.
Usage and Examples#
Some evaluation tasks (e.g., GSM8K and GPQA) depend on the `lm_eval` package. To run these tasks, install `lm_eval` with:

```shell
pip install -r requirements-dev.txt
```

Alternatively, you can install the `lm_eval` version specified in `requirements-dev.txt`.
Here are some examples:

```shell
# Evaluate Llama-3.1-8B-Instruct on MMLU
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct mmlu

# Evaluate Llama-3.1-8B-Instruct on GSM8K
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct gsm8k

# Evaluate Llama-3.3-70B-Instruct on GPQA Diamond
trtllm-eval --model meta-llama/Llama-3.3-70B-Instruct gpqa_diamond
```
The `--model` argument accepts either a Hugging Face model ID or a local checkpoint path. By default, `trtllm-eval` runs the model with the PyTorch backend; you can pass `--backend tensorrt` to switch to the TensorRT backend.

Alternatively, the `--model` argument also accepts a local path to pre-built TensorRT engines. In this case, you should pass the Hugging Face tokenizer path to the `--tokenizer` argument.

For more details, see `trtllm-eval --help` and `trtllm-eval <task> --help`.
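As a sketch of the engine workflow described above (the engine path below is a placeholder for your own build artifacts, and the tokenizer is assumed to come from the matching Hugging Face checkpoint):

```shell
# Evaluate pre-built TensorRT engines; --tokenizer supplies the HF tokenizer
# because an engine directory does not bundle one.
trtllm-eval --model /path/to/llama-3.1-8b-engines \
            --tokenizer meta-llama/Llama-3.1-8B-Instruct \
            mmlu
```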
Syntax#
trtllm-eval#
trtllm-eval [OPTIONS] COMMAND [ARGS]...
Options
- --model <model>#
Required. A model name, a Hugging Face checkpoint path, or a TensorRT engine path.
- --tokenizer <tokenizer>#
Path or name of the tokenizer. Specify this value only when using a TensorRT engine as the model.
- --backend <backend>#
Set to 'pytorch' for the PyTorch path. The default is the C++ path.
- Options:
pytorch | tensorrt
- --log_level <log_level>#
The logging level.
- Options:
internal_error | error | warning | info | verbose | debug | trace
- --max_beam_width <max_beam_width>#
Maximum number of beams for beam search decoding.
- --max_batch_size <max_batch_size>#
Maximum number of requests that the engine can schedule.
- --max_num_tokens <max_num_tokens>#
Maximum number of batched input tokens after padding is removed in each batch.
- --max_seq_len <max_seq_len>#
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config.
- --tp_size <tp_size>#
Tensor parallelism size.
- --pp_size <pp_size>#
Pipeline parallelism size.
- --ep_size <ep_size>#
Expert parallelism size.
- --gpus_per_node <gpus_per_node>#
Number of GPUs per node. Defaults to None, in which case it is detected automatically.
- --kv_cache_free_gpu_memory_fraction <kv_cache_free_gpu_memory_fraction>#
Free GPU memory fraction reserved for KV Cache, after allocating model weights and buffers.
- --trust_remote_code#
Flag for HF transformers.
- --extra_llm_api_options <extra_llm_api_options>#
Path to a YAML file that overwrites the parameters specified by trtllm-eval.
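A minimal sketch of such a YAML file; the keys below are illustrative assumptions (available option names vary across TensorRT LLM versions, so check the LLM API reference for your release):

```yaml
# extra_llm_options.yaml -- hypothetical override file for --extra_llm_api_options
enable_chunked_prefill: true
kv_cache_config:
  enable_block_reuse: false
```

You would then pass it as `trtllm-eval --model <model> --extra_llm_api_options extra_llm_options.yaml <task>`.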
cnn_dailymail#
trtllm-eval cnn_dailymail [OPTIONS]
Options
- --dataset_path <dataset_path>#
The path to CNN Dailymail dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples <num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed <random_seed>#
Random seed for dataset processing.
- --rouge_path <rouge_path>#
The path to the rouge repository. If unspecified, the repository is downloaded from the HF hub.
- --apply_chat_template#
Whether to apply chat template.
- --system_prompt <system_prompt>#
System prompt.
- --max_input_length <max_input_length>#
Maximum prompt length.
- --max_output_length <max_output_length>#
Maximum generation length.
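Putting these options together, a quick smoke test on a subset of the dataset (the model and sample count below are examples, not requirements) might look like:

```shell
# Summarization accuracy on a 100-sample subset with a fixed seed,
# instead of the full CNN Dailymail test set
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
            cnn_dailymail --num_samples 100 --random_seed 0
```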
gpqa_diamond#
trtllm-eval gpqa_diamond [OPTIONS]
Options
- --dataset_path <dataset_path>#
The path to GPQA dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples <num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed <random_seed>#
Random seed for dataset processing.
- --apply_chat_template#
Whether to apply chat template.
- --system_prompt <system_prompt>#
System prompt.
- --max_input_length <max_input_length>#
Maximum prompt length.
- --max_output_length <max_output_length>#
Maximum generation length.
gpqa_extended#
trtllm-eval gpqa_extended [OPTIONS]
Options
- --dataset_path <dataset_path>#
The path to GPQA dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples <num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed <random_seed>#
Random seed for dataset processing.
- --apply_chat_template#
Whether to apply chat template.
- --system_prompt <system_prompt>#
System prompt.
- --max_input_length <max_input_length>#
Maximum prompt length.
- --max_output_length <max_output_length>#
Maximum generation length.
gpqa_main#
trtllm-eval gpqa_main [OPTIONS]
Options
- --dataset_path <dataset_path>#
The path to GPQA dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples <num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed <random_seed>#
Random seed for dataset processing.
- --apply_chat_template#
Whether to apply chat template.
- --system_prompt <system_prompt>#
System prompt.
- --max_input_length <max_input_length>#
Maximum prompt length.
- --max_output_length <max_output_length>#
Maximum generation length.
gsm8k#
trtllm-eval gsm8k [OPTIONS]
Options
- --dataset_path <dataset_path>#
The path to GSM8K dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples <num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed <random_seed>#
Random seed for dataset processing.
- --apply_chat_template#
Whether to apply chat template.
- --fewshot_as_multiturn#
Apply fewshot as multiturn.
- --system_prompt <system_prompt>#
System prompt.
- --max_input_length <max_input_length>#
Maximum prompt length.
- --max_output_length <max_output_length>#
Maximum generation length.
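As an illustration of the chat-template options above (the model is an example; `--fewshot_as_multiturn` is typically combined with `--apply_chat_template` so the few-shot examples become separate chat turns):

```shell
# GSM8K with the chat template applied and few-shot examples as multi-turn messages
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct \
            gsm8k --apply_chat_template --fewshot_as_multiturn
```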
json_mode_eval#
trtllm-eval json_mode_eval [OPTIONS]
Options
- --dataset_path <dataset_path>#
The path to JSON Mode Eval dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples <num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed <random_seed>#
Random seed for dataset processing.
- --system_prompt <system_prompt>#
System prompt.
- --max_input_length <max_input_length>#
Maximum prompt length.
- --max_output_length <max_output_length>#
Maximum generation length.
mmlu#
trtllm-eval mmlu [OPTIONS]
Options
- --dataset_path <dataset_path>#
The path to the MMLU dataset. To prepare the dataset: `wget https://people.eecs.berkeley.edu/~hendrycks/data.tar && tar -xf data.tar`. If unspecified, the dataset is downloaded automatically.
- --num_samples <num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --num_fewshot <num_fewshot>#
Number of fewshot.
- --random_seed <random_seed>#
Random seed for dataset processing.
- --apply_chat_template#
Whether to apply chat template.
- --system_prompt <system_prompt>#
System prompt.
- --max_input_length <max_input_length>#
Maximum prompt length.
- --max_output_length <max_output_length>#
Maximum generation length.
- --check_accuracy#
Whether to check the accuracy against a threshold.
- --accuracy_threshold <accuracy_threshold>#
Accuracy threshold used when --check_accuracy is set.
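The dataset-preparation commands from the `--dataset_path` help above can be run end to end as follows (this assumes the archive extracts to a local `data/` directory):

```shell
# Download and unpack the MMLU dataset locally
wget https://people.eecs.berkeley.edu/~hendrycks/data.tar
tar -xf data.tar

# Evaluate against the local copy instead of downloading at run time
trtllm-eval --model meta-llama/Llama-3.1-8B-Instruct mmlu --dataset_path data
```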
mmmu#
trtllm-eval mmmu [OPTIONS]
Options
- --dataset_path <dataset_path>#
The path to MMMU dataset. If unspecified, the dataset is downloaded from HF hub.
- --num_samples <num_samples>#
Number of samples to run the evaluation; None means full dataset.
- --random_seed <random_seed>#
Random seed for dataset processing.
- --system_prompt <system_prompt>#
The system prompt to be added to the prompt. If specified, it adds {'role': 'system', 'content': system_prompt} to the prompt.
- --max_input_length <max_input_length>#
Maximum prompt length.
- --max_output_length <max_output_length>#
Maximum generation length.