> [!IMPORTANT]
> As of TensorRT-LLM v0.10, the methodology for these performance benchmarks has changed to use in-flight batching; static benchmarking is no longer used. These numbers are initial measurements and are expected to improve in future releases.

## Overview

This document summarizes performance measurements of TensorRT-LLM on H200 and H100 (Hopper), L40S (Ada), and A100 (Ampere) GPUs for a few key models.

The data in the following tables is provided as a reference point to help users validate observed performance. It should not be taken as the peak performance that TensorRT-LLM can deliver.

## Known Issues

The following issues are being addressed to improve the efficiency of TensorRT-LLM.

### Fused Matmul + Gated-SiLU (LLaMA)

The current implementation combines two Matmul operations into one Matmul followed by a separate SwiGLU kernel (when `--use_fused_mlp=enable` is enabled). There is also a more efficient implementation that runs a single fused Matmul + SwiGLU kernel for FP8 on Hopper (when `--use_fused_mlp=enable --gemm_swiglu_plugin fp8` is enabled). The `gemm_swiglu_plugin` will support more data types and GPU architectures in a future release.
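
Both options are applied at engine-build time. A minimal sketch of how they might be supplied to `trtllm-build` (the checkpoint and output paths are placeholders, and the exact CLI syntax can vary between releases):

```shell
# Illustrative only: the flag names come from the paragraph above; paths are
# placeholders and the exact syntax may differ between TensorRT-LLM releases.
trtllm-build --checkpoint_dir ./llama-70b-fp8-ckpt \
             --output_dir ./llama-70b-fp8-engine \
             --use_fused_mlp=enable \
             --gemm_swiglu_plugin fp8
```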

## Throughput Measurements

The table below shows performance data collected when a local inference client is fed requests at an infinite rate (no delay between messages); it represents the throughput of a client-server scenario under maximum load.

The performance numbers below were collected using the steps described in this document.

All data in the table below was generated using version 0.14.0 and presents token throughput in tokens/second.

| Model | Input/Output Lengths | TP Size | H200 141GB HBM3 (FP8) | H100 80GB HBM3 (FP8) | H100 80GB HBM3 (FP16) | A100-SXM4-80GB (FP16) | A100-PCIE-80GB (FP16) | L40S (FP8) |
|:------|:---------------------|:--------|:----------------------|:---------------------|:----------------------|:----------------------|:----------------------|:-----------|
| LLaMA v3 70B | 1000/1000 | 1 | 2594.2199 | 464.5243 | | | | |
| | | 2 | 4574.1197 | 4092.3267 | 776.9965 | 468.5805 | 259.1155 | |
| | | 4 | 7612.2487 | 6925.0844 | 3730.2064 | 1765.9123 | 987.1971 | 1159.357 |
| | | 8 | 13075.5194 | 10733.0804 | 5963.0914 | 3054.8915 | 960.3737 | 1173.3517 |
| | 128/128 | 1 | 3904.1639 | 2551.6384 | | | | |
| | | 2 | 5343.8677 | 5191.7428 | 3183.9714 | 1334.903 | 806.1477 | |
| | | 4 | 8829.1049 | 8540.5362 | 5837.9598 | 2421.4383 | 1275.5474 | 1427.9115 |
| | | 8 | 16359.1322 | 15498.2004 | 10597.6556 | 4474.1621 | 1223.1747 | 1377.473 |
| | 128/2048 | 1 | 3613.7474 | 418.3639 | | | | |
| | | 2 | 7112.2959 | 5852.0185 | 817.52 | 511.6257 | | |
| | | 4 | 12772.8148 | 8998.3742 | 5072.0345 | 2484.2018 | 1471.9105 | 1771.4437 |
| | | 8 | 19722.5974 | 15099.0633 | 7554.2141 | 4463.6602 | 1589.1759 | 1953.7918 |
| | 128/4096 | 1 | 2409.6881 | | | | | |
| | | 2 | 5687.3482 | 3513.0941 | 413.3767 | 273.5871 | | |
| | | 4 | 8937.3115 | 6718.5895 | 3093.7358 | 1688.0132 | 1231.8104 | 1279.2496 |
| | | 8 | 13976.1386 | 9279.1013 | 5001.2743 | 2948.5374 | 1350.794 | 1494.0776 |
| | 2048/128 | 1 | 457.5772 | 241.7561 | | | | |
| | | 2 | 699.5582 | 690.9961 | 328.0399 | 145.088 | 91.1746 | |
| | | 4 | 1035.6523 | 1008.8318 | 670.6725 | 278.5717 | 150.2619 | 168.7886 |
| | | 8 | 2055.7245 | 1996.2653 | 1288.7599 | 546.9599 | 140.0144 | 160.2741 |
| | 2048/2048 | 1 | 1802.1116 | 204.0931 | | | | |
| | | 2 | 3487.2497 | 2444.6903 | 165.6522 | 126.1101 | | |
| | | 4 | 6126.7196 | 4850.8285 | 2386.6556 | 1230.1833 | 822.2269 | 876.6085 |
| | | 8 | 9784.0193 | 7432.6659 | 3991.2123 | 2144.3042 | 883.4809 | 994.94 |
| | 500/2000 | 1 | 2822.7846 | 389.8823 | | | | |
| | | 2 | 6175.7623 | 4601.857 | 687.5386 | 430.6093 | | |
| | | 4 | 10783.8925 | 9018.9053 | 3698.3674 | 2113.3936 | 1248.8319 | 1468.7827 |
| | | 8 | 17631.9756 | 11375.9582 | 6321.3679 | 3673.5693 | 1321.8541 | 1636.4588 |
| | 5000/500 | 1 | 532.2603 | 123.8543 | | | | |
| | | 2 | 931.8255 | 897.4263 | 227.9005 | 117.5698 | 75.35 | |
| | | 4 | 1399.7865 | 1316.2865 | 831.2804 | 362.3465 | 209.8052 | 234.7343 |
| | | 8 | 2725.1283 | 2469.5585 | 1446.3508 | 662.5725 | 202.0719 | 231.9027 |
| LLaMA v3.1 405B | 1000/1000 | 8 | 3391.0372 | | | | | |
| | 128/128 | 8 | 3766.2785 | | | | | |
| | 128/2048 | 8 | 5952.1416 | | | | | |
| | 128/4096 | 8 | 3944.117 | | | | | |
| | 20000/2000 | 8 | 481.5732 | | | | | |
| | 2048/128 | 8 | 444.5735 | | | | | |
| | 2048/2048 | 8 | 2604.8557 | | | | | |
| | 500/2000 | 8 | 4805.86 | | | | | |
| | 5000/500 | 8 | 655.9754 | | | | | |
| LLaMA v3.1 70B | 1000/1000 | 1 | 2585.0953 | 410.286 | | | | |
| | | 2 | 4600.9616 | 4116.4444 | 785.4931 | 468.6383 | 257.972 | |
| | | 4 | 7607.5304 | 6932.8808 | 3774.676 | 1762.6831 | 989.4082 | 1161.4814 |
| | | 8 | 13081.434 | 10730.156 | 5978.4573 | 3190.0211 | 959.8463 | 1188.1193 |
| | 128/128 | 1 | 3897.2623 | 2459.6003 | | | | |
| | | 2 | 5357.0227 | 5194.8171 | 3207.2866 | 1346.9692 | 806.7215 | |
| | | 4 | 8826.9618 | 8542.3012 | 5846.8413 | 2420.8665 | 1272.6755 | 1438.0446 |
| | | 8 | 16382.9807 | 15533.1169 | 10649.4968 | 4572.3445 | 1212.0566 | 1381.7051 |
| | 128/2048 | 1 | 3612.2603 | 445.7773 | | | | |
| | | 2 | 7054.7235 | 5869.3998 | 822.1912 | 483.1299 | | |
| | | 4 | 12763.4114 | 9017.4377 | 4982.6225 | 2492.4036 | 1435.236 | 1763.522 |
| | | 8 | 19266.0398 | 15190.1652 | 7605.5295 | 4254.2871 | 1609.2473 | 1944.1251 |
| | 128/4096 | 1 | 2415.1981 | | | | | |
| | | 2 | 5671.9561 | 3518.782 | 419.0178 | 272.9137 | | |
| | | 4 | 8939.8227 | 6431.2702 | 3083.8794 | 1685.9677 | 1212.5416 | 1280.3778 |
| | | 8 | 13974.2854 | 9168.709 | 4981.9765 | 3067.5452 | 1310.091 | 1499.2441 |
| | 20000/2000 | 1 | 240.7202 | | | | | |
| | | 2 | 614.318 | 397.6801 | | | | |
| | | 4 | 1030.9528 | 851.8542 | 369.4269 | 179.5181 | 126.7676 | 140.5565 |
| | | 8 | 1898.9762 | 1354.5333 | | 362.9368 | 156.5767 | 141.1584 |
| | 2048/128 | 1 | 458.1948 | 244.1842 | | | | |
| | | 2 | 692.3911 | 697.3907 | 322.7016 | 144.7921 | 95.0306 | |
| | | 4 | 1034.5773 | 1001.0771 | 688.0344 | 278.4018 | 150.6795 | 169.0386 |
| | | 8 | 2070.8157 | 1966.6072 | 1316.3086 | 550.4751 | 142.6166 | 163.6749 |
| | 2048/2048 | 1 | 1797.6743 | 209.1707 | | | | |
| | | 2 | 3518.0774 | 2445.0093 | 166.792 | 126.1127 | | |
| | | 4 | 6112.9026 | 4838.5272 | 2393.1359 | 1231.0359 | 823.4777 | 876.2254 |
| | | 8 | 9716.1934 | 7434.8117 | 4023.6978 | 2171.5323 | 858.6602 | 1001.3649 |
| | 500/2000 | 1 | 2826.6665 | | | | | |
| | | 2 | 6106.5855 | 4605.9226 | 700.5415 | 430.6129 | | |
| | | 4 | 10816.8283 | 9205.3766 | 3781.082 | 2096.2441 | 1176.418 | 1470.0826 |
| | | 8 | 17693.705 | 13109.4437 | 6205.2658 | 3486.7891 | 1306.35 | 1639.2778 |
| | 5000/500 | 1 | 533.6128 | 125.4236 | | | | |
| | | 2 | 936.7014 | 886.6758 | 228.874 | 116.9529 | 76.1601 | |
| | | 4 | 1386.4827 | 1313.893 | 849.1091 | 362.9361 | 209.2045 | 236.117 |
| | | 8 | 2711.5057 | 2444.9643 | 1420.5163 | 670.3742 | 203.8008 | 230.3084 |
| LLaMA v3.1 8B | 1000/1000 | 1 | 16414.6988 | 14108.0361 | 7054.5156 | 3634.3886 | 3165.3542 | 3726.7552 |
| | 128/128 | 1 | 27778.8885 | 26933.1886 | 15571.6549 | 6701.7958 | 5338.0166 | 8639.7933 |
| | 128/2048 | 1 | 22948.5383 | 18995.2523 | 9150.7477 | 4963.4443 | 4250.6391 | 5101.6652 |
| | 128/4096 | 1 | 15583.3035 | 11815.449 | 5368.9227 | 3011.3335 | 2568.5398 | 2774.5363 |
| | 20000/2000 | 1 | 1649.5453 | 1301.4754 | 562.8735 | 316.533 | 291.4776 | 270.5404 |
| | 2048/128 | 1 | 3619.4309 | 3460.3545 | 1904.3259 | 795.389 | 611.8446 | 986.9134 |
| | 2048/2048 | 1 | 11032.9729 | 8777.6623 | 4159.6857 | 2264.9513 | 2011.1215 | 2018.303 |
| | 500/2000 | 1 | 19510.4015 | 14993.328 | 7498.3331 | 3945.1912 | 3374.7133 | 4065.3921 |
| | 5000/500 | 1 | 3787.6721 | 3258.2001 | 1708.0353 | 790.6631 | 703.56 | 855.9822 |
| Mistral 7B | 1000/1000 | 1 | 17739.1436 | 14986.7562 | 7697.1418 | 3804.5585 | 3333.4754 | 3981.4799 |
| | 128/128 | 1 | 30094.9137 | 29341.284 | 16238.937 | 6914.2184 | 5491.7418 | 9127.5052 |
| | 128/2048 | 1 | 24671.5477 | 20941.6631 | 9708.1161 | 5303.4318 | 4402.3044 | 5357.3405 |
| | 128/4096 | 1 | 16454.0833 | 12780.3724 | 5800.4957 | 3235.0678 | 2825.7896 | 2879.9833 |
| | 20000/2000 | 1 | 1676.0415 | 1317.9654 | 569.7589 | 324.5936 | 281.4751 | 286.353 |
| | 2048/128 | 1 | 3649.1462 | 3492.3042 | 1929.3126 | 800.9286 | 617.0932 | 1019.75 |
| | 2048/2048 | 1 | 11403.6968 | 8974.7383 | 4367.8733 | 2331.8112 | 1988.3496 | 2184.3861 |
| | 500/2000 | 1 | 20819.4592 | 15992.3357 | 7947.4257 | 4189.395 | 3603.4489 | 4286.3867 |
| | 5000/500 | 1 | 3840.0108 | 3340.7385 | 1707.2611 | 807.4561 | 722.8385 | 881.7336 |
| Mixtral 8x22B | 1000/1000 | 8 | 18557.43 | 16918.03 | 9759.888 | 4753.6273 | 2128.4403 | |
| | 128/128 | 8 | 25179.4765 | 23729.5293 | 16421.3182 | 6948.5923 | 2488.6297 | |
| | 128/2048 | 8 | 27492.4926 | 24556.7807 | 12303.4168 | 7246.7172 | 3540.0067 | |
| | 128/4096 | 8 | 19718.8648 | 17755.0018 | 7474.3817 | 4696.6123 | 2568.3114 | |
| | 20000/2000 | 8 | 2897.182 | 2189.606 | 1118.8294 | 594.8509 | 309.0799 | |
| | 2048/128 | 8 | 3093.8418 | 2917.1362 | 1994.0127 | 825.3934 | 294.7706 | |
| | 2048/2048 | 8 | 13795.9827 | 12487.6502 | 5857.8831 | 3377.8371 | 1694.6176 | |
| | 500/2000 | 8 | 24637.473 | 19997.3914 | 10637.6598 | 6007.619 | 2976.9633 | |
| | 5000/500 | 8 | 3889.2745 | 3578.4843 | 2211.2377 | 1028.3843 | 420.2156 | |
| Mixtral 8x7B | 1000/1000 | 2 | 18712.2046 | 15931.8663 | 6052.876 | 3276.6186 | 1907.8817 | |
| | | 4 | 32834.0923 | 28015.1981 | 15509.1538 | 7357.1613 | 4737.0179 | 5060.8399 |
| | | 8 | 44410.7533 | 40573.0499 | 27684.9381 | 13948.1533 | 4970.9287 | 5725.9638 |
| | 128/128 | 2 | 24970.5594 | 24321.9927 | 15334.2103 | 5915.3897 | 3810.1846 | |
| | | 4 | 42500.5855 | 40182.7271 | 27718.9857 | 11328.7486 | 6026.9206 | 6769.9441 |
| | | 8 | 54304.0436 | 51030.9048 | 40119.3268 | 17918.1146 | 5573.7682 | 6422.4308 |
| | 128/2048 | 2 | 29314.1475 | 20945.7816 | 7409.9253 | 4284.3035 | 2248.1815 | |
| | | 4 | 52680.8353 | 40668.5928 | 21293.1761 | 10929.0182 | 7353.7405 | 7506.7612 |
| | | 8 | 70409.1968 | 64529.9982 | 40839.3077 | 21058.2144 | 8866.251 | 9907.6896 |
| | 128/4096 | 2 | 21520.4385 | 12070.6724 | 3928.6678 | 2302.964 | 1171.966 | |
| | | 4 | 32550.5267 | 29120.2002 | 11678.0071 | 6538.1511 | 5176.9632 | 4958.7004 |
| | | 8 | 40373.4857 | 36357.7861 | 21628.821 | 13565.7778 | 7209.2336 | 8271.7938 |
| | 20000/2000 | 2 | 2204.1378 | 1659.5907 | 622.2717 | 321.9839 | 185.6671 | |
| | | 4 | 4047.7473 | 3290.9457 | 1602.0208 | 778.7285 | 572.4282 | 587.1759 |
| | | 8 | 6561.6849 | 5328.5261 | 3113.2047 | 1645.8114 | 750.5372 | 828.8471 |
| | 2048/128 | 2 | 2958.0873 | 2883.5166 | 1796.5451 | 687.7251 | 465.1585 | |
| | | 4 | 5229.8744 | 4972.6818 | 3354.994 | 1351.7191 | 728.4943 | 812.0143 |
| | | 8 | 7030.9766 | 6532.721 | 5025.3047 | 2248.6418 | 677.9886 | 771.3656 |
| | 2048/2048 | 2 | 13842.834 | 9334.0732 | 3503.0218 | 1997.1923 | 1060.8946 | |
| | | 4 | 22389.4914 | 20185.8212 | 9143.2741 | 4963.8758 | 3520.3659 | 3453.8076 |
| | | 8 | 28975.322 | 26176.9163 | 19291.8278 | 10552.9732 | 4590.187 | 4929.7228 |
| | 500/2000 | 2 | 23459.0411 | 18185.6392 | 6023.3308 | 3438.6964 | 1817.11 | |
| | | 4 | 39971.0236 | 31693.8787 | 17087.037 | 8930.3495 | 6117.5624 | 6434.9178 |
| | | 8 | 60721.462 | 48842.8084 | 31358.2791 | 17034.706 | 7118.0767 | 8130.8026 |
| | 5000/500 | 2 | 3742.5293 | 3563.8228 | 1648.9041 | 733.1921 | 448.6716 | |
| | | 4 | 6602.3877 | 6020.6267 | 3543.6819 | 1603.8223 | 948.0567 | 1047.3212 |
| | | 8 | 8862.8164 | 8214.9445 | 5968.7734 | 2813.1531 | 969.817 | 1098.3081 |

TP stands for Tensor Parallelism

## Reproducing Benchmarked Results

> [!NOTE]
> The only models supported in this workflow are those listed in the table above.

The following tables are references for commands that are used as part of the benchmarking process. For a more detailed description of this benchmarking workflow, see the benchmarking suite documentation.

### Commands

| Stage | Description | Command |
|:------|:------------|:--------|
| Dataset | Create a synthetic dataset | `python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist --num-requests=$num_requests --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file` |
| Build | Build a TensorRT-LLM engine | `trtllm-bench --model $model_name build --tp_size $tp_size --quantization FP8 --dataset $dataset_file` |
| Run | Run a benchmark with a dataset | `trtllm-bench --model $model_name throughput --dataset $dataset_file --engine_dir $engine_dir` |

### Variables

| Name | Description |
|:-----|:------------|
| `$isl` | Benchmark input sequence length. |
| `$osl` | Benchmark output sequence length. |
| `$tp_size` | Number of GPUs to run the benchmark with. |
| `$engine_dir` | Location to store the built engine file (can be deleted after running benchmarks). |
| `$model_name` | Hugging Face model name, e.g. `meta-llama/Llama-2-7b-hf`, or the path to a local weights directory. |
| `$dataset_file` | Location of the dataset file generated by `prepare_dataset.py`. |
| `$num_requests` | The number of requests to generate for dataset generation. |
| `$seq_len` | A sequence length equal to ISL + OSL. |
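
Putting the stages together, a minimal end-to-end sketch (the variable values below are illustrative; the commands themselves are the ones listed in the Commands table):

```shell
# Illustrative values; substitute any model and ISL/OSL combination from this document.
model_name="meta-llama/Llama-2-7b-hf"
isl=128
osl=128
num_requests=30000
tp_size=1
dataset_file=/tmp/synthetic_${isl}_${osl}.txt

# Stage 1: create a synthetic dataset.
python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout \
    token-norm-dist --num-requests=$num_requests \
    --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file

# Stage 2: build an FP8-quantized engine; note the "ENGINE SAVED" path it prints.
trtllm-bench --model $model_name build --tp_size $tp_size --quantization FP8 --dataset $dataset_file

# Stage 3: run the throughput benchmark against the built engine.
engine_dir=/tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1   # path printed by the build stage
trtllm-bench --model $model_name throughput --dataset $dataset_file --engine_dir $engine_dir
```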

### Preparing a Dataset

To prepare a dataset, use the provided script. To generate a synthetic dataset, run the following command:

```shell
python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout token-norm-dist --num-requests=$num_requests --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file
```

The command writes a text file to the path specified by `$dataset_file` in which all requests share the same input/output sequence length combination. The script uses the tokenizer to retrieve the vocabulary size and randomly samples token IDs from it to create entirely random sequences. In the command above, all requests will be uniform because the standard deviations for both input and output sequence lengths are set to 0.

For each input and output sequence length combination, the table below details the `$num_requests` value that was used. For shorter input and output lengths, a larger number of requests was used to guarantee that the system reached a steady state, because requests enter and exit the system at a much faster rate. For longer input/output sequence lengths, requests remain in the system longer and therefore fewer requests are needed to reach steady state. A scripted sketch that generates a dataset for each of these combinations follows the table.

| Input Length | Output Length | `$seq_len` | `$num_requests` |
|:-------------|:--------------|:-----------|:----------------|
| 128 | 128 | 256 | 30000 |
| 128 | 2048 | 2176 | 3000 |
| 128 | 4096 | 4224 | 1500 |
| 2048 | 128 | 2176 | 3000 |
| 2048 | 2048 | 4096 | 1500 |
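
A small sketch that loops over these combinations and generates one dataset per row (the dataset file naming is illustrative; the flags are the same as in the command above):

```shell
# ISL, OSL, and request count for each combination in the table above.
while read -r isl osl num_requests; do
    dataset_file=dataset_${isl}_${osl}.txt   # illustrative naming
    python benchmarks/cpp/prepare_dataset.py --tokenizer=$model_name --stdout \
        token-norm-dist --num-requests=$num_requests \
        --input-mean=$isl --output-mean=$osl --input-stdev=0 --output-stdev=0 > $dataset_file
done <<EOF
128 128 30000
128 2048 3000
128 4096 1500
2048 128 3000
2048 2048 1500
EOF
```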

### Engine Building

All engines are built using the `trtllm-bench build` sub-command. The basic command for FP8-quantized engines is as follows:

```shell
trtllm-bench --model $model_name build --tp_size $tp_size --quantization FP8 --dataset $dataset_file
```

or, if you would like to build for a specific sequence length:

```shell
trtllm-bench --model $model_name build --tp_size $tp_size --quantization FP8 --max_seq_length $seq_len
```

If you would like to build an FP16 engine without any quantization, simply remove the `--quantization FP8` option.
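
For example, the FP16 counterpart of the FP8 build command above is simply:

```shell
trtllm-bench --model $model_name build --tp_size $tp_size --dataset $dataset_file
```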

> [!NOTE]
> If you specify FP8 quantization, the KV cache will automatically be set to FP8 as well!

The `trtllm-bench build` sub-command will output the path where the engine is located upon a successful build. For example,

```
===========================================================
ENGINE SAVED: /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
===========================================================
```

### Running the Benchmark

To run the benchmark with the generated dataset, use the `trtllm-bench throughput` sub-command. The benchmarker runs an offline maximum-throughput scenario in which all requests are queued in rapid succession. You simply need to provide the path to the engine from the build phase and the generated dataset.

```shell
trtllm-bench --model $model_name throughput --dataset $dataset_file --engine_dir $engine_dir
```

The results will be printed to the terminal upon benchmark completion. For example,

```
===========================================================
= ENGINE DETAILS
===========================================================
Model:                  meta-llama/Llama-2-7b-hf
Engine Directory:       /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
TensorRT-LLM Version:   0.12.0
Dtype:                  float16
KV Cache Dtype:         FP8
Quantization:           FP8
Max Input Length:       2048
Max Sequence Length:    4098

===========================================================
= WORLD + RUNTIME INFORMATION
===========================================================
TP Size:                1
PP Size:                1
Max Runtime Batch Size: 4096
Max Runtime Tokens:     8192
Scheduling Policy:      Guaranteed No Evict
KV Memory Percentage:   99.0%
Issue Rate (req/sec):   3.680275266452667e+18
===========================================================
= STATISTICS
===========================================================
Number of requests:             3000
Average Input Length (tokens):  128.0
Average Output Length (tokens): 128.0
Token Throughput (tokens/sec):  23405.927228471104
Request Throughput (req/sec):   182.8588064724305
Total Latency (seconds):        16.406100739
===========================================================
```

> [!WARNING]
> In some cases, the benchmarker may not print anything at all. This behavior usually means that the benchmark has hit an out-of-memory issue. Try reducing the KV cache memory usage by lowering the value passed to the `--kv_cache_free_gpu_mem_fraction` option.
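
A sketch of re-running the benchmark with a reduced KV cache fraction (0.80 is an illustrative value; lower it further if the benchmark still runs out of memory):

```shell
trtllm-bench --model $model_name throughput --dataset $dataset_file \
    --engine_dir $engine_dir --kv_cache_free_gpu_mem_fraction 0.80
```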