New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget#

XQA kernel provides optimization for MQA and GQA during generation phase. It also provides optimization for beam search. Using tensor cores for acceleration, reducing data loading and conversion, it delivers increased throughput within the same latency budget. Increased throughput allows serving greater number of user requests while providing the same experience.

Support matrix and usage flags are described in docs/source/advanced/gpt_attention.

Increased Throughput: Looking at the Throughput-Latency curves below, we see that the enabling of XQA optimization increases throughput. Higher throughput equates to serving more users, and we can see that TPOT on the Y-axis flattens out when XQA gets enabled.

XQA increased throughput within same latency budget

_{Preliminary measured Performance, subject to change. TPOT lower is better. FP8, 8xH100 GPUs, Single Engine, ISL/OSL: 512/2048, BS: 1 - 256, TensorRT-LLM v0.8a}

Llama-70B on H200 up to 2.4x increased throughput with XQA within same latency budget#

H200 2.4x with XQA

Model	GPUs	Input Length	Output Length	Throughput w/o XQA (tok/s/GPU)	Throughput w/ XQA (tok/s/GPU)	Speedup
Llama-70B	1	128	2048	1,227	2,941	2.4x
	8	128	2048	13,232	25,300	1.9x

Closing#

These improvements will be published in the main branch soon, and will be included in the v0.8 releases.

For more information about H200, please see the H200 announcement blog.

Throughput is calculated as output tokens per second per gpu. out_tps=output_seqlen*batch_size/total_latency/tp

_{Glossary:
| DP = Data Parallel
ISL = Input Sequence Length
| PP = Pipeline Parallel
| OSL = Output Sequence Length
| OOM = Out of Memory
| TP = Tensor Parallel}