Pushing Latency Boundaries: Optimizing DeepSeek-R1 Performance on NVIDIA B200 GPUs#
by NVIDIA TensorRT-LLM team
Background#
Recent advancements in Large Language Reasoning Models have demonstrated remarkable success, while creating new deployment challenges. A critical challenge arises from extended Output Sequence Lengths (OSL) caused by complex "thinking and reasoning" processes. Longer OSL demands stricter Token-to-Token Latency (TTL) requirements, often forcing concurrency limitations. The most extreme case, single concurrency (the min-latency scenario), is particularly challenging for real-time applications.
This article explores how TensorRT-LLM achieves record-breaking performance for DeepSeek-R1 in min-latency scenarios on NVIDIA's 8×B200 GPU configuration, progressing from 67 tokens per second (TPS) to 253 TPS before GTC 2025 (a 3.7x speed-up), and to the current 368 TPS (a 5.5x speed-up).
Implementation Configuration#
Workload Profile#
Input Sequence Length (ISL): 1k tokens
Output Sequence Length (OSL): 2k tokens
Model Architecture#
The base DeepSeek-R1 model contains 3 initial dense layers and 58 MoE layers, plus 1 Multi-Token Prediction (MTP) layer (architecturally equivalent to a MoE layer) for speculative decoding. Our optimized configuration extends this to 3 MTP layers used in an autoregressive fashion for peak-performance exploration.

Precision Strategy#
We have explored a mixed-precision recipe, which provides a good tradeoff between accuracy and performance.
| Component | Precision |
|---|---|
| 64x Attention Modules | bf16* |
| 3x Dense FFN Layers | nvfp4** |
| 58x MoE FFN Layers | nvfp4 |
| 3x MTP Layers | bf16 |
| RouterGEMM*** | bf16 |
* TensorRT-LLM already supports FP8 attention, but low-precision attention computation does not improve performance in this latency scenario, so we use bf16 precision for the attention modules.
** The nvfp4 model checkpoint is generated by the NVIDIA TensorRT Model Optimizer toolkit.
*** RouterGEMM uses bf16 inputs/weights with fp32 outputs for numerical stability.
Parallelism Strategy#
We have also explored and introduced a mixed parallelism strategy on 8×B200 GPUs. The best strategy for this latency scenario is 'TP8EP2', defined as follows (a small sketch of the resulting per-GPU layout follows the table):
| Component | Parallelism Pattern |
|---|---|
| Attention Modules | Tensor Parallelism 8 (TP8) |
| MoE Sparse Experts | Mixed TP4 with Expert Parallelism 2 (EP2) |
| MoE Shared Experts | TP8 |
| Fuse_A GEMM | Data Parallelism 8 (DP8) |
| RouterGEMM | DP8 |
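To make the TP8EP2 layout concrete, below is a minimal sketch of the resulting per-GPU assignment. The head and expert counts assume DeepSeek-R1's architecture (128 attention heads, 256 routed experts), and the rank-to-group mapping is hypothetical; the actual assignment is handled inside TensorRT-LLM.

```python
# Hypothetical illustration of the TP8EP2 layout on 8 GPUs
# (assumes DeepSeek-R1: 128 attention heads, 256 routed experts).
NUM_GPUS, NUM_HEADS, NUM_EXPERTS = 8, 128, 256

# Attention: pure TP8 -> each GPU owns 128 / 8 = 16 query heads.
heads_per_gpu = NUM_HEADS // NUM_GPUS

# MoE sparse experts: EP2 x TP4 -> two expert-parallel groups of 4 GPUs each.
# Each group holds half of the routed experts, and every expert inside a group
# is sharded 4 ways across that group's GPUs.
EP, TP = 2, 4
experts_per_group = NUM_EXPERTS // EP  # 128 experts per 4-GPU group

for gpu in range(NUM_GPUS):
    ep_group, tp_rank = gpu // TP, gpu % TP   # hypothetical rank mapping
    first = ep_group * experts_per_group
    last = first + experts_per_group - 1
    print(f"GPU{gpu}: {heads_per_gpu} attention heads, "
          f"experts {first}-{last} (TP shard {tp_rank} of {TP})")
```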
Everything in One Diagram#
Now let’s put everything into one diagram, which represents a MoE layer from a decoding iteration.

The modules in the diagram are described below; an illustrative shape walk-through follows the list.
Input Module: A BF16 tensor with shape [m, 7168], where m is the number of tokens (for instance, m = 4 when using three MTP layers), and 7168 is the model’s hidden size.
Module 1 (Fuse_A_GEMM): Concatenates the WDQ, WDKV, and WKR weights to reduce kernel launch overhead.
Module 2 (2× RMSNorm): Performs normalization of the Q/K tensors. The two norms can either be overlapped on multiple streams or fused into a single grouped RMSNorm.
Module 3 (UQ_QR_GEMM): Concatenates the WUQ and WQR weights to reduce kernel launch overhead.
Module 4 (UK_BGEMM): Uses WUK in a batched GEMM. We avoid absorbing Modules 3 and 4 to prevent weight-size inflation and extra loading costs.
Module 5 (Concat KVCache & applyRope): Merges the K/V cache and applies RoPE (Rotary Position Embedding).
Module 6 (genAttention): Performs MLA during generation, acting like an MQA with num_q_heads = 128 / TP8 = 16.
Module 7 (UV_GEMM): Executes a batched GEMM with the WUV weights.
Module 8 (WO_GEMM): Runs a dense GEMM using the WO weights. We do not absorb Modules 7 and 8, to avoid increased weight-loading overhead.
Module 9 (Fused Kernels): Combines oneshotAllReduce, Add_RMSNorm, and DynamicQuant (BF16->NVFP4) in a single kernel.
Module 10 (routerGEMM & topK): Handles the router GEMM and top-K selection.
Module 11 (Shared Expert): Overlaps partially with Modules 10 and 12.
Module 12 (Sparse Experts): Implements the expert layers via grouped GEMM.
Module 13 (Final Fused Kernels): Performs localReduction, oneshotAllReduce, and Add_RMSNorm together.
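For readers who prefer shapes to prose, here is an illustrative per-GPU shape walk-through of the module list above. The dimensions are assumptions taken from the public DeepSeek-V3/R1 configuration (hidden size 7168, q_lora_rank 1536, kv_lora_rank 512, RoPE dim 64, per-head nope/value dims 128, 128 heads, 256 routed experts); treat it as a sketch, not the exact TensorRT-LLM kernel graph.

```python
# Per-GPU shapes under TP8EP2 (16 of the 128 query heads live on each GPU).
m = 4                                  # tokens per iteration with 3 MTP layers (1 + 3 draft)
H = 16                                 # query heads per GPU under TP8

x            = (m, 7168)               # Input Module
fuse_a_out   = (m, 1536 + 512 + 64)    # Module 1: q latent + kv latent + k_rope
q_per_head   = (m, H, 128 + 64)        # Module 3: UQ_QR_GEMM (nope + rope parts)
q_latent     = (m, H, 512)             # Module 4: UK_BGEMM maps the nope part into latent space
kv_entry     = (1, 512 + 64)           # Module 5: compressed KV-cache entry after RoPE
attn_out     = (m, H, 512)             # Module 6: genAttention (MQA-like over the latent cache)
v_per_head   = (m, H, 128)             # Module 7: UV_GEMM
attn_final   = (m, 7168)               # Module 8: WO_GEMM (+ Module 9 fused AR/norm/quant)
router_logit = (m, 256)                # Module 10: RouterGEMM over the routed experts
moe_out      = (m, 7168)               # Modules 11-13: shared + sparse experts, fused reduction
print(x, fuse_a_out, attn_out, moe_out)
```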
Key Optimizations#
| Feature | TPS/User | Code Links / Notes |
|---|---|---|
| Baseline: CUDA Graph + EP8TP8 | 67 | |
| Multi-stream to overlap shared expert with sparse experts | 73 | |
| Optimize MLA kernel | 80 | |
| Optimize TopK kernels | 84 | |
| Optimize Fuse_A_GEMM | 89 | |
| MTP3_Vanilla | 154 | Later evolved to MTP3_Autoregressive |
| Evolve to MTP3_Autoregressive + optimize Router GEMM | 164 | |
| Fuse oneshotAR + RMSNorm | 168 | |
| Enable PDL | 173 | Set environment variable: |
| Multi-stream to overlap two RMSNorms | 180 | |
| MTP3_Autoregressive | 204 | |
| Fine-tune clock/power | 211 | |
| Optimize CUTLASS grouped GEMM kernels | 236 | Not open-source yet because it depends on an internal base environment; we plan to decouple it so it can be open-sourced in the future. |
| Optimize CUTLASS flow: Sparse Experts as GEMMs | 249 | Not open-source yet because it depends on an internal base environment; we plan to decouple it so it can be open-sourced in the future. |
| Introduce EP4TP2 for better workload balance | 253 | Use |
| Introduce moe_backend=TRTLLM, EP2TP4 for better balance | 299 | |
| Optimize Fuse_A_GEMM and Router_GEMM | 340 | WIP |
| Relax Acceptance | 368 | |
System Level optimizations#
CUDA Graph & Programmatic Dependent Launch#
CUDA Graph is necessary to overcome CPU overhead for small workloads, while Programmatic Dependent Launch (PDL) can be used to further reduce kernel launch latency.
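As a concrete illustration of why graph replay helps at low concurrency, here is a minimal PyTorch sketch of CUDA Graph capture and replay. PyTorch is used purely for illustration; the actual TensorRT-LLM runtime manages its own graphs, and PDL is enabled separately via an environment variable.

```python
import torch

# Stand-in for one decode step; at single concurrency its kernels are tiny, so
# CPU launch overhead dominates unless the whole step is captured into one graph.
model = torch.nn.Linear(7168, 7168, device="cuda", dtype=torch.bfloat16)
static_in = torch.zeros(4, 7168, device="cuda", dtype=torch.bfloat16)

# Warm up on a side stream, then capture the step into a CUDA graph.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_out = model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

# Per decoding iteration: refresh the static input buffer and replay the graph,
# which issues all captured kernels with a single CPU-side launch.
static_in.copy_(torch.randn_like(static_in))
g.replay()
torch.cuda.synchronize()
```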
MTP#
There are two optimizations based on MTP:
Autoregressive MTP Layers#
| Version | Acceptance Rate | TPS/User | TPS/User Speedup |
|---|---|---|---|
| Without MTP | 1.00 | 111 | 1.00 |
| MTP 1 | 1.92 | 198 | 1.78 |
| MTP 2 | 2.58 | 250 | 2.25 |
| MTP 3 | 2.82 | 253 | 2.28 |
| MTP 4 | 2.99 | 245 | 2.21 |
| MTP 5 | 3.01 | 239 | 2.15 |
Based on our exploration, the 3x MTP layers configuration delivers the best performance.
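The speedup is smaller than the acceptance rate because each iteration with more MTP layers is also more expensive. A quick back-of-the-envelope check on the table above (using the simplified model speedup = acceptance rate / relative iteration cost) makes the trade-off visible:

```python
# Implied per-iteration cost (relative to no MTP) = acceptance_rate / speedup,
# computed from the measurements in the table above.
configs = {          # name: (acceptance_rate, measured_speedup)
    "MTP 1": (1.92, 1.78),
    "MTP 2": (2.58, 2.25),
    "MTP 3": (2.82, 2.28),
    "MTP 4": (2.99, 2.21),
    "MTP 5": (3.01, 2.15),
}
for name, (acc, speedup) in configs.items():
    print(f"{name}: ~{acc / speedup:.2f}x iteration cost for {acc:.2f} tokens/iteration")
# Beyond 3 MTP layers, the iteration cost grows faster than the acceptance rate,
# which is why MTP 3 is the sweet spot here.
```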
Relax Acceptance Verification#
For a reasoning model such as DeepSeek-R1, generation may consist of two phases: a thinking phase and the actual output. During the thinking phase, when relaxed acceptance is enabled, a draft token is accepted if it falls within a candidate set. The candidate set is built from the logits using a topN and a probability threshold:
topN: the topN tokens are sampled from the logits.
Probability threshold: among the topN candidates, only tokens with a probability greater than the top-1 probability minus delta remain in the candidate set.
During the non-thinking phase, we still use strict acceptance.
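A minimal sketch of this check is shown below, assuming per-position logits from the target model; the standalone function and variable names are illustrative, not the TensorRT-LLM implementation.

```python
import torch

def relaxed_accept(logits: torch.Tensor, draft_token: int,
                   top_n: int = 10, delta: float = 0.6) -> bool:
    """Accept a draft token if it falls inside the relaxed candidate set."""
    probs = torch.softmax(logits, dim=-1)
    top_probs, top_ids = probs.topk(top_n)
    # Keep only candidates whose probability is within `delta` of the top-1.
    keep = top_probs > (top_probs[0] - delta)
    candidate_set = set(top_ids[keep].tolist())
    return draft_token in candidate_set

# Example with random logits over an (assumed) 129,280-token vocabulary.
logits = torch.randn(129280)
print(relaxed_accept(logits, draft_token=42, top_n=10, delta=0.6))
```

The table below shows how different topN/delta settings affect the acceptance rate and speedup.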
| Version | Acceptance Rate | TPS/User Speedup |
|---|---|---|
| MTP3_top1, d0.0 | 2.82 | 1.00 |
| MTP3_top10, d0.5 | 3.06 | 1.08 |
| MTP3_top10, d0.6 | 3.10 | 1.09 |
| MTP3_top15, d0.5 | 3.07 | 1.08 |
This relaxed verification improves the acceptance rate and delivers a positive speedup with limited impact on accuracy:
| Dataset | Test Size | Accuracy w/o Relaxed Acceptance | Accuracy w/ Relaxed Acceptance |
|---|---|---|---|
| MMLU-Pro | 12,032 | 84.0% | 81.2% |
| Humanity's Last Exam | 2,684 | 9.0% | 9.0% |
| GPQA Diamond | 198 | 71.0% | 69.2% |
| MATH-500 | 500 | 96.0% | 96.2% |
| AIME 2024 | 30 | 68.0% | 74.0% |
| SciCode | 338 | 36.0% | 39.0% |
| LiveCodeBench | 315 | 62.0% | 66.0% |
For more information, please visit multi-token-prediction-mtp
Multi-streams#
We have introduced multi-stream optimizations to hide the overhead of some kernels, for example (a conceptual sketch follows the list):
Overlap shared experts with sparse experts
Overlap Concat_KVCache kernel with GEMM
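A conceptual PyTorch sketch of the first overlap is shown below; the real implementation lives inside the TensorRT-LLM runtime, and the two Linear layers simply stand in for the shared-expert GEMM and the sparse-expert grouped GEMM.

```python
import torch

main = torch.cuda.current_stream()
side = torch.cuda.Stream()

x = torch.randn(4, 7168, device="cuda", dtype=torch.bfloat16)
shared_expert = torch.nn.Linear(7168, 7168, device="cuda", dtype=torch.bfloat16)
sparse_expert = torch.nn.Linear(7168, 7168, device="cuda", dtype=torch.bfloat16)

side.wait_stream(main)                 # side stream must see x before using it
with torch.cuda.stream(side):
    shared_out = shared_expert(x)      # shared-expert work on the side stream
sparse_out = sparse_expert(x)          # sparse-expert work on the main stream
main.wait_stream(side)                 # join before combining the two results
out = shared_out + sparse_out
torch.cuda.synchronize()
```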
Sparse Experts as GEMMs (only works when moe_backend=CUTLASS)#

The existing CUTLASS-based Sparse Experts flow (illustrated in the figure) dispatches input tokens to their designated experts, then applies indexed local reduction on each expert’s outputs before a global allreduce. Both dispatching and indexed local reduction incur high overhead in low-latency scenarios. To address this, we propose treating “Sparse Experts as GEMMs” by sending all tokens to each activated expert and masking out unneeded outputs before local reduction. Because grouped GEMMs are memory-bound, the extra computations from redundant tokens have minimal impact, effectively eliminating the costly dispatch and reduction overhead.
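The following PyTorch sketch shows just the masked variant of this idea; the shapes and the binary routing mask are assumptions (the real kernels are CUTLASS grouped GEMMs, and the mask also carries the router weights).

```python
import torch

m, hidden = 4, 7168                    # tokens per iteration, hidden size
n_active, top_k = 16, 8                # activated experts this step, experts per token
x = torch.randn(m, hidden, device="cuda", dtype=torch.bfloat16)
experts = [torch.nn.Linear(hidden, hidden, device="cuda", dtype=torch.bfloat16)
           for _ in range(n_active)]

# routing_mask[e, t] = 1 if token t was routed to activated expert e.
routing_mask = (torch.rand(n_active, m, device="cuda") < top_k / n_active)
routing_mask = routing_mask.to(torch.bfloat16)

# Every activated expert processes all m tokens (no dispatch); unwanted outputs
# are masked out before the local reduction. Because the grouped GEMM is
# memory-bound at this size, the redundant math is nearly free.
outs = torch.stack([expert(x) for expert in experts])        # [n_active, m, hidden]
out = (outs * routing_mask.unsqueeze(-1)).sum(dim=0)         # local reduction -> [m, hidden]
```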
Re-balancing the sparse experts#
For sparse experts, two parallelization strategies are commonly used: Expert Parallelism (EP) and Tensor Parallelism (TP). Expert Parallelism maps each expert to a distinct GPU, achieving high memory and computational efficiency. However, token placement is data-dependent, which distributes workloads unevenly across GPUs and exposes overhead in the AllReduce step after the MoE module. Tensor Parallelism shards each expert evenly across GPUs, creating a balanced workload but sacrificing math/memory efficiency.
Mixed ETP#
A combined EP/TP approach can mitigate both challenges. In practice, our experiments show that a configuration of TP4EP2 offers the best performance.
Smart Router#
Alternatively, by storing all expert weights on a cluster of four GPUs and replicating them to another four-GPU cluster, a smart router can dynamically dispatch tokens across the two clusters. This design keeps the workload balanced without significantly impacting local memory and computation efficiency.
Kernel Level optimizations#
Attention Kernel#
We have developed a customized MLA attention kernel to better utilize GPU resources for latency scenarios.
Grouped GEMM#
CUTLASS Backend (default backend)#
Our default MoE backend is based on CUTLASS, which is flexible and robust but may not deliver the best performance.
TRTLLM Backend#
The other MoE backend is TRTLLM, which provides better performance. We are working to make it more flexible and robust, and it will eventually become the default backend for grouped GEMM computation in latency scenarios.
Communication Kernel#
For small message sizes, regular NCCL AllReduce kernels are latency-bound and inefficient, so we developed a customized one-shot AllReduce kernel. It leverages the powerful NVSwitch hardware capability by acting like an initial broadcast followed by a local reduction, delivering better performance in min-latency scenarios.
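Conceptually, the one-shot scheme replaces NCCL's multi-phase ring/tree exchange with a single broadcast-then-reduce step. The plain-Python model below (with an assumed rank count and message size, no NCCL involved) captures the idea:

```python
import numpy as np

world_size, msg_elems = 8, 7168
local = [np.random.randn(msg_elems).astype(np.float32) for _ in range(world_size)]

# Step 1: every rank pushes its small buffer to every peer over NVSwitch, so
# each rank ends up holding all ranks' buffers ("one shot" of communication).
mailboxes = [[local[src] for src in range(world_size)] for _ in range(world_size)]

# Step 2: each rank reduces its copies locally.
reduced = [np.sum(box, axis=0) for box in mailboxes]
assert all(np.allclose(r, reduced[0]) for r in reduced)
```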
Dense GEMM optimization#
We focus on optimizing two kinds of dense GEMMs: Fuse_A_GEMM and RouterGEMM, because they dominate the execution time, suffer from low memory efficiency, and cannot be easily sharded (they are DP-based).
Fuse_A_GEMM#
We developed a custom Fuse_A_GEMM that prefetches the majority of its weights into shared memory (enabled by PDL and overlapped with the oneshot AllReduce), significantly enhancing performance. The kernel shows substantial improvements over the default GEMM implementation when num_tokens < 16.

RouterGEMM#
By leveraging our internal AI code generator, we automatically generate an optimized RouterGEMM kernel, which delivers substantial improvements over the default GEMM implementation when num_tokens <=30.

Kernel fusion#
Kernel fusion is necessary in the min-latency scenario to reduce extra global-memory read/write cost. We currently support the following fusion patterns (a reference sketch of the fused Add + RMSNorm semantics follows the list):
Fuse two overlapped RMSNorms into one GroupedRMSNorm
Fuse (LocalReduction) + AllReduce + RMSNorm + (DynamicQuant, BF16->NVFP4) into one kernel
Fuse Grouped GEMM_FC1 + dot activation (when moe_backend=TRTLLM) into one kernel
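As a reference for the Add + RMSNorm piece of the second pattern, here is its unfused semantics in PyTorch (epsilon handling and dtypes are assumptions). The point of fusing is that the intermediate sum never makes an extra round trip through global memory.

```python
import torch

def add_rmsnorm(residual: torch.Tensor, x: torch.Tensor,
                weight: torch.Tensor, eps: float = 1e-6):
    h = residual + x                                        # residual add
    rms = torch.rsqrt(h.float().pow(2).mean(-1, keepdim=True) + eps)
    y = (h.float() * rms).to(x.dtype) * weight              # RMSNorm
    return y, h                                             # normalized output, new residual

x = torch.randn(4, 7168, dtype=torch.bfloat16)
residual = torch.randn_like(x)
weight = torch.ones(7168, dtype=torch.bfloat16)
y, new_residual = add_rmsnorm(residual, x, weight)
```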
How to reproduce#
https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/blogs/Best_perf_practice_on_DeepSeek-R1_in_TensorRT-LLM.md#b200-min-latency
Note that Relaxed Acceptance is specific to the DeepSeek-R1 model. To enable it, set add_generation_prompt = True when preparing the benchmark dataset, for example:
input_ids = tokenizer.encode(tokenizer.apply_chat_template(msg, tokenize=False, add_generation_prompt=True), add_special_tokens=False)
You also need to set use_relaxed_acceptance_for_thinking: true, relaxed_topk: 10, and relaxed_delta: 0.6 in speculative_config.
Future Works#
More Fusions
More Overlap
More optimization of Attention Kernel
More Exploration of MTP
Acknowledgment#
Pushing the performance boundaries of DeepSeek R1 for latency-sensitive applications has been a remarkable engineering journey. The optimizations detailed in this post represent an exceptional cross-functional collaboration across the entire AI technology stack - spanning kernel-level optimizations, runtime enhancements, model quantization techniques, algorithmic improvements, and systematic performance analysis and tuning. While we can’t individually acknowledge every contributor, we’re proud to recognize the dedicated team of engineers whose collective expertise has helped advance the state-of-the-art in TensorRT-LLM performance engineering.
Through this collaborative endeavor, we’ve developed valuable insights into maximizing GPU utilization for large language model inference. We hope that the techniques and best practices shared in this blog will empower the developer community to better leverage NVIDIA GPU capabilities in their mission-critical LLM inference applications.