Helix Parallelism#
Helix is a context parallelism (CP) technique for the decode/generation phase of LLM inference. Unlike traditional attention-FFN disaggregation (AFD) techniques, which spatially separate attention and FFN blocks onto different GPUs, Helix temporally separates them by reconfiguring the same GPUs.
For full details, see the original paper: Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding
How Helix Works#
In Helix parallelism:
KV cache distribution: The KV cache is partitioned across CP ranks during generation, with each rank responsible for a portion of the cached context
Attention computation: Each rank computes partial attention over its local KV cache shard
Attention postprocessing: Partial attention outputs are combined and corrected across ranks to produce the final attention output (see the sketch after this list)
FFN layers: CP ranks are repurposed as tensor parallelism (TP) ranks for FFN/MoE layers, maximizing GPU utilization
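The combine/correct step can be illustrated with a small, self-contained sketch (not the actual TensorRT-LLM kernel): each rank attends over its local KV shard and reports its partial output together with the log-sum-exp of its local attention logits, and the partial outputs are then rescaled and summed. All names below are illustrative.

```python
# Illustrative sketch of Helix-style partial-attention merging (not TensorRT-LLM code).
# Each CP rank attends over its local KV-cache shard and returns:
#   - its partial attention output (softmax-weighted sum of its local values)
#   - the log-sum-exp (lse) of its local attention logits
# The final output is a rescaled sum of the partial outputs.
import numpy as np


def attend(q, k, v):
    """Attention over one KV shard; returns the partial output and its log-sum-exp."""
    logits = q @ k.T / np.sqrt(q.shape[-1])            # [1, shard_len]
    m = logits.max(axis=-1, keepdims=True)
    w = np.exp(logits - m)
    out = (w @ v) / w.sum(axis=-1, keepdims=True)      # partial attention output
    lse = m + np.log(w.sum(axis=-1, keepdims=True))    # log-sum-exp of this shard
    return out, lse


rng = np.random.default_rng(0)
q = rng.normal(size=(1, 64))          # one decode-step query
k = rng.normal(size=(1024, 64))       # cached keys for the full context
v = rng.normal(size=(1024, 64))       # cached values for the full context

# Reference: attention over the full context on a single "GPU".
ref, _ = attend(q, k, v)

# Helix-style: shard the KV cache across 4 CP ranks, attend locally, then merge.
parts = [attend(q, ks, vs) for ks, vs in zip(np.split(k, 4), np.split(v, 4))]
lses = np.concatenate([lse for _, lse in parts], axis=-1)    # [1, num_ranks]
scale = np.exp(lses - lses.max(axis=-1, keepdims=True))
scale /= scale.sum(axis=-1, keepdims=True)                   # per-rank softmax weights
merged = sum(s * out for (out, _), s in zip(parts, scale.T[:, :, None]))

assert np.allclose(ref, merged)   # matches single-GPU attention
```

The final assertion holds because rescaling each shard's partial output by its log-sum-exp reproduces the softmax normalization over the full context, so the merged result equals single-GPU attention up to floating-point rounding.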
When to Use Helix#
Helix parallelism provides performance benefits when all of the following conditions apply:
Disaggregated serving: Helix is designed for generation servers in a disaggregated (prefill/decode split) deployment architecture
Long input sequences: Performance gains typically appear at input sequence lengths of 64K tokens or more
Low batch sizes: Optimal for latency-sensitive workloads with high tokens/second/user requirements
On a typical latency vs. throughput Pareto curve, Helix targets operating points toward the right side (low latency, high per-user throughput).
Supported Models#
Helix parallelism currently supports models using Multi-head Latent Attention (MLA) on the Blackwell GPU architecture:
DeepSeek-V3 / DeepSeek-V3-Lite
Configuration#
Configuration Parameters#
Set the following parameters for the generation servers in disaggregated mode. An example can be seen in the end-to-end accuracy test described below.
| Parameter | Description | Required |
|---|---|---|
| `context_parallel_size` | Number of GPUs for context parallelism (≥2 for Helix) | Yes |
| `cp_config.cp_type` | Must be `HELIX` | Yes |
| `cp_config.tokens_per_block` | Tokens per KV cache block | Yes |
| `kv_cache_config.tokens_per_block` | Must match `cp_config.tokens_per_block` | Yes |
JSON Configuration (for YAML/JSON configs)#
{
"context_parallel_size": 2,
"cp_config": {
"cp_type": "HELIX",
"tokens_per_block": 32
},
"kv_cache_config": {
"tokens_per_block": 32
}
}
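As a quick sanity check of the constraints in the table above, the snippet below validates a config dictionary shaped like the JSON example. This is only an illustrative helper written for this page, not a TensorRT-LLM API.

```python
# Illustrative sanity check for a Helix generation-server config (not a TensorRT-LLM API).
import json


def check_helix_config(cfg: dict) -> None:
    cp_cfg = cfg.get("cp_config", {})
    assert cfg.get("context_parallel_size", 1) >= 2, \
        "Helix requires context_parallel_size >= 2"
    assert cp_cfg.get("cp_type") == "HELIX", \
        "cp_config.cp_type must be HELIX"
    assert cp_cfg.get("tokens_per_block") == \
        cfg.get("kv_cache_config", {}).get("tokens_per_block"), \
        "cp_config.tokens_per_block must match kv_cache_config.tokens_per_block"


check_helix_config(json.loads("""
{
  "context_parallel_size": 2,
  "cp_config": {"cp_type": "HELIX", "tokens_per_block": 32},
  "kv_cache_config": {"tokens_per_block": 32}
}
"""))
```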
Testing Helix with TensorRT-LLM#
Unit Test: MLA Module Correctness#
The simplest correctness test validates the MLA attention module with Helix enabled:
# Run the MLA Helix unit test
pytest tests/unittest/_torch/modules/test_mla_helix.py -v
This test verifies that attention outputs match between single-GPU and Helix-parallelized execution.
End-to-End Accuracy Test#
For end-to-end validation, the accuracy benchmark evaluates DeepSeek-V3-Lite in disaggregated mode on MMLU and GSM8K benchmarks:
Test location: tests/integration/defs/accuracy/test_disaggregated_serving.py
Test name: TestDeepSeekV3Lite::test_auto_dtype_with_helix
This test demonstrates proper disaggregated server configuration with Helix.
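Assuming a TensorRT-LLM source checkout with the required GPUs, the test can typically be selected with pytest's node-ID syntax, composed from the test location and name above:

```bash
# Run the Helix end-to-end accuracy test (node ID assumed from the location/name above)
pytest "tests/integration/defs/accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype_with_helix" -v
```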