Disaggregated Serving (Beta)#
Note: This feature is currently in beta, and the related APIs are subject to change in future versions.
Motivation#
LLM inference has two stages: context (prefill) and generation (decode) phases. The context phase computes KV cache for prompt tokens whereas the generation phase generates tokens one by one using cached values. These phases have different compute characteristics.
There are two ways of serving LLM inference requests:
Aggregated LLM serving (sometimes called in-flight batching or IFB in this document), in which the context and generation phases are run on the same GPU.
Disaggregated LLM serving, in which the context and generation phases are run on different GPUs.

Figure 1. The execution timeline of aggregated LLM serving
In aggregated LLM serving, both the context and generation phases share the same GPU resources and parallelism strategy. This can lead to interference where context processing delays token generation, increasing the time per output token (TPOT) and reducing interactivity. Figure 1 illustrates this with the execution timeline for aggregated LLM serving. Aggregated LLM serving also forces a single GPU type and parallelism configuration for both phases, even though their compute needs differ. As a result, optimizing for one metric, such as time to first token (TTFT), often comes at the expense of another, such as TPOT.

Figure 2. The execution timeline of disaggregated LLM serving
Disaggregated serving resolves these challenges by decoupling the two phases, allowing each to run on separate GPU pools and using different parallelism strategies. This separation removes the interference between context and generation phases, as shown in Figure 2, and enables independent optimization of TTFT and TPOT. Although disaggregation incurs overhead for transferring the KV cache blocks from context to generation GPUs, the advantages can be substantial—particularly for workloads with long input sequences and moderate output lengths where interference is most severe.
You can also refer to this paper for more details about the rationale and design considerations of disaggregated serving.
KV Cache Exchange#
Multi-backend Support#
In TensorRT-LLM, the KV cache exchange is modularly decoupled from the KV cache manager and the underlying communication libraries, as shown in Figure 3. The KV cache exchange module is responsible for efficient transmission and reception of the cache, promptly releasing cache space, and performing cache layout conversions during the exchange process. Currently, mainstream communication protocols—MPI, UCX, and NIXL—are all supported by TensorRT-LLM, and the underlying communication protocols utilize RDMA / NVLink. Currently, we recommend using UCX and NIXL backends, as we are adding a dynamic scaling mechanism on top of them—specifically, dynamic node joining and leaving. This allows customers to adjust the load based on traffic demands or switch roles between context and generation dynamically.

Figure 3. KV cache exchange architecture
Overlap Optimization#
To optimize the overall performance of disaggregated serving, TensorRT-LLM overlaps the KV cache transmission with computation for multiple independent requests. While one request is sending or receiving its KV cache blocks, other requests can proceed with computation, as illustrated in Figure 4. Furthermore, if context and generation instances are using multiple GPUs per instance, KV cache transmission between different sets of GPUs can occur in parallel.

Figure 4. KV cache exchange timing diagram
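As a rough illustration of why this overlap helps, the toy model below compares a fully serial schedule with a two-stage pipeline in which the computation of one request overlaps the KV cache transfer of the previous one. This is not TensorRT-LLM code: the function names, the single shared transfer link, and the fixed per-request timings are all simplifying assumptions.

```python
def serial_time(num_requests: int, compute_ms: float, transfer_ms: float) -> float:
    """Total time when each request computes and then transfers before the next starts."""
    return num_requests * (compute_ms + transfer_ms)


def overlapped_time(num_requests: int, compute_ms: float, transfer_ms: float) -> float:
    """Total time when compute for request i+1 overlaps the transfer of request i.

    Assumes one compute stage and one transfer link, each handling one
    request at a time (a hypothetical two-stage pipeline).
    """
    compute_done = 0.0   # time at which the compute stage becomes free
    transfer_done = 0.0  # time at which the transfer link becomes free
    for _ in range(num_requests):
        compute_done += compute_ms
        # A transfer starts once its own compute has finished and the link is free.
        transfer_done = max(compute_done, transfer_done) + transfer_ms
    return transfer_done
```

In this model, when the transfer is shorter than the computation, only the last request's transfer remains visible in the total time; when the transfer is longer, the link becomes the bottleneck.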
Cache Layout Transformation#
To minimize KV cache transmission latency, TensorRT-LLM currently uses direct transmission between device memories for cache transfer. The KV cache transmission supports using different parallel strategies for the context and generation phases. In such cases, careful orchestration of KV cache block mapping is required. Figure 5 illustrates this using the example of context phase with TP2 and generation phase with PP2.

Figure 5. KV cache layout conversion
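The block mapping for the TP2-to-PP2 case can be sketched as follows. This is a simplified model, not the actual TensorRT-LLM implementation: it assumes tensor parallelism shards the KV heads evenly across context ranks, pipeline parallelism shards the layers evenly across generation ranks, and the helper names are hypothetical.

```python
from typing import Dict, List, Tuple

Slice = Tuple[range, range]  # (layers, kv_heads)


def split_evenly(total: int, parts: int) -> List[range]:
    """Split range(total) into `parts` contiguous, equally sized ranges."""
    if total % parts != 0:
        raise ValueError("total must be divisible by parts")
    step = total // parts
    return [range(i * step, (i + 1) * step) for i in range(parts)]


def tp_to_pp_plan(
    num_layers: int, num_kv_heads: int, ctx_tp: int, gen_pp: int
) -> Dict[Tuple[int, int], Slice]:
    """Map (context_rank, generation_rank) to the (layers, heads) slice to send."""
    head_shards = split_evenly(num_kv_heads, ctx_tp)  # TP: each context rank owns some heads
    layer_shards = split_evenly(num_layers, gen_pp)   # PP: each generation rank owns some layers
    return {
        (ctx_rank, gen_rank): (layer_shards[gen_rank], head_shards[ctx_rank])
        for ctx_rank in range(ctx_tp)
        for gen_rank in range(gen_pp)
    }
```

For a hypothetical model with 4 layers and 8 KV heads, `tp_to_pp_plan(4, 8, 2, 2)` yields four transfers: each context rank holds all layers for half of the heads, so it sends one layer range to each generation rank.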
The optimizations required for KV cache transmission vary depending on whether the deployment is single-node multi-GPU, multi-node multi-GPU, or uses different GPU models. To accommodate this, TensorRT-LLM provides a set of environment variables for tuning in different environments. For details, refer to the Environment Variables section.
Usage#
trtllm-serve#
The first approach to do disaggregated LLM inference with TensorRT-LLM involves launching a separate OpenAI-compatible server per context and generation instance using trtllm-serve. An additional server, referred to as the “disaggregated” server, is also launched with trtllm-serve and acts as an orchestrator which receives client requests and dispatches them to the appropriate context and generation servers via OpenAI REST API. Figure 6 below illustrates the disaggregated serving workflow when using this approach. When a context instance is done generating the KV blocks associated with the prompt, it returns a response to the disaggregated server. This response includes the prompt tokens, the first generated token and metadata associated with the context request and context instance. This metadata is referred to as context parameters (ctx_params in Figure 6). These parameters are then used by the generation instances to establish communication with the context instance and retrieve the KV cache blocks associated with the request.

Figure 6. `trtllm-serve` integration with disaggregated service
To run TRT-LLM in disaggregated mode, you must first launch context (prefill) and generation (decode) servers using trtllm-serve.
We use the cache_transceiver_config configuration to set up disaggregated serving, which includes the following parameters:
cache_transceiver_config:
  backend: <str>
  max_tokens_in_buffer: <int>
backend specifies the communication backend for transferring the KV cache. Valid options are DEFAULT, UCX, NIXL, and MPI; the default backend is UCX.
max_tokens_in_buffer defines the buffer size for KV cache transfers. For optimal performance, set this value greater than or equal to the maximum ISL (Input Sequence Length) of all requests.
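The sizing recommendation above can be sketched as a small helper that derives max_tokens_in_buffer from the longest expected input and renders the YAML fragment. The helper name and the idea of generating the file from Python are assumptions made for illustration; only the two configuration keys and the backend names come from this document.

```python
from typing import Iterable

VALID_BACKENDS = ("DEFAULT", "UCX", "NIXL", "MPI")


def render_cache_transceiver_config(backend: str, input_lengths: Iterable[int]) -> str:
    """Render a cache_transceiver_config fragment sized for the longest input.

    `input_lengths` holds the expected input sequence lengths (in tokens);
    the buffer is set to the maximum so that it covers every request.
    """
    if backend not in VALID_BACKENDS:
        raise ValueError(f"backend must be one of {VALID_BACKENDS}")
    max_isl = max(input_lengths)
    return (
        "cache_transceiver_config:\n"
        f"  backend: {backend}\n"
        f"  max_tokens_in_buffer: {max_isl}\n"
    )
```

The returned string can be written to a file and passed to trtllm-serve through --extra_llm_api_options, in the same way as the hand-written files in the example that follows.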
For example, you could launch two context servers and one generation server as follows:
# Generate context_extra-llm-api-config.yml
# The overlap scheduler is disabled for context servers because it is not yet supported in disaggregated mode
echo -e "disable_overlap_scheduler: True\ncache_transceiver_config:\n backend: UCX\n max_tokens_in_buffer: 2048" > context_extra-llm-api-config.yml
# Start Context servers
CUDA_VISIBLE_DEVICES=0 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8001 --backend pytorch --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_0 &
CUDA_VISIBLE_DEVICES=1 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8002 --backend pytorch --extra_llm_api_options ./context_extra-llm-api-config.yml &> log_ctx_1 &
# Generate gen_extra-llm-api-config.yml
echo -e "cache_transceiver_config:\n backend: UCX\n max_tokens_in_buffer: 2048" > gen_extra-llm-api-config.yml
# Start Generation servers
CUDA_VISIBLE_DEVICES=2 trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8003 --backend pytorch --extra_llm_api_options ./gen_extra-llm-api-config.yml &> log_gen_0 &
Once the context and generation servers are launched, you can launch the disaggregated server, which will accept requests from clients and do the orchestration between context and generation servers. The disaggregated server can be launched with:
trtllm-serve disaggregated -c disagg_config.yaml
where disagg_config.yaml contains information about the context and generation servers. For the current example, it would look like:
hostname: localhost
port: 8000
backend: pytorch
context_servers:
  num_instances: 2
  urls:
      - "localhost:8001"
      - "localhost:8002"
generation_servers:
  num_instances: 1
  urls:
      - "localhost:8003"
When routing requests to the context servers, the disaggregated server will mark the requests as “context-only” to skip the generation phase. Similarly, when routing requests to the generation servers, the disaggregated server will mark the requests as “generation-only” to skip the context phase.
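The two-hop routing described above can be pictured with the following toy model. It is not the disaggregated server's implementation: the round-robin selection policy, the class name, and the request_type field are hypothetical and are used only to show that each client request becomes one context-only hop followed by one generation-only hop.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyOrchestrator:
    """Hypothetical orchestrator that picks servers round-robin (an assumption)."""

    context_urls: List[str]
    generation_urls: List[str]

    def __post_init__(self) -> None:
        self._ctx_cycle = itertools.cycle(self.context_urls)
        self._gen_cycle = itertools.cycle(self.generation_urls)

    def plan(self, prompt: str) -> List[Dict[str, str]]:
        """Return the two hops for one client request, in execution order."""
        return [
            # Hop 1: compute the KV cache only; the generation phase is skipped.
            {"url": next(self._ctx_cycle), "request_type": "context_only", "prompt": prompt},
            # Hop 2: reuse the transferred KV cache; the context phase is skipped.
            {"url": next(self._gen_cycle), "request_type": "generation_only", "prompt": prompt},
        ]
```

With the example configuration above, successive calls to `plan` would alternate between localhost:8001 and localhost:8002 for the context hop and always use localhost:8003 for the generation hop.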
Clients can then send requests to the disaggregated server at localhost:8000, which is an OpenAI compatible endpoint. For example, you can send requests to the disaggregated server using curl:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
"prompt": "NVIDIA is a great company because",
"max_tokens": 16,
"temperature": 0
}' -w "\n"
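The same request can be issued from Python using only the standard library. The sketch below mirrors the curl command above; the helper names are made up for illustration, and the send function assumes the servers from the earlier example are running.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str, prompt: str, max_tokens: int = 16
) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `send(build_completion_request("http://localhost:8000", "NVIDIA is a great company because"))` would return the decoded JSON response from the disaggregated server.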
Launching disaggregated servers on SLURM clusters#
Please refer to Disaggregated Inference Benchmark Scripts.
Dynamo#
The second approach involves the use of Dynamo, a data center-scale inference server developed specifically for LLM workloads. Dynamo introduces several advanced features not present in the other methods, including decoupled pre- and post-processing workers, which are particularly beneficial under high concurrency conditions. The disaggregated LLM inference workflow with Dynamo is illustrated in Figure 7.

Figure 7. Dynamo integration with disaggregated service
In the Dynamo workflow, requests are initially processed by pre- and post-processing workers, which then query a smart router to determine the optimal decode worker to route the requests to. Depending on the availability of KV cache blocks, the decode worker may bypass the prefill stage or forward the request to the prefill worker. Once the prefill worker is done processing the prompt, the KV cache blocks can be sent from the prefill worker to the decode worker, using the metadata referred to as ctx_params in the figure above.
Dynamo also includes built-in support for Kubernetes deployment, monitoring, and metrics collection. The development team is actively working on enabling dynamic instance scaling, further enhancing its suitability for production environments.
For more information on how to use Dynamo with TensorRT-LLM, please refer to this documentation.
Environment Variables#
TRT-LLM uses some environment variables to control the behavior of disaggregated service.
TRTLLM_PARALLEL_CACHE_SEND: If set to 1, contextExecutor will attempt to send KV cache for multiple requests in parallel. The default value is 0.
TRTLLM_DISABLE_KV_CACHE_TRANSFER_OVERLAP: If set to 1, generationExecutor will not overlap KV cache transfer with model inference. The default value is 0.
TRTLLM_ENABLE_KVCACHE_RECEIVE_PARALLEL: By default, when the generation rank receives KV cache from multiple context ranks within a single context instance, it receives KV cache from each rank sequentially. If set to 1, the generation rank will receive KV cache from each rank within one context instance in parallel. The default value is 0.
TRTLLM_REQUEST_KV_CACHE_CONCURRENT: If set to 1, generationExecutor prepares independent resources for each context executor to receive KV cache, and requests whose KV cache is received from different context executors will be processed concurrently. If set to 0, the generation executor will reuse the same resource to process KV cache transfer for each request sequentially, reducing the resources used by KV cache transmission and thereby lowering the risk of running out of memory. The default value is 0.
TRTLLM_TRY_ZCOPY_FOR_KVCACHE_TRANSFER: TRT-LLM typically copies non-contiguous data into a temporary buffer before sending KV cache. If set to 1, TRT-LLM will attempt to directly transmit each KV cache block, eliminating extra copies. The default value is 0.
TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE: By default, TRT-LLM uses a stream-ordered memory allocator to allocate temporary buffers. If this environment variable is set to #Size, TRT-LLM will use cudaMalloc to allocate a buffer of size #Size for KV cache transmission. The default value is 512MB. Users can set TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE=1GB to allocate a 1 GB buffer with cudaMalloc for KV cache transmission.
TRTLLM_KVCACHE_TRANSFER_USE_ASYNC_BUFFER: If set to 1, TRT-LLM will use cudaMallocAsync to allocate buffers for KV cache transmission. The default value is 0. This environment variable only takes effect when TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE is greater than 0.
TRTLLM_KVCACHE_SEND_MAX_CONCURRENCY_NUM: The maximum number of concurrent KV cache sends. The default value is 4. This environment variable only takes effect when TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE is greater than 0.
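When servers are launched from a script, these variables can be set per process rather than globally. The sketch below builds an environment dictionary for one server; the helper name and its parameters are hypothetical, and only the two variable names and the 1GB example value are taken from the list above.

```python
import os
from typing import Dict, Mapping, Optional


def kv_transfer_env(
    buffer_size: str = "1GB",
    parallel_send: bool = False,
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with KV cache transfer settings applied."""
    # Copy so that the caller's (or the current process's) environment is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["TRTLLM_KVCACHE_TRANSFER_BUFFER_SIZE"] = buffer_size
    env["TRTLLM_PARALLEL_CACHE_SEND"] = "1" if parallel_send else "0"
    return env
```

The result can be passed as the `env` argument of `subprocess.Popen` when starting a trtllm-serve process, so that context and generation servers can use different settings.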
There are some other useful environment variables that may help when encountering failures or performance issues.
NCCL_GRAPH_MIXING_SUPPORT: With the default value 1, the CUDA driver may create too many CUDA streams while working with one CUDA graph, leading to performance drop. Setting it to 0 will reduce the number of CUDA streams, but please make sure there are no other NCCL ops outside the one CUDA graph, otherwise it’s unsafe.
UCX_MAX_RNDV_RAILS: With the default value 2, UCX attempts to use two InfiniBand (IB) NIC devices per GPU for Rendezvous (RNDV) transfers. When both the context and generation instances enable tensor- and expert-parallel (TEP), multiple TP ranks may transfer KV cache concurrently. Because each TP rank can use up to two NIC devices, some NIC devices can be shared across GPUs, causing contention and reduced throughput. Setting UCX_MAX_RNDV_RAILS=1 can reduce contention in this case.
Troubleshooting and FAQ#
General FAQs#
Q. What are the limitations of disaggregated serving in TRT-LLM?
A. Currently, only decoder-only models and a beam width of 1 are supported. Also, the KV cache at each layer of the model is required to be homogeneous, with the same data type and the same number of attention heads.
Q. When using the TRT backend, is the engine used for disaggregated serving different from other engines?
A. No. There are no special requirements for the arguments used to build the engine.
Q. When using the TRT backend, do the engines used by the context and generation instances need to be the same?
A. No. The engines used by context and generation instances can be different, and their parallelism can be heterogeneous, i.e., TP and PP can be different, and TRT-LLM will handle the heterogeneity of the KV cache.
Q. Can a TRT-LLM server instance handle both context-only requests and generation-only requests?
A. Yes, but it’s not recommended. TRT-LLM does not implement optimal scheduling for the case where the instance handles mixed context-only requests and generation-only requests. It’s better to run context-only requests and generation-only requests on separate sets of servers.
Q. Does disaggregated serving in TRT-LLM support multi-GPU and multi-node?
A. Yes. It’s recommended that different server instances use different GPUs. We support running context and generation servers on the same node or on different nodes. The CUDA_VISIBLE_DEVICES environment variable can be used to control which GPUs are used by each instance.
Debugging FAQs#
Q. How do I handle the error “Disaggregated serving is not enabled, please check the configuration”?
A. Please set backendType of CacheTransceiverConfig.
ExecutorConfig executorConfig{...};
executorConfig.setCacheTransceiverConfig(texec::CacheTransceiverConfig(BackendType::DEFAULT));
Q. Does TRT-LLM support using GPU direct RDMA for inter-node KV Cache transfer?
A. Yes, TRT-LLM supports using GPU direct RDMA for inter-node KV cache transfer.
Q. What causes the substantial bandwidth fluctuations in kvCache transfers, especially during the first few requests following service initialization?
A. The connections used for KV cache transfer between executors are established dynamically. Connection establishment incurs significant overhead, and because this overhead is included in the measurement, the KV cache transfer bandwidth appears lower during the initial requests after service startup. When conducting benchmarks, it is recommended to perform a warm-up phase to ensure accurate performance measurements.
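A benchmark script can account for this by discarding the first few measurements. The helper below is a minimal sketch under that assumption; the function name is hypothetical, and the per-request bandwidth samples are expected to come from your own measurements.

```python
from statistics import mean
from typing import Sequence


def steady_state_bandwidth(samples_gbps: Sequence[float], warmup_requests: int) -> float:
    """Average per-request transfer bandwidth after dropping warm-up requests.

    The first `warmup_requests` samples include connection establishment
    overhead, so they are excluded from the average.
    """
    if warmup_requests < 0:
        raise ValueError("warmup_requests must be non-negative")
    steady = samples_gbps[warmup_requests:]
    if not steady:
        raise ValueError("no samples left after discarding warm-up requests")
    return mean(steady)
```

How many requests to discard depends on the deployment, since each new pair of executors has to establish its connection on first use.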
Q. When my servers are running in different NVLink domains, some servers hang or have lower performance. How can I fix that?
A. The NVLink domain can be found with nvidia-smi -q in the Fabric.ClusterUUID field. A few UCX environment variables can be adjusted when your servers are in different NVLink domains:
UCX_CUDA_IPC_ENABLE_MNNVL: Set to n. This also can reduce UCX timeout error messages like “UCX ERROR cuMemImportFromShareableHandle failed: invalid resource handle”, although these errors don’t necessarily cause your trtllm-serve to fail.
UCX_NET_DEVICES: Check if this is set correctly, or unset this variable to allow UCX to use all possible devices.
UCX_RNDV_SCHEME: Set to get_zcopy or put_zcopy on GB200 for better performance. The default value is auto.