trtllm-serve#
About#
The trtllm-serve command starts an OpenAI-compatible server that supports the following endpoints:
/v1/models
/v1/completions
/v1/chat/completions
For information about the inference endpoints, refer to the OpenAI API Reference.
The server also supports the following endpoints:
/health
/metrics
/version
The /metrics endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details.
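Once a server is running, these utility endpoints can be exercised with plain HTTP requests. The sketch below is illustrative and assumes a server listening on localhost:8000, as in the examples later on this page.
#! /usr/bin/env bash

# Basic liveness check of the server.
curl -i http://localhost:8000/health

# Query the server version information.
curl http://localhost:8000/version

# Fetch runtime-iteration statistics; see the Metrics Endpoint section below.
curl http://localhost:8000/metrics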
Starting a Server#
The following abbreviated command syntax shows the commonly used arguments to start a server:
trtllm-serve <model> [--backend pytorch --tp_size <tp> --pp_size <pp> --ep_size <ep> --host <host> --port <port>]
For the full syntax and argument descriptions, refer to Syntax.
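For example, a minimal invocation that serves TinyLlama-1.1B-Chat-v1.0 with the PyTorch backend might look like the following; the host and port values are illustrative.
#! /usr/bin/env bash

# Serve TinyLlama-1.1B-Chat-v1.0 with the PyTorch backend, listening on port 8000.
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --backend pytorch \
    --host 0.0.0.0 \
    --port 8000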
Inference Endpoints#
After you start the server, you can send inference requests through the Completions API and the Chat API, which are compatible with the corresponding OpenAI APIs. The examples in the following sections use TinyLlama-1.1B-Chat-v1.0.
Chat API#
You can query the Chat API with any HTTP client; a typical example uses the OpenAI Python client:
### OpenAI Chat Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

response = client.chat.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    messages=[{
        "role": "system",
        "content": "you are a helpful assistant"
    }, {
        "role": "user",
        "content": "Where is New York?"
    }],
    max_tokens=20,
)
print(response)
Another example uses curl:
#! /usr/bin/env bash

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "system", "content": "You are a helpful assistant."},
                     {"role": "user", "content": "Where is New York?"}],
        "max_tokens": 16,
        "temperature": 0
    }'
Completions API#
You can query the Completions API with any HTTP client; a typical example uses the OpenAI Python client:
### OpenAI Completion Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

response = client.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    prompt="Where is New York?",
    max_tokens=20,
)
print(response)
Another example uses curl:
#! /usr/bin/env bash

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "prompt": "Where is New York?",
        "max_tokens": 16,
        "temperature": 0
    }'
Multimodal Serving#
For multimodal models (e.g., Qwen2-VL), you’ll need to create a configuration file and start the server with additional options:
First, create a configuration file:
cat >./extra-llm-api-config.yml<<EOF
kv_cache_config:
  enable_block_reuse: false
EOF
Then, start the server with the configuration file:
trtllm-serve Qwen/Qwen2-VL-7B-Instruct \
    --extra_llm_api_options ./extra-llm-api-config.yml \
    --backend pytorch
You can then send requests through the same Chat and Completions APIs shown above, using any HTTP client such as the OpenAI Python client or curl.
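For image inputs, the request can carry the image inside the chat message content. The following curl sketch assumes the server accepts the OpenAI-style image_url content format; the image URL is a placeholder.
#! /usr/bin/env bash

# Illustrative multimodal chat request for the Qwen2-VL server.
# Replace the placeholder URL with a reachable image.
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen2-VL-7B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}}
            ]
        }],
        "max_tokens": 32
    }'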
Benchmark#
You can use any benchmarking client compatible with the OpenAI API to test the serving performance of trtllm-serve; we recommend genai-perf, and the following is a benchmarking recipe.
First, install genai-perf with pip:
pip install genai-perf
Then, start a server with trtllm-serve and TinyLlama-1.1B-Chat-v1.0.
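A minimal invocation for this step might look like the following; the host and port are illustrative, but they must match the --url value passed to genai-perf below.
#! /usr/bin/env bash

# Serve TinyLlama-1.1B-Chat-v1.0 so that genai-perf can reach it at localhost:8000.
trtllm-serve TinyLlama/TinyLlama-1.1B-Chat-v1.0 --host localhost --port 8000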
Finally, test performance with the following command:
#! /usr/bin/env bash

genai-perf profile \
    -m TinyLlama-1.1B-Chat-v1.0 \
    --tokenizer TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --service-kind openai \
    --endpoint-type chat \
    --random-seed 123 \
    --synthetic-input-tokens-mean 128 \
    --synthetic-input-tokens-stddev 0 \
    --output-tokens-mean 128 \
    --output-tokens-stddev 0 \
    --request-count 100 \
    --request-rate 10 \
    --profile-export-file my_profile_export.json \
    --url localhost:8000 \
    --streaming
Refer to the genai-perf README for more guidance.
Multi-node Serving with Slurm#
You can deploy the DeepSeek-V3 model across two nodes with Slurm and trtllm-serve. First, create the extra LLM API configuration file:
echo -e "enable_attention_dp: true\npytorch_backend_config:\n enable_overlap_scheduler: true" > extra-llm-api-config.yml
Then launch the server across both nodes with srun:
srun -N 2 -w [NODES] \
    --output=benchmark_2node.log \
    --ntasks 16 --ntasks-per-node=8 \
    --mpi=pmix --gres=gpu:8 \
    --container-image=<CONTAINER_IMG> \
    --container-mounts=/workspace:/workspace \
    --container-workdir /workspace \
    bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 --backend pytorch --max_batch_size 161 --max_num_tokens 1160 --tp_size 16 --ep_size 4 --kv_cache_free_gpu_memory_fraction 0.95 --extra_llm_api_options ./extra-llm-api-config.yml"
See the source code of trtllm-llmapi-launch for more details.
Metrics Endpoint#
Note
This endpoint is in beta maturity.
The statistics for the PyTorch backend are beta and not as comprehensive as those for the TensorRT backend.
Some fields, such as CPU memory usage, are not available for the PyTorch backend.
Enabling enable_iter_perf_stats in the PyTorch backend can impact performance slightly, depending on the serving configuration.
The /metrics endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details.
For the TensorRT backend, these statistics are enabled by default.
However, for the PyTorch backend, specified with the --backend pytorch argument, you must explicitly enable iteration statistics logging by setting the enable_iter_perf_stats field in a YAML configuration file, as shown in the following example:
# extra-llm-api-config.yml
pytorch_backend_config:
  enable_iter_perf_stats: true
Then start the server and specify the --extra_llm_api_options argument with the path to the YAML file, as shown in the following example:
trtllm-serve <model> \
    --extra_llm_api_options <path-to-extra-llm-api-config.yml> \
    [--backend pytorch --tp_size <tp> --pp_size <pp> --ep_size <ep> --host <host> --port <port>]
After at least one inference request is sent to the server, you can fetch the runtime-iteration statistics by polling the /metrics endpoint:
curl -X GET http://<host>:<port>/metrics
Example Output
[
  {
    "gpuMemUsage": 56401920000,
    "inflightBatchingStats": {
      ...
    },
    "iter": 1,
    "iterLatencyMS": 16.505143404006958,
    "kvCacheStats": {
      ...
    },
    "newActiveRequestsQueueLatencyMS": 0.0007503032684326172
  }
]
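If you want to watch these statistics while traffic is flowing, you can poll the endpoint in a loop. The following sketch assumes jq is installed and the server listens on localhost:8000.
#! /usr/bin/env bash

# Poll /metrics once per second and print a few headline fields.
# Assumes jq is available and the server runs on localhost:8000.
while true; do
    curl -s http://localhost:8000/metrics \
        | jq -r '.[] | "iter=\(.iter) gpuMemUsage=\(.gpuMemUsage) iterLatencyMS=\(.iterLatencyMS)"'
    sleep 1
done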
Syntax#
trtllm-serve#
trtllm-serve [OPTIONS] COMMAND [ARGS]...
disaggregated#
Running server in disaggregated mode
trtllm-serve disaggregated [OPTIONS]
Options
- -c, --config_file <config_file>#
Specific option for disaggregated mode.
- -t, --server_start_timeout <server_start_timeout>#
Server start timeout
- -r, --request_timeout <request_timeout>#
Request timeout
disaggregated_mpi_worker#
Launching disaggregated MPI worker
trtllm-serve disaggregated_mpi_worker [OPTIONS]
Options
- -c, --config_file <config_file>#
Specific option for disaggregated mode.
- --log_level <log_level>#
The logging level.
- Options:
internal_error | error | warning | info | verbose | debug
serve#
Running an OpenAI API compatible server
MODEL: model name | HF checkpoint path | TensorRT engine path
trtllm-serve serve [OPTIONS] MODEL
Options
- --tokenizer <tokenizer>#
Path | Name of the tokenizer. Specify this value only if using a TensorRT engine as the model.
- --host <host>#
Hostname of the server.
- --port <port>#
Port of the server.
- --backend <backend>#
Set to ‘pytorch’ for the PyTorch path. The default is the C++ path.
- Options:
pytorch
- --log_level <log_level>#
The logging level.
- Options:
internal_error | error | warning | info | verbose | debug
- --max_beam_width <max_beam_width>#
Maximum number of beams for beam search decoding.
- --max_batch_size <max_batch_size>#
Maximum number of requests that the engine can schedule.
- --max_num_tokens <max_num_tokens>#
Maximum number of batched input tokens after padding is removed in each batch.
- --max_seq_len <max_seq_len>#
Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config.
- --tp_size <tp_size>#
Tensor parallelism size.
- --pp_size <pp_size>#
Pipeline parallelism size.
- --ep_size <ep_size>#
Expert parallelism size.
- --gpus_per_node <gpus_per_node>#
Number of GPUs per node. Defaults to None, in which case the value is detected automatically.
- --kv_cache_free_gpu_memory_fraction <kv_cache_free_gpu_memory_fraction>#
Free GPU memory fraction reserved for KV Cache, after allocating model weights and buffers.
- --num_postprocess_workers <num_postprocess_workers>#
[Experimental] Number of workers to postprocess raw responses to comply with OpenAI protocol.
- --trust_remote_code#
Flag for HF transformers.
- --extra_llm_api_options <extra_llm_api_options>#
Path to a YAML file that overwrites the parameters specified by trtllm-serve.
Arguments
- MODEL#
Required argument