trtllm-serve#

About#

The trtllm-serve command starts an OpenAI-compatible server that supports the following endpoints:

  • /v1/models

  • /v1/completions

  • /v1/chat/completions

For information about the inference endpoints, refer to the OpenAI API Reference.
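
For example, once a server is running (see Starting a Server below), you can list the models it exposes by querying /v1/models directly. This is a minimal sketch and assumes the localhost:8000 address used in the client examples later on this page:

#! /usr/bin/env bash
# Minimal sketch: list the models exposed by a running server.
# Assumes the server is listening on localhost:8000.
curl -s http://localhost:8000/v1/models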

The server also supports the following endpoints:

  • /health

  • /metrics

  • /version

The metrics endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details.
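
The /health and /version endpoints are useful for liveness checks and deployment automation. The following sketch probes both with curl, again assuming a server is already listening on localhost:8000:

#! /usr/bin/env bash
# Minimal sketch: probe the auxiliary endpoints of a running server.
# Assumes the server is listening on localhost:8000.
curl -s http://localhost:8000/health     # expected to return HTTP 200 when the server is live
curl -s http://localhost:8000/version    # reports the server version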

Starting a Server#

The following abbreviated command syntax shows the commonly used arguments to start a server:

trtllm-serve <model> [--backend pytorch --tp_size <tp> --pp_size <pp> --ep_size <ep> --host <host> --port <port>]

For the full syntax and argument descriptions, refer to the Syntax section below.
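
For example, a single-GPU server for the TinyLlama checkpoint used in the examples below could be started as follows. The Hugging Face model ID is illustrative; substitute the path or ID of your own model:

#! /usr/bin/env bash
# Illustrative invocation: serve TinyLlama-1.1B-Chat-v1.0 with the PyTorch
# backend on the host and port assumed by the client examples on this page.
# The model ID below is an assumption; replace it with your own model.
trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" \
    --backend pytorch \
    --host localhost \
    --port 8000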

Inference Endpoints#

After you start the server, you can send inference requests through the Completions API and the Chat API, which are compatible with the corresponding OpenAI APIs.

Chat API#

You can query the Chat API with any HTTP client. A typical example uses the OpenAI Python client:

### OpenAI Chat Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

response = client.chat.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    messages=[{
        "role": "system",
        "content": "you are a helpful assistant"
    }, {
        "role": "user",
        "content": "Where is New York?"
    }],
    max_tokens=20,
)
print(response)

Another example uses curl:

#! /usr/bin/env bash

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "messages":[{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Where is New York?"}],
        "max_tokens": 16,
        "temperature": 0
    }'
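
The Chat API also accepts the standard OpenAI stream parameter. Assuming your server supports streaming, the following sketch requests a streamed response, which is delivered as server-sent events:

#! /usr/bin/env bash
# Sketch: request a streamed chat completion. Assumes the server honors the
# standard OpenAI "stream" parameter; curl -N disables buffering so the
# server-sent events are printed as they arrive.
curl -N http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "messages": [{"role": "user", "content": "Where is New York?"}],
        "max_tokens": 16,
        "stream": true
    }'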

Completions API#

You can query the Completions API with any HTTP client. A typical example uses the OpenAI Python client:

### OpenAI Completion Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

response = client.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    prompt="Where is New York?",
    max_tokens=20,
)
print(response)

Another example uses curl:

#! /usr/bin/env bash

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "prompt": "Where is New York?",
        "max_tokens": 16,
        "temperature": 0
    }'

Metrics Endpoint#

Note

This endpoint is at beta maturity.

The statistics for the PyTorch backend are beta and not as comprehensive as those for the TensorRT backend.

Some fields, such as CPU memory usage, are not available for the PyTorch backend.

Enabling enable_iter_perf_stats in the PyTorch backend can impact performance slightly, depending on the serving configuration.

The /metrics endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details. For the TensorRT backend, these statistics are enabled by default. However, for the PyTorch backend, specified with the --backend pytorch argument, you must explicitly enable iteration statistics logging by setting the enable_iter_perf_stats field in a YAML configuration file as shown in the following example:

# extra-llm-api-config.yml
pytorch_backend_config:
  enable_iter_perf_stats: true

Then start the server and specify the --extra_llm_api_options argument with the path to the YAML file as shown in the following example:

trtllm-serve <model> \
  --extra_llm_api_options <path-to-extra-llm-api-config.yml> \
  [--backend pytorch --tp_size <tp> --pp_size <pp> --ep_size <ep> --host <host> --port <port>]

After at least one inference request is sent to the server, you can fetch the runtime-iteration statistics by polling the /metrics endpoint:

curl -X GET http://<host>:<port>/metrics

Example Output

[
    {
        "gpuMemUsage": 56401920000,
        "inflightBatchingStats": {
            ...
        },
        "iter": 1,
        "iterLatencyMS": 16.505143404006958,
        "kvCacheStats": {
            ...
        },
        "newActiveRequestsQueueLatencyMS": 0.0007503032684326172
    }
]
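
To watch these statistics over time, a simple polling loop is enough. The sketch below queries the endpoint once per second and extracts a few fields from the JSON array shown above; it assumes jq is installed and that <host>:<port> is replaced with the values used to start the server:

#! /usr/bin/env bash
# Sketch: poll /metrics once per second and print a few fields from each
# iteration-statistics entry. Assumes jq is installed; replace <host>:<port>
# with the address of your server.
while true; do
    curl -s -X GET http://<host>:<port>/metrics \
        | jq -r '.[] | "iter=\(.iter) iterLatencyMS=\(.iterLatencyMS) gpuMemUsage=\(.gpuMemUsage)"'
    sleep 1
done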

Syntax#

trtllm-serve#

trtllm-serve [OPTIONS] COMMAND [ARGS]...

Commands

  • disaggregated: Running server in disaggregated mode

  • disaggregated_mpi_worker: Launching disaggregated MPI worker

  • serve: Running an OpenAI API compatible server