trtllm-serve#

About#

The trtllm-serve command starts an OpenAI compatible server that supports the following endpoints:

/v1/models
/v1/completions
/v1/chat/completions

For information about the inference endpoints, refer to the OpenAI API Reference.

The server also supports the following endpoints:

/health
/metrics
/version

The metrics endpoint provides runtime-iteration statistics such as GPU memory use and inflight-batching details.
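A client usually wants to wait until the server is ready before it sends inference requests. The following is a minimal sketch of such a readiness check using only the Python standard library. It assumes the server listens on `http://localhost:8000` and that `/health` answers with HTTP 200 once the server is ready; the helper names `http_ok` and `wait_for_server` are hypothetical and not part of TensorRT-LLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200 (assumed to mean 'ready')."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_for_server(base_url: str,
                    timeout_s: float = 300.0,
                    interval_s: float = 2.0,
                    probe: Callable[[str], bool] = http_ok) -> bool:
    """Poll <base_url>/health until the probe succeeds or `timeout_s` elapses."""
    health_url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(health_url):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, `wait_for_server("http://localhost:8000")` returns `True` as soon as the health check passes. The `probe` parameter exists only so that the polling loop can be exercised without a running server.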

For encoder-only models (BERT-style classifiers, reward models, text-embedding models), the trtllm-serve embeddings subcommand starts a server that exposes an OpenAI-compatible /v1/embeddings endpoint with native dynamic batching. See Embeddings for details.

Starting a Server#

The following abbreviated command syntax shows the commonly used arguments to start a server:

trtllm-serve <model> [--tp_size <tp> --pp_size <pp> --ep_size <ep> --host <host> --port <port>]

For the full syntax and argument descriptions, refer to Syntax.

Inference Endpoints#

After you start the server, you can send inference requests through the Completions API, Chat API, and Responses API, which are compatible with the corresponding OpenAI APIs. The examples in the following sections use TinyLlama-1.1B-Chat-v1.0.

Chat API#

You can query the Chat API with any HTTP client. A typical example uses the OpenAI Python client:

### :title OpenAI Chat Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

response = client.chat.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    messages=[{
        "role": "system",
        "content": "you are a helpful assistant"
    }, {
        "role": "user",
        "content": "Where is New York?"
    }],
    max_tokens=20,
)
print(response)

Another example uses curl:

#! /usr/bin/env bash

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "messages":[{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Where is New York?"}],
        "max_tokens": 16,
        "temperature": 0
    }'

Completions API#

You can query the Completions API with any HTTP client. A typical example uses the OpenAI Python client:

### :title OpenAI Completion Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

response = client.completions.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    prompt="Where is New York?",
    max_tokens=20,
)
print(response)

Another example uses curl:

#! /usr/bin/env bash

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "prompt": "Where is New York?",
        "max_tokens": 16,
        "temperature": 0
    }'

Responses API#

You can query the Responses API with any HTTP client. A typical example uses the OpenAI Python client:

### :title OpenAI Responses Client

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="tensorrt_llm",
)

response = client.responses.create(
    model="TinyLlama-1.1B-Chat-v1.0",
    input="Where is New York?",
    max_output_tokens=20,
)
print(response)

Another example uses curl:

#! /usr/bin/env bash

curl http://localhost:8000/v1/responses \
    -H "Content-Type: application/json" \
    -d '{
        "model": "TinyLlama-1.1B-Chat-v1.0",
        "input": "Where is New York?",
        "max_output_tokens": 16
    }'

More OpenAI-compatible examples can be found in the compatibility examples directory.
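If the OpenAI Python package is not available, the same three endpoints can be called with the Python standard library. The sketch below mirrors the payloads of the curl examples; the helper names `build_request` and `send` are hypothetical, and the server address is assumed to be the one used in the examples above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed server address, as in the examples above
MODEL = "TinyLlama-1.1B-Chat-v1.0"


def build_request(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for an inference endpoint such as 'completions'."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout_s: float = 60.0) -> dict:
    """Send the request to a running server and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.load(resp)


# One request per endpoint, mirroring the curl examples.
chat_req = build_request("chat/completions", {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Where is New York?"}],
    "max_tokens": 16,
})
completion_req = build_request("completions", {
    "model": MODEL,
    "prompt": "Where is New York?",
    "max_tokens": 16,
})
responses_req = build_request("responses", {
    "model": MODEL,
    "input": "Where is New York?",
    "max_output_tokens": 16,
})
```

With a server running, `send(chat_req)` returns the decoded JSON response. Building the requests does not contact the server, so the payloads can be inspected or logged first.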

Multimodal Serving#

For multimodal models, you need to create a configuration file and start the server with additional options due to the following limitations:

TRT-LLM multimodal is currently not compatible with kv_cache_reuse
Multimodal models require chat_template, so only the Chat API is supported

To set up multimodal models:

First, create a configuration file:

cat >./config.yml<<EOF
kv_cache_config:
    enable_block_reuse: false
EOF

Then, start the server with the configuration file:

trtllm-serve Qwen/Qwen2-VL-7B-Instruct \
    --config ./config.yml

Multimodal Chat API#

You can query the multimodal Chat API with any HTTP client. The following examples use curl:

#! /usr/bin/env bash

# SINGLE IMAGE INFERENCE
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json"  \
    -d '{
        "model": "Qwen2.5-VL-3B-Instruct",
        "messages":[{
            "role": "system",
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the natural environment in the image."
                },
                {
                    "type":"image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
                    }
                }
            ]
        }],
        "max_tokens": 64,
        "temperature": 0
    }'

# MULTI IMAGE INFERENCE
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen2.5-VL-3B-Instruct",
        "messages":[{
            "role": "system",
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text":"Tell me the difference between two images"
                },
                {
                    "type":"image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png"
                    }
                },
                {
                    "type":"image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
                    }
                }
            ]
        }],
        "max_tokens": 64,
        "temperature": 0
    }'

# SINGLE VIDEO INFERENCE
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen2.5-VL-3B-Instruct",
        "messages":[{
            "role": "system",
            "content": "You are a helpful assistant."
        }, {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text":"Tell me what you see in the video briefly."
                },
                {
                    "type":"video_url",
                    "video_url": {
                        "url": "https://huggingface.co/datasets/Efficient-Large-Model/VILA-inference-demos/resolve/main/OAI-sora-tokyo-walk.mp4"
                    }
                }
            ]
        }],
        "max_tokens": 64,
        "temperature": 0
    }'
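The request bodies in the curl examples above can also be assembled programmatically. The following sketch builds the same message structure in Python; the helper names `build_user_message` and `build_chat_payload` are hypothetical, and the model name is taken from the curl examples.

```python
from typing import Iterable


def build_user_message(text: str,
                       image_urls: Iterable[str] = (),
                       video_urls: Iterable[str] = ()) -> dict:
    """Build one user message whose content mixes text, images, and videos."""
    content = [{"type": "text", "text": text}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    for url in video_urls:
        content.append({"type": "video_url", "video_url": {"url": url}})
    return {"role": "user", "content": content}


def build_chat_payload(model: str, user_message: dict, max_tokens: int = 64) -> dict:
    """Wrap the user message in a chat completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            user_message,
        ],
        "max_tokens": max_tokens,
        "temperature": 0,
    }


# Same request as the single-image curl example.
payload = build_chat_payload(
    "Qwen2.5-VL-3B-Instruct",
    build_user_message(
        "Describe the natural environment in the image.",
        image_urls=[
            "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png"
        ],
    ),
)
```

The resulting `payload` can be posted as JSON to `/v1/chat/completions`, or its `messages` list can be passed to the `chat.completions.create` method of the OpenAI Python client.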

Multimodal Modality Coverage#

TRT-LLM multimodal supports the following modalities and data types (depending on the model):

Text

No type specified:

{"role": "user", "content": "What's the capital of South Korea?"}

Explicit “text” type:

{"role": "user", "content": [{"type": "text", "text": "What's the capital of South Korea?"}]}

Image

Using “image_url” with URL:

{"role": "user", "content": [
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}}
]}

Using “image_url” with base64-encoded data:

{"role": "user", "content": [
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,{image_base64}"}}
]}

Note

To convert images to base64-encoded format, use the utility function tensorrt_llm.utils.load_base64_image(). Refer to the load_base64_image utility for implementation details.
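As an alternative illustration, a data URL of the form shown above can be produced with the Python standard library alone. This is a sketch, not the implementation of `load_base64_image`; the helper names `bytes_to_data_url` and `image_file_to_data_url` are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def bytes_to_data_url(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a 'data:<mime>;base64,<payload>' URL."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_file_to_data_url(path: str) -> str:
    """Read an image file and return it as a data URL, inferring the MIME type from the extension."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    return bytes_to_data_url(Path(path).read_bytes(), mime_type)
```

The returned string can be used directly as the `url` value of an `image_url` content item.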

Image embeddings

You can also provide precomputed image embeddings directly to the multimodal model.

Using “image_embeds” with base64-encoded data:

{"role": "user", "content": [
    {"type": "text", "text": "What's in this image?"},
    {"type": "image_embeds", "image_embeds": {"data": "{image_embeddings_base64}"}}
]}

Note

The contents of image_embeddings_base64 can be generated by base64-encoding the result of serializing a tensor via torch.save.

Video

Using “video_url”:

{"role": "user", "content": [
    {"type": "text", "text": "What's in this video?"},
    {"type": "video_url", "video_url": {"url": "https://example.com/video.mp4"}}
]}

Audio

Using “audio_url”:

{"role": "user", "content": [
    {"type": "text", "text": "What's in this audio?"},
    {"type": "audio_url", "audio_url": {"url": "https://example.com/audio.mp3"}}
]}

Visual Generation Serving#

trtllm-serve supports diffusion-based visual generation models (FLUX.1, FLUX.2, Wan2.1, Wan2.2) for image and video generation. When a diffusion model directory is provided (detected by the presence of model_index.json), the server automatically launches in visual generation mode with dedicated endpoints.

Note

VisualGen is in beta stage. APIs, supported models, and optimization options are actively evolving and may change in future releases.

# Video generation (Wan)
trtllm-serve Wan-AI/Wan2.2-T2V-A14B-Diffusers \
    --visual_gen_args config.yml

# Image generation (FLUX)
trtllm-serve black-forest-labs/FLUX.2-dev \
    --visual_gen_args config.yml

# Video generation (Cosmos3 hybrid checkpoint)
trtllm-serve nvidia/Cosmos3-Nano \
    --enable_visual_gen

For checkpoints that support both LLM and Visual Generation, such as Cosmos3, pass --enable_visual_gen to select the VisualGen runtime when --visual_gen_args is not specified. The --visual_gen_args flag accepts a YAML file that configures quantization, parallelism, and TeaCache. Available visual generation endpoints include /v1/images/generations, /v1/videos, /v1/videos/generations, and video management APIs.

For full details, see the visual generation feature documentation (../../models/visual-generation.md). Example client scripts are available in the examples/visual_gen/serve/ directory.

Multi-node Serving with Slurm#

You can deploy the DeepSeek-V3 model across two nodes with Slurm and trtllm-serve:

echo -e "enable_attention_dp: true\npytorch_backend_config:\n  enable_overlap_scheduler: true" > config.yml

srun -N 2 -w [NODES] \
    --output=benchmark_2node.log \
    --ntasks 16 --ntasks-per-node=8 \
    --mpi=pmix --gres=gpu:8 \
    --container-image=<CONTAINER_IMG> \
    --container-mounts=/workspace:/workspace \
    --container-workdir /workspace \
    bash -c "trtllm-llmapi-launch trtllm-serve deepseek-ai/DeepSeek-V3 --max_batch_size 161 --max_num_tokens 1160 --tp_size 16 --ep_size 4 --kv_cache_free_gpu_memory_fraction 0.95 --config ./config.yml"

See the source code of trtllm-llmapi-launch for more details.

Metrics Endpoint#

Note

The metrics endpoint for the default PyTorch backend is in beta and is not as comprehensive as the one for the TensorRT backend.

Some fields, such as CPU memory usage, are not yet available for the PyTorch backend.

Enabling enable_iter_perf_stats in the PyTorch backend can slightly impact performance, depending on the serving configuration.

The /metrics endpoint provides runtime iteration statistics such as GPU memory usage and KV cache details.

For the default PyTorch backend, iteration statistics logging is enabled by setting the enable_iter_perf_stats field in a YAML file:

# extra_llm_config.yaml
enable_iter_perf_stats: true

Start the server and specify the --config argument with the path to the YAML file:

trtllm-serve "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --config extra_llm_config.yaml

After sending at least one inference request to the server, you can fetch runtime iteration statistics by polling the /metrics endpoint. Since the statistics are stored in an internal queue and removed once retrieved, it’s recommended to poll the endpoint shortly after each request and store the results if needed.

curl -X GET http://localhost:8000/metrics

Example output:

[
    {
        "gpuMemUsage": 76665782272,
        "iter": 154,
        "iterLatencyMS": 7.00688362121582,
        "kvCacheStats": {
            "allocNewBlocks": 3126,
            "allocTotalBlocks": 3126,
            "cacheHitRate": 0.00128,
            "freeNumBlocks": 101253,
            "maxNumBlocks": 101256,
            "missedBlocks": 3121,
            "reusedBlocks": 4,
            "tokensPerBlock": 32,
            "usedNumBlocks": 3
        },
        "numActiveRequests": 1
        ...
    }
]
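Because the statistics are removed from the server once retrieved, it can help to fetch them from a script and keep the derived values. The sketch below polls `/metrics` and reduces each record to a few numbers. It assumes the field names shown in the example output and that `gpuMemUsage` is reported in bytes; the helper names `fetch_iteration_stats` and `summarize` are hypothetical.

```python
import json
import urllib.request


def fetch_iteration_stats(base_url: str = "http://localhost:8000",
                          timeout_s: float = 10.0) -> list:
    """GET /metrics. Retrieved entries are removed from the server, so keep the result."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=timeout_s) as resp:
        return json.load(resp)


def summarize(stats: list) -> list:
    """Reduce each iteration record to a few derived values."""
    rows = []
    for record in stats:
        kv = record.get("kvCacheStats", {})
        max_blocks = kv.get("maxNumBlocks", 0)
        used_blocks = kv.get("usedNumBlocks", 0)
        rows.append({
            "iter": record.get("iter"),
            # Assumes gpuMemUsage is in bytes.
            "gpu_mem_gib": record.get("gpuMemUsage", 0) / 2**30,
            "kv_block_utilization": used_blocks / max_blocks if max_blocks else 0.0,
            "active_requests": record.get("numActiveRequests"),
        })
    return rows
```

With a server running, `summarize(fetch_iteration_stats())` returns one row per iteration recorded since the previous poll.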

Configuring with YAML Files#

You can configure various options of trtllm-serve using YAML files by setting the --config option to the path of a YAML file. Explicit CLI flags take precedence over values in the YAML file; CLI flags that are not set fall back to the YAML values.
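The precedence rule can be illustrated with a small sketch. This is not the trtllm-serve implementation: the function `resolve_options` is hypothetical, only top-level options are considered, and a flag that was not given on the command line is modeled as `None`.

```python
from typing import Any, Mapping, Optional


def resolve_options(yaml_values: Mapping[str, Any],
                    cli_flags: Mapping[str, Optional[Any]]) -> dict:
    """Explicit CLI flags win; flags left unset (None) fall back to the YAML value."""
    resolved = dict(yaml_values)
    for name, value in cli_flags.items():
        if value is not None:
            resolved[name] = value
    return resolved


# The YAML file sets both options; only --max_batch_size is given on the command line.
yaml_values = {"max_batch_size": 8, "max_num_tokens": 2048}
cli_flags = {"max_batch_size": 16, "max_num_tokens": None}
resolved = resolve_options(yaml_values, cli_flags)
```

Here `resolved["max_batch_size"]` comes from the command line and `resolved["max_num_tokens"]` comes from the YAML file.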

Note

Non-breaking: --config <file.yaml> is the preferred flag for passing a YAML configuration file. Existing workflows using --extra_llm_api_options <file.yaml> continue to work; it is an equivalent alias.

The YAML file configures tensorrt_llm.llmapi.LlmArgs, a class with multiple levels of hierarchy. To set top-level arguments such as max_batch_size, write the YAML file like this:

max_batch_size: 8

To set nested arguments such as moe_config.backend, write the YAML file like this:

moe_config:
    backend: CUTLASS
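When options are tracked by their dotted names, such as moe_config.backend, the nested YAML layout can be generated with a short script. The sketch below is illustrative only: the helpers `nest` and `to_yaml` are hypothetical, and the emitter handles only nested mappings with simple scalar values (numbers, booleans, and strings that need no quoting).

```python
import json


def nest(dotted: dict) -> dict:
    """Turn {'moe_config.backend': 'CUTLASS'} into {'moe_config': {'backend': 'CUTLASS'}}."""
    nested: dict = {}
    for key, value in dotted.items():
        *parents, leaf = key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


def to_yaml(mapping: dict, indent: int = 0) -> str:
    """Emit a nested mapping of simple scalars as YAML text with 4-space indentation."""
    pad = " " * indent
    lines = []
    for key, value in mapping.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(to_yaml(value, indent + 4))
        elif isinstance(value, bool):
            # YAML booleans are conventionally lowercase: true / false.
            lines.append(f"{pad}{key}: {json.dumps(value)}")
        else:
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines)


config_text = to_yaml(nest({
    "max_batch_size": 8,
    "moe_config.backend": "CUTLASS",
    "kv_cache_config.enable_block_reuse": False,
}))
```

Writing `config_text` to a file produces a configuration that combines the top-level and nested examples in this section with the kv_cache_config example from the multimodal section.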

Syntax#

This section lists all command-line arguments for the trtllm-serve subcommands. Some arguments carry a stability tag that indicates their development status. Refer to the API Reference for details.

trtllm-serve#

Usage

trtllm-serve [OPTIONS] COMMAND [ARGS]...

disaggregated#

Running server in disaggregated mode

Usage

trtllm-serve disaggregated [OPTIONS]

Options

-c, --config, --config_file <config_file>#: beta Path to the disaggregated serving configuration YAML file.

-m, --metadata_server_config_file <metadata_server_config_file>#: prototype Path to metadata server config file

-t, --server_start_timeout <server_start_timeout>#: beta Server start timeout

-r, --request_timeout <request_timeout>#: beta Request timeout

-l, --log_level <log_level>#

beta The logging level.

Options:: internal_error | error | warning | info | verbose | debug | trace

-s, --schedule_style <schedule_style>#

beta The schedule style for the disaggregated server.

Options:: context_first | generation_first

--metrics-log-interval <metrics_log_interval>#: deprecated [Deprecated] The interval of logging metrics in seconds. This option is not connected to any functionality and will be removed in a future release.

disaggregated_mpi_worker#

Launching disaggregated MPI worker

Usage

trtllm-serve disaggregated_mpi_worker [OPTIONS]

Options

-c, --config, --config_file <config_file>#: beta Path to the disaggregated serving configuration YAML file.

--log_level <log_level>#

beta The logging level.

Options:: internal_error | error | warning | info | verbose | debug | trace

embeddings#

Run an OpenAI-compatible /v1/embeddings server for encoder-only models.

Coalesces concurrent requests with a dynamic batcher and serves them through the synchronous llm.encode() fast path (no KV cache / sampler / scheduler). Single-GPU only: the command does not expose tensor/pipeline parallelism.

MODEL: model name | HF checkpoint path

Usage

trtllm-serve embeddings [OPTIONS] MODEL

Options

--host <host>#: Hostname of the server.

--port <port>#: Port of the server.

--log_level <log_level>#

The logging level.

Options:: internal_error | error | warning | info | verbose | debug | trace

--max_batch_size <max_batch_size>#: Maximum batch size coalesced into a single encode() call.

--max_num_tokens <max_num_tokens>#: Maximum number of batched input tokens in each encode() call.

--max_queue_delay <max_queue_delay>#: Dynamic-batching hold window in seconds: how long an incoming request waits for others to join its batch before being dispatched (mirrors Triton’s max_queue_delay_microseconds).

--max_queue_size <max_queue_size>#: Maximum number of in-flight queued requests; further requests are rejected with HTTP 429 (mirrors Triton’s max_queue_size).

--trust_remote_code#: Flag for HF transformers.

--config, --extra_llm_api_options <extra_llm_api_options>#: Path to a YAML configuration file. Explicit CLI flags take precedence over values in this file.

--hf_revision, --revision <revision>#: The revision to use for the HuggingFace model (branch name, tag name, or commit id).

--metadata_server_config_file <metadata_server_config_file>#: Path to metadata server config file

--telemetry, --no-telemetry#: Enable or disable anonymous usage telemetry collection.

Arguments

MODEL#: Required argument

mm_embedding_serve#

Running an OpenAI API compatible server

MODEL: model name | HF checkpoint path | TensorRT engine path

Usage

trtllm-serve mm_embedding_serve [OPTIONS] MODEL

Options

--host <host>#: beta Hostname of the server.

--port <port>#: beta Port of the server.

--log_level <log_level>#

beta The logging level.

Options:: internal_error | error | warning | info | verbose | debug | trace

--max_batch_size <max_batch_size>#: beta Maximum number of requests that the engine can schedule.

--max_num_tokens <max_num_tokens>#: beta Maximum number of batched input tokens after padding is removed in each batch.

--gpus_per_node <gpus_per_node>#: beta Number of GPUs per node. Defaults to None, in which case the value is detected automatically.

--trust_remote_code#: beta Flag for HF transformers.

--config, --extra_encoder_options <extra_encoder_options>#: prototype Path to a YAML configuration file. Explicit CLI flags take precedence over values in this file. Prefer --config over --extra_encoder_options.

--hf_revision, --revision <revision>#: beta The revision to use for the HuggingFace model (branch name, tag name, or commit id).

--free_gpu_memory_fraction <free_gpu_memory_fraction>#: beta Free GPU memory fraction reserved for KV Cache, after allocating model weights and buffers.

--tensor_parallel_size, --tp_size <tensor_parallel_size>#: beta Tensor parallelism size.

--metadata_server_config_file <metadata_server_config_file>#: prototype Path to metadata server config file

--allow_request_chat_template#: prototype Allow clients to supply per-request chat_template values. Only enable this for trusted clients.

--telemetry, --no-telemetry#: beta Enable or disable anonymous usage telemetry collection.

Arguments

MODEL#: Required argument

serve#

Running an OpenAI API compatible server

MODEL: model name | HF checkpoint path | TensorRT engine path

Usage

trtllm-serve serve [OPTIONS] MODEL

Options

--tokenizer <tokenizer>#: beta Path or name of the tokenizer. When using the PyTorch backend, this replaces the default HuggingFace tokenizer.

--custom_tokenizer <custom_tokenizer>#: prototype Custom tokenizer type: alias (e.g., ‘deepseek_v32’) or Python import path (e.g., ‘tensorrt_llm.tokenizer.deepseek_v32.DeepseekV32Tokenizer’).

--post_processor_hook <post_processor_hook>#: prototype Python import path of a user post-processing hook applied after detokenization and before the per-endpoint response formatter (e.g. ‘my_pkg.guardrail.MyPostProcessorHook’). The class must be importable and picklable, take no constructor arguments, and be callable as ‘__call__(chunk) -> verdict’ (see tensorrt_llm.executor.postprocessor_hook). It runs once per output, per streaming chunk, and may rewrite, suppress, or terminate the output; it owns its own per-request state.

--host <host>#: beta Hostname of the server.

--port <port>#: beta Port of the server.

--backend <backend>#

beta The backend to use to serve the model. Default is pytorch backend.

Options:: pytorch | _autodeploy

--custom_module_dirs <custom_module_dirs>#: prototype Paths to custom module directories to import.

--log_level <log_level>#

beta The logging level.

Options:: internal_error | error | warning | info | verbose | debug | trace

--max_beam_width <max_beam_width>#: beta Maximum number of beams for beam search decoding.

--max_batch_size <max_batch_size>#: beta Maximum number of requests that the engine can schedule.

--max_num_tokens <max_num_tokens>#: beta Maximum number of batched input tokens after padding is removed in each batch.

--max_seq_len <max_seq_len>#: beta Maximum total length of one request, including prompt and outputs. If unspecified, the value is deduced from the model config.

--tensor_parallel_size, --tp_size <tensor_parallel_size>#: beta Tensor parallelism size.

--pipeline_parallel_size, --pp_size <pipeline_parallel_size>#: beta Pipeline parallelism size.

--context_parallel_size, --cp_size <context_parallel_size>#: beta Context parallelism size.

--moe_expert_parallel_size, --ep_size <moe_expert_parallel_size>#: beta Expert parallelism size.

--moe_cluster_parallel_size, --cluster_size <moe_cluster_parallel_size>#: deprecated [Deprecated] Expert cluster parallelism size. This option is no longer supported and will be removed in a future release.

--gpus_per_node <gpus_per_node>#: beta Number of GPUs per node. Defaults to None, in which case the value is detected automatically.

--free_gpu_memory_fraction, --kv_cache_free_gpu_memory_fraction <free_gpu_memory_fraction>#: beta Free GPU memory fraction reserved for KV Cache, after allocating model weights and buffers.

--kv_cache_dtype <kv_cache_dtype>#

prototype KV cache quantization dtype for PyTorch backend. ‘auto’ uses checkpoint/model metadata; explicit values force override.

Options:: auto | fp8 | nvfp4

--num_postprocess_workers <num_postprocess_workers>#: prototype Number of workers to postprocess raw responses to comply with OpenAI protocol.

--num_input_processor_workers <num_input_processor_workers>#: prototype Size of the dedicated thread pool that runs the HF input processor (multimodal preprocess) on the chat and completion endpoints.

--num_media_load_workers <num_media_load_workers>#: prototype Size of the dedicated thread pool that decodes media payloads (image / video / audio) for multimodal requests.

--trust_remote_code#: beta Flag for HF transformers.

--hf_revision, --revision <revision>#: beta The revision to use for the HuggingFace model (branch name, tag name, or commit id). Prefer --hf_revision over --revision.

--config, --extra_llm_api_options <extra_llm_api_options>#: prototype Path to a YAML configuration file. Explicit CLI flags take precedence over values in this file. Can be specified as either --config or --extra_llm_api_options.

--reasoning_parser <reasoning_parser>#

prototype Specify the parser for reasoning models. Use ‘auto’ to automatically select based on the model.

Options:: auto | minimax_m2_append_think | minimax_m2 | qwen3_5 | qwen3 | laguna | deepseek-r1 | deepseek_v4 | minimax_m3 | nano-v3 | nemotron-v3 | gemma4 | kimi_k25 | kimi_k2

--tool_parser <tool_parser>#

prototype Specify the parser for tool models. Use ‘auto’ to automatically select based on the model.

Options:: auto | qwen3 | qwen3_coder | kimi_k2 | deepseek_v3 | deepseek_v31 | deepseek_v32 | deepseek_v4 | gemma4 | glm4 | glm47 | minimax_m2 | minimax_m3 | poolside_v1

--metadata_server_config_file <metadata_server_config_file>#: prototype Path to metadata server config file

--server_role <server_role>#: prototype Server role for disaggregated serving. CONTEXT=prefill (prompt processing), GENERATION=decode (token generation), MM_ENCODER=multimodal encoder, VISUAL_GEN=visual generation. Required when using service registry.

--fail_fast_on_attention_window_too_large#: deprecated [Deprecated] Exit with runtime error when attention window is too large to fit even a single sequence in the KV cache. Now defaults to True. This flag only affects the TRT backend and will be removed in a future release.

--otlp_traces_endpoint <otlp_traces_endpoint>#: prototype Target URL to which OpenTelemetry traces will be sent.

--telemetry, --no-telemetry#: beta Enable or disable anonymous usage telemetry collection.

--disagg_cluster_uri <disagg_cluster_uri>#: prototype URI of the disaggregated cluster.

--enable_chunked_prefill#: prototype Enable chunked prefill

--enable_attention_dp#: beta Enable attention data parallel.

--media_io_kwargs <media_io_kwargs>#: prototype Keyword arguments for media I/O as a JSON string. Keys are modality names ("video", "image", "audio") whose values are dicts of keyword arguments forwarded to the corresponding loader. Example: '{"video": {"extract_audio": true, "num_frames": 16}}' to enable audio extraction from video files.

--video_pruning_rate <video_pruning_rate>#: prototype Pruning rate for video frames in multimodal models. Applied by Efficient Video Sampling (EVS). None disables EVS, values in [0, 1) enable pruning.

--chat_template <chat_template>#: prototype Specify a custom chat template. Can be a file path or one-liner template string

--allow_request_chat_template#: prototype Allow clients to supply per-request chat_template values. Only enable this for trusted clients.

--middleware <middleware>#: prototype FastAPI middleware import path to add to the server app. Can be specified multiple times. Each value must point to either a middleware class or an async HTTP middleware function.

--grpc#: prototype Run gRPC server instead of OpenAI HTTP server. gRPC server accepts pre-tokenized requests and returns raw token IDs.

--served_model_name <served_model_name>#: prototype The model name used in the API. If not specified, the model path is used as the model name. This is useful when the model path is long or when you want to expose a custom name to clients.

--enable_visual_gen#: prototype Enable VisualGen runtime for model checkpoints that support both LLM and Visual Generation. Not required if --visual_gen_args is specified or the model supports Visual Generation only.

--visual_gen_args <visual_gen_args>#: prototype Path to a YAML file with VisualGen engine args.

--agent_percentage <agent_percentage>#: prototype The percentage of agent requests to schedule. Defaults to 0.0. Should be between 0.0 and 1.0.

--agent_types <agent_types>#: prototype Types of agents to schedule. Currently, only the Open Deep Research agent is supported.

Arguments

MODEL#: Required argument

In addition to the examples above, trtllm-serve is also used as an entrypoint for performance benchmarking. Refer to Performance Benchmarking with trtllm-serve for more details.