# Serving with trtllm-serve
AutoDeploy integrates with the OpenAI-compatible `trtllm-serve` CLI so you can expose AutoDeploy-optimized models over HTTP without writing server code. This page shows how to launch the server with the AutoDeploy backend, configure it via YAML, and validate it with a simple request.
## Quick start
Launch `trtllm-serve` with the AutoDeploy backend by setting `--backend _autodeploy`:

```bash
trtllm-serve \
  meta-llama/Llama-3.1-8B-Instruct \
  --backend _autodeploy
```

- `model`: Hugging Face model name or local path
- `--backend _autodeploy`: uses the AutoDeploy runtime
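Startup can take a while because the model is downloaded and optimized first. If you want to script the wait, the sketch below polls until the serving port accepts TCP connections; it assumes the default `localhost:8000` address used in the examples on this page.

```python
import socket
import time

def wait_for_server(host: str = "localhost", port: int = 8000, timeout_s: float = 600.0) -> None:
    """Poll until the serving port accepts TCP connections."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=5):
                return  # server is listening
        except OSError:
            time.sleep(2)  # model load and graph optimization can take a while
    raise TimeoutError(f"Server at {host}:{port} did not become reachable")

wait_for_server()
```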
Once the server is ready, test with an OpenAI-compatible request:
```bash
curl -s http://localhost:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages":[{"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": "Where is New York? Tell me in a single sentence."}],
        "max_tokens": 32
    }'
```
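Because the endpoint is OpenAI-compatible, you can also use the `openai` Python client by pointing `base_url` at the local server. A minimal sketch (the `api_key` value is a placeholder; a local server typically does not validate it):

```python
from openai import OpenAI

# Point the OpenAI client at the local trtllm-serve endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Where is New York? Tell me in a single sentence."},
    ],
    max_tokens=32,
)
print(response.choices[0].message.content)
```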
## Configuration via YAML
Use `--extra_llm_api_options` to supply a YAML file that augments or overrides server and runtime settings.
```bash
trtllm-serve \
  meta-llama/Llama-3.1-8B \
  --backend _autodeploy \
  --extra_llm_api_options autodeploy_config.yaml
```
Example `autodeploy_config.yaml`:
```yaml
# Compilation backend for AutoDeploy
compile_backend: torch-opt   # options: torch-simple, torch-compile, torch-cudagraph, torch-opt

# Runtime engine
runtime: trtllm              # options: trtllm, demollm

# Model loading
skip_loading_weights: false  # set true for architecture-only perf runs

# KV cache memory
free_mem_ratio: 0.8          # fraction of free GPU memory used for the KV cache

# CUDA graph optimization
cuda_graph_batch_sizes: [1, 2, 4, 8, 16, 32, 64]

# Attention backend
attn_backend: flashinfer     # recommended for best performance
```
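If you prefer to generate the file programmatically (for example, when sweeping settings in experiments), the same configuration can be written out with PyYAML. A small sketch, assuming the `pyyaml` package is installed:

```python
import yaml  # PyYAML

# Same settings as the hand-written example above.
autodeploy_config = {
    "compile_backend": "torch-opt",
    "runtime": "trtllm",
    "skip_loading_weights": False,
    "free_mem_ratio": 0.8,
    "cuda_graph_batch_sizes": [1, 2, 4, 8, 16, 32, 64],
    "attn_backend": "flashinfer",
}

with open("autodeploy_config.yaml", "w") as f:
    yaml.safe_dump(autodeploy_config, f, sort_keys=False)
```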
## Limitations and tips

- KV cache block reuse is disabled automatically for the AutoDeploy backend.
- The AutoDeploy backend does not yet support disaggregated serving (work in progress).
- For best performance:
  - Prefer `compile_backend: torch-opt`.
  - Use `attn_backend: flashinfer`.
  - Set realistic `cuda_graph_batch_sizes` that match expected traffic.
  - Tune `free_mem_ratio` to 0.8–0.9.