Experimental High-Level Python API and Server#
The experimental Python API wraps export, engine build, engine loading, generation, streaming, and OpenAI-compatible serving.
Status: Experimental. API may change between releases.
Prerequisites#
Build the C++ runtime and pybind extension first:
mkdir -p build && cd build
cmake .. -DTRT_PACKAGE_DIR=$TRT_PACKAGE_DIR -DBUILD_PYTHON_BINDINGS=ON
make -j$(nproc)
Install server dependencies:
pip install pybind11 fastapi uvicorn
Run examples from the repository root.
Python API#
From a HuggingFace checkpoint:
from experimental.server import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen3-1.7B")
outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].text)
From existing ONNX or engine directories:
from experimental.server import LLM
llm = LLM(onnx_dir="/path/to/llm_onnx")
llm = LLM(engine_dir="/path/to/llm_engine")
Streaming:
from experimental.server import LLM, SamplingParams
llm = LLM(engine_dir="/path/to/llm_engine")
for delta in llm.generate_stream(
    [{"role": "user", "content": "Tell me a story."}],
    SamplingParams(max_tokens=256),
):
    print(delta.text, end="", flush=True)
OpenAI-Compatible Server#
python -m experimental.server \
--model Qwen/Qwen3-1.7B \
--port 8000
Query:
curl -sN http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128}'
Streaming query:
curl -sN http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128, "stream": true}'
Common Inputs#
LLM requires exactly one source:

| Source | Meaning |
|---|---|
| model | HuggingFace model ID or local checkpoint; export, build, then load |
| onnx_dir | Existing ONNX directory; build then load |
| engine_dir | Existing engine directory; load only |

For VLMs, also pass visual_onnx_dir or visual_engine_dir.
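A minimal VLM sketch, assuming the visual engines live in their own directory next to the language-model engines (paths are placeholders):

from experimental.server import LLM

# Language-model engines plus a separate directory of visual engines (placeholder paths).
llm = LLM(
    engine_dir="/path/to/llm_engine",
    visual_engine_dir="/path/to/visual_engine",
)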
Sampling Parameters#
| Parameter | Default | Description |
|---|---|---|
| temperature | | Sampling temperature |
| top_p | | Nucleus sampling threshold |
| top_k | | Top-K sampling |
| max_tokens | | Maximum generated tokens |
| | | Enables Qwen-style thinking output |
| | | Disables EAGLE for one request |
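For illustration, the standard knobs can be combined in a single SamplingParams. This is a sketch: the names top_p and top_k follow the usual convention, and the values are arbitrary examples.

from experimental.server import SamplingParams

# Illustrative values; llm is an LLM instance from the earlier examples.
params = SamplingParams(
    temperature=0.7,   # sampling temperature
    top_p=0.9,         # nucleus sampling threshold
    top_k=50,          # top-K sampling
    max_tokens=256,    # maximum generated tokens
)
outputs = llm.generate(["Explain transformers in two sentences."], params)
print(outputs[0].text)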
EAGLE#
from experimental.server import LLM, SamplingParams
llm = LLM(
    eagle_engine_dir="/path/to/eagle/engines",
    draft_top_k=10,
    draft_step=6,
    verify_tree_size=60,
)
outputs = llm.generate(
    ["Explain quantum computing."],
    SamplingParams(max_tokens=256),
)
Endpoints#
| Method | Path | Description |
|---|---|---|
| GET | | Health check |
| GET | /v1/models | List models |
| POST | /v1/chat/completions | Chat completions with optional SSE streaming |
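Assuming the standard OpenAI path listed above, the model list can be queried with a plain GET request, for example:

import requests  # illustrative client-side sketch only

print(requests.get("http://localhost:8000/v1/models").json())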
Notes#
Chat templates are applied in the C++ runtime, not in Python.
Thinking output is returned in reasoning; the final answer text is returned in content.
Supported finish reasons are stop, length, cancelled, and error.
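As a sketch of where these fields land in a non-streaming response, assuming the usual OpenAI chat-completion layout with reasoning attached to the message:

import requests  # illustrative client-side sketch only

response = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128},
).json()

choice = response["choices"][0]
print(choice["finish_reason"])                 # stop, length, cancelled, or error
print(choice["message"].get("reasoning", ""))  # thinking output, when enabled
print(choice["message"]["content"])            # final answer text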