Experimental High-Level Python API and Server#

The experimental Python API wraps model export, engine building, engine loading, generation, streaming, and OpenAI-compatible serving.

Status: Experimental. API may change between releases.

Prerequisites#

Build the C++ runtime and pybind extension first:

mkdir -p build && cd build
cmake .. -DTRT_PACKAGE_DIR=$TRT_PACKAGE_DIR -DBUILD_PYTHON_BINDINGS=ON
make -j$(nproc)

Install server dependencies:

pip install pybind11 fastapi uvicorn

Run examples from the repository root.

Python API#

From a HuggingFace checkpoint:

from experimental.server import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-1.7B")
outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].text)

From existing ONNX or engine directories:

from experimental.server import LLM

llm = LLM(onnx_dir="/path/to/llm_onnx")      # build engines from the ONNX export, then load
llm = LLM(engine_dir="/path/to/llm_engine")  # load prebuilt engines directly

Streaming:

from experimental.server import LLM, SamplingParams

llm = LLM(engine_dir="/path/to/llm_engine")

for delta in llm.generate_stream(
    [{"role": "user", "content": "Tell me a story."}],
    SamplingParams(max_tokens=256),
):
    print(delta.text, end="", flush=True)

OpenAI-Compatible Server#

python -m experimental.server \
  --model Qwen/Qwen3-1.7B \
  --port 8000

Query:

curl -sN http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128}'

Streaming query:

curl -sN http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128, "stream": true}'

Common Inputs#

LLM requires exactly one source:

Source        Meaning
model         HuggingFace model ID or local checkpoint; export, build, then load
onnx_dir      Existing ONNX directory; build then load
engine_dir    Existing engine directory; load only

For VLMs, also pass visual_onnx_dir or visual_engine_dir.
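
For example, a VLM served from prebuilt engines might combine the two (a minimal sketch; both paths are placeholders):

from experimental.server import LLM

# Language-model engines plus a separate visual-encoder engine directory.
llm = LLM(
    engine_dir="/path/to/llm_engine",
    visual_engine_dir="/path/to/visual_engine",
)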

Sampling Parameters#

Parameter            Default  Description
temperature          0.7      Sampling temperature
top_p                0.9      Nucleus sampling threshold
top_k                50       Top-K sampling
max_tokens           2048     Maximum generated tokens
enable_thinking      False    Enables Qwen-style thinking output
disable_spec_decode  False    Disables EAGLE for one request
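
All parameters are optional; a sketch that sets each one explicitly (every value except max_tokens is the documented default):

from experimental.server import SamplingParams

params = SamplingParams(
    temperature=0.7,            # default
    top_p=0.9,                  # default
    top_k=50,                   # default
    max_tokens=256,             # shorter than the 2048 default
    enable_thinking=False,      # Qwen-style thinking output off
    disable_spec_decode=False,  # keep EAGLE enabled for this request
)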

EAGLE#

from experimental.server import LLM, SamplingParams

llm = LLM(
    eagle_engine_dir="/path/to/eagle/engines",
    draft_top_k=10,
    draft_step=6,
    verify_tree_size=60,
)

outputs = llm.generate(
    ["Explain quantum computing."],
    SamplingParams(max_tokens=256),
)

Endpoints#

Method  Path                  Description
GET     /health               Health check
GET     /v1/models            List models
POST    /v1/chat/completions  Chat completions with optional SSE streaming
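
A quick check of the first two endpoints (a sketch with the requests library, assuming the default port 8000):

import requests

base = "http://localhost:8000"
print(requests.get(f"{base}/health").status_code)  # expect 200 when the server is ready
print(requests.get(f"{base}/v1/models").json())    # OpenAI-style model listing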

Notes#

  • Chat templates are applied in the C++ runtime, not in Python.

  • Thinking output is returned in reasoning; final answer text is returned in content (see the sketch after this list).

  • Supported finish reasons are stop, length, cancelled, and error.
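
A client can read both fields from a chat-completions response. The sketch below assumes the server accepts enable_thinking in the request body, mirroring the SamplingParams field; treat that request field as hypothetical.

import requests

# Sketch: read the thinking trace and the final answer out of one response.
# "enable_thinking" as a request-body field is an assumption mirroring SamplingParams.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "max_tokens": 256,
        "enable_thinking": True,
    },
    timeout=120,
)
message = resp.json()["choices"][0]["message"]
print("reasoning:", message.get("reasoning", ""))  # thinking output
print("content:", message["content"])              # final answer text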