Experimental High-Level Python API and Server
The experimental Python API wraps export, engine build, engine loading, generation, streaming, and OpenAI-compatible serving.
Status: Experimental. API may change between releases.
Prerequisites
Complete the Installation Guide with the C++ runtime, Python bindings, and server dependencies enabled before proceeding. The examples below assume experimental.server and tensorrt_edgellm are importable from the active Python environment.
If the active environment was installed with base export dependencies only, install the server dependencies before building Python bindings or launching the server:
cd /path/to/TensorRT-Edge-LLM
pip install -r requirements-server.txt
Python API
From a Hugging Face checkpoint:
from experimental.server import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen3-1.7B")
outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].text)
From existing ONNX or engine directories:
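from experimental.server import LLM
# Option 1: build an engine from an existing ONNX directory, then load it.
llm = LLM(onnx_dir="/path/to/llm_onnx")
# Option 2: load an existing engine directory directly.
llm = LLM(engine_dir="/path/to/llm_engine")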
Streaming:
from experimental.server import LLM, SamplingParams
llm = LLM(engine_dir="/path/to/llm_engine")
for delta in llm.generate_stream(
    [{"role": "user", "content": "Tell me a story."}],
    SamplingParams(max_tokens=256),
):
    print(delta.text, end="", flush=True)
OpenAI-Compatible Server
python -m experimental.server \
  --model Qwen/Qwen3-1.7B \
  --port 8000
Query:
curl -sN http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128}'
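The same request can be issued from Python with only the standard library. The sketch below is an illustration, not part of the experimental API: `build_chat_request` and `chat` are hypothetical helper names, and the base URL matches the `--port 8000` launch command above.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, max_tokens=128):
    """Build a POST request for /v1/chat/completions (hypothetical helper)."""
    body = json.dumps({"messages": messages, "max_tokens": max_tokens})
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, messages, max_tokens=128):
    """Send the request to a running server and return the parsed JSON body."""
    request = build_chat_request(base_url, messages, max_tokens)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `chat("http://localhost:8000", [{"role": "user", "content": "Hello!"}])` returns the decoded response. If the server follows the usual OpenAI response layout (an assumption here), the reply text would be under `choices[0].message.content`.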
Streaming query:
curl -sN http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128, "stream": true}'
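A client consuming the streamed response has to split the SSE lines and join the text fragments. The sketch below assumes the server follows the usual OpenAI streaming framing, which this page does not spell out: one `data: <json>` line per chunk, text under `choices[0].delta.content`, and a final `data: [DONE]` sentinel. `iter_sse_text` is a hypothetical helper name.

```python
import json


def iter_sse_text(lines):
    """Yield text fragments from OpenAI-style SSE lines (assumed framing)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines standing in for a real response body.
sample = [
    b'data: {"choices": [{"delta": {"content": "Hel"}}]}\n',
    b"\n",
    b'data: {"choices": [{"delta": {"content": "lo!"}}]}\n',
    b"data: [DONE]\n",
]
print("".join(iter_sse_text(sample)))  # prints "Hello!"
```

The response object returned by `urllib.request.urlopen` iterates line by line, so it can be passed to `iter_sse_text` directly in place of `sample`.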
Tool-aware query:
curl -sN http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 128
  }'
To continue an agentic loop, include the previous assistant tool_calls and
the matching tool response messages in the next request.
Tool response follow-up:
{
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\":\"Paris\"}"
        }
      }]
    },
    {
      "role": "tool",
      "tool_call_id": "call_1",
      "content": "{\"temperature\":22,\"unit\":\"celsius\"}"
    }
  ],
  "tools": [{
    "type": "function",
    "function": {"name": "get_weather", "parameters": {"type": "object"}}
  }]
}
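On the client side, the follow-up message list can be assembled mechanically from the previous assistant message and the local tool results. The sketch below is one way to do this; `extend_with_tool_results` is a hypothetical helper, and the assistant message is copied by hand from the example above rather than taken from a live response.

```python
import json


def extend_with_tool_results(messages, assistant_message, results):
    """Return a new message list with the assistant turn and tool replies appended.

    `results` maps each tool call id to a JSON-serializable result.
    """
    follow_up = list(messages)  # leave the caller's history untouched
    follow_up.append(assistant_message)
    for call in assistant_message["tool_calls"]:
        follow_up.append({
            "role": "tool",
            "tool_call_id": call["id"],  # must match the assistant tool call
            "content": json.dumps(results[call["id"]]),
        })
    return follow_up


history = [{"role": "user", "content": "What is the weather in Paris?"}]
assistant = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\":\"Paris\"}"},
    }],
}
next_messages = extend_with_tool_results(
    history, assistant, {"call_1": {"temperature": 22, "unit": "celsius"}}
)
print([m["role"] for m in next_messages])  # prints ['user', 'assistant', 'tool']
```

Sending `next_messages` together with the same `tools` list produces a request body shaped like the follow-up shown above.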
Common Inputs
LLM requires exactly one source:
| Source | Meaning |
|---|---|
| model | Hugging Face model ID or local checkpoint; export, build, then load |
| onnx_dir | Existing ONNX directory; build, then load |
| engine_dir | Existing engine directory; load only |
For VLMs, also pass visual_onnx_dir or visual_engine_dir.
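The exactly-one-source rule can be checked in calling code before an LLM is constructed, for example when the source comes from command-line flags or a config file. The sketch below is a hypothetical helper (`pick_source` is not part of the API) and does not describe how LLM validates its arguments internally.

```python
def pick_source(model=None, onnx_dir=None, engine_dir=None):
    """Return (name, value) for the single source given, else raise ValueError."""
    candidates = (("model", model), ("onnx_dir", onnx_dir), ("engine_dir", engine_dir))
    given = {name: value for name, value in candidates if value is not None}
    if len(given) != 1:
        raise ValueError(f"expected exactly one source, got {sorted(given) or 'none'}")
    return next(iter(given.items()))


print(pick_source(engine_dir="/path/to/llm_engine"))
# prints ('engine_dir', '/path/to/llm_engine')
```

The returned pair can be forwarded as a keyword argument, for example `LLM(**dict([pick_source(...)]))`.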
Sampling Parameters
| Parameter | Default | Description |
|---|---|---|
| temperature | | Sampling temperature |
| | | Nucleus sampling threshold |
| | | Top-K sampling |
| max_tokens | | Maximum generated tokens |
| | | Enables Qwen-style thinking output |
| | | Disables EAGLE for one request |
Tool Calls
The OpenAI-compatible server accepts tools, tool_choice,
assistant.tool_calls, and tool messages. Tool-aware requests are formatted
with the model’s Hugging Face chat template before they are sent to the runtime.
tool_choice supports auto, none, required, and forced function choices.
Malformed tools, unknown forced tools, and dangling tool_call_id values return
a 400 response.
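A client can catch these three error classes before sending a request. The sketch below is a client-side approximation, not the server's validation logic: `find_request_problems` is a hypothetical helper, and it assumes the OpenAI-style object form `{"type": "function", "function": {"name": ...}}` for a forced function choice.

```python
def find_request_problems(body):
    """Return a list of problems the server would likely reject (client-side sketch)."""
    problems = []
    tool_names = set()
    for tool in body.get("tools", []):
        function = tool.get("function") if isinstance(tool, dict) else None
        if (
            not isinstance(function, dict)
            or tool.get("type") != "function"
            or not function.get("name")
        ):
            problems.append("malformed tool")
        else:
            tool_names.add(function["name"])

    choice = body.get("tool_choice")
    if isinstance(choice, dict):  # assumed shape of a forced function choice
        forced = (choice.get("function") or {}).get("name")
        if forced not in tool_names:
            problems.append(f"unknown forced tool: {forced}")

    seen_call_ids = set()
    for message in body.get("messages", []):
        if message.get("role") == "assistant":
            for call in message.get("tool_calls") or []:
                seen_call_ids.add(call.get("id"))
        elif message.get("role") == "tool":
            call_id = message.get("tool_call_id")
            if call_id not in seen_call_ids:
                problems.append(f"dangling tool_call_id: {call_id}")
    return problems


bad = {
    "messages": [{"role": "tool", "tool_call_id": "call_9", "content": "{}"}],
    "tools": [{"type": "function", "function": {"name": "get_weather"}}],
    "tool_choice": {"type": "function", "function": {"name": "get_time"}},
}
print(find_request_problems(bad))
# prints ['unknown forced tool: get_time', 'dangling tool_call_id: call_9']
```

An empty list means none of these three checks fired; the server may still reject the request for other reasons.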
When the model returns a supported tool-call format, non-streaming responses
include message.tool_calls and finish_reason: "tool_calls". Streaming
responses include delta.tool_calls chunks.
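Streamed tool calls normally arrive in pieces that the client has to reassemble. The sketch below assumes the OpenAI chunk convention, which this page does not confirm for this server: each `delta.tool_calls` entry carries an `index`, the `id` and function `name` appear once, and `arguments` arrives as string fragments to concatenate. `merge_tool_call_deltas` is a hypothetical helper.

```python
import json


def merge_tool_call_deltas(deltas):
    """Combine streamed delta objects into complete tool calls (assumed chunk layout)."""
    calls = {}
    for delta in deltas:
        for part in delta.get("tool_calls") or []:
            call = calls.setdefault(
                part.get("index", 0), {"id": None, "name": "", "arguments": ""}
            )
            if part.get("id"):
                call["id"] = part["id"]
            function = part.get("function") or {}
            call["name"] += function.get("name") or ""
            call["arguments"] += function.get("arguments") or ""
    return [calls[index] for index in sorted(calls)]


# Hand-written deltas standing in for parsed streaming chunks.
deltas = [
    {"tool_calls": [{"index": 0, "id": "call_1",
                     "function": {"name": "get_weather", "arguments": ""}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": "{\"city\":"}}]},
    {"tool_calls": [{"index": 0, "function": {"arguments": "\"Paris\"}"}}]},
]
merged = merge_tool_call_deltas(deltas)
print(merged[0]["name"], json.loads(merged[0]["arguments"]))
# prints get_weather {'city': 'Paris'}
```

The arguments string is parsed only after the stream ends, because an individual fragment is generally not valid JSON on its own.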
EAGLE
from experimental.server import LLM, SamplingParams
llm = LLM(
    eagle_engine_dir="/path/to/eagle/engines",
    draft_top_k=10,
    draft_step=6,
    verify_tree_size=60,
)
outputs = llm.generate(
    ["Explain quantum computing."],
    SamplingParams(max_tokens=256),
)
Endpoints
| Method | Path | Description |
|---|---|---|
| | | Health check |
| | | List models |
| POST | /v1/chat/completions | Chat completions with optional SSE streaming |
Notes
Standard chat templates are applied in the C++ runtime. Tool-aware requests are formatted in Python with the model’s Hugging Face chat template.
Thinking output is returned in reasoning; final answer text is returned in content.
Supported finish reasons are stop, length, cancelled, and error.