Experimental High-Level Python API and Server#
The experimental Python API wraps export, engine build, engine loading, generation, streaming, and OpenAI-compatible serving.
Status: Experimental. API may change between releases.
Prerequisites#
Complete the Installation Guide with the C++ runtime, Python bindings, and server dependencies enabled before proceeding. The examples below assume experimental.server and tensorrt_edgellm are importable from the active Python environment.
If the active environment was installed with base export dependencies only, install the server dependencies before building Python bindings or launching the server:
cd /path/to/TensorRT-Edge-LLM
pip install -r requirements-server.txt
Python API#
From a HuggingFace checkpoint:
from experimental.server import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen3-1.7B")
outputs = llm.generate(
["What is the capital of France?"],
SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].text)
From existing ONNX or engine directories:
from experimental.server import LLM
llm = LLM(onnx_dir="/path/to/llm_onnx")
llm = LLM(engine_dir="/path/to/llm_engine")
Streaming:
from experimental.server import LLM, SamplingParams
llm = LLM(engine_dir="/path/to/llm_engine")
for delta in llm.generate_stream(
[{"role": "user", "content": "Tell me a story."}],
SamplingParams(max_tokens=256),
):
print(delta.text, end="", flush=True)
OpenAI-Compatible Server#
python -m experimental.server \
--model Qwen/Qwen3-1.7B \
--port 8000
Query:
curl -sN http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128}'
Streaming query:
curl -sN http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128, "stream": true}'
Tool-aware query:
curl -sN http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [{"role": "user", "content": "What is the weather in Paris?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get the current weather for a city.",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
}],
"tool_choice": "auto",
"max_tokens": 128
}'
To continue an agentic loop, include the previous assistant tool_calls and
the matching tool response messages in the next request.
Tool response follow-up:
{
"messages": [
{"role": "user", "content": "What is the weather in Paris?"},
{
"role": "assistant",
"content": null,
"tool_calls": [{
"id": "call_1",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"city\":\"Paris\"}"
}
}]
},
{
"role": "tool",
"tool_call_id": "call_1",
"content": "{\"temperature\":22,\"unit\":\"celsius\"}"
}
],
"tools": [{
"type": "function",
"function": {"name": "get_weather", "parameters": {"type": "object"}}
}]
}
Common Inputs#
LLM requires exactly one source:
Source |
Meaning |
|---|---|
|
HuggingFace model ID or local checkpoint; export, build, then load |
|
Existing ONNX directory; build then load |
|
Existing engine directory; load only |
For VLMs and models that support audio (Qwen3-Omni, Qwen3-ASR, Nemotron-Omni), also pass visual_onnx_dir or visual_engine_dir.
Audio Input#
The server accepts three OpenAI-compatible content forms inside user messages for models that support audio (Qwen3-Omni, Qwen3-ASR, Nemotron-Omni):
{"type": "input_audio", "input_audio": {"data": "<base64>", "format": "wav"}}
{"type": "audio_url", "audio_url": {"url": "file:///abs/path | data:audio/...;base64,..."}}
{"type": "audio", "audio": "<local path>"}
http(s):// URLs are rejected by design — host the audio locally and use
file://, or inline the bytes as base64 via input_audio. Supported
containers: .wav, .mp3, .flac. The server decodes the container
in-process via vendored miniaudio and the audio runner extracts the
mel-spectrogram in C++ (no HF transformers feature extractor or Python
preprocessing step is required). The model-appropriate feature extractor
is selected automatically from the engine’s audio/config.json::model_type:
whisper for Qwen3-Omni / Qwen3-ASR, parakeet for Nemotron-Omni.
Example (base64-inline):
B64=$(base64 -w0 sample.wav)
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d "{\"model\":\"local\",\"messages\":[{\"role\":\"user\",\"content\":[ \
{\"type\":\"input_audio\",\"input_audio\":{\"data\":\"$B64\",\"format\":\"wav\"}}, \
{\"type\":\"text\",\"text\":\"Transcribe.\"}]}],\"max_tokens\":128}"
Sampling Parameters#
Parameter |
Default |
Description |
|---|---|---|
|
|
Sampling temperature |
|
|
Nucleus sampling threshold |
|
|
Top-K sampling |
|
|
Maximum generated tokens |
|
|
Enables Qwen-style thinking output |
|
|
Disables EAGLE for one request |
Tool Calls#
The OpenAI-compatible server accepts tools, tool_choice,
assistant.tool_calls, and tool messages. Tool-aware requests are formatted
with the model’s Hugging Face chat template before they are sent to the runtime.
tool_choice supports auto, none, required, and forced function choices.
Malformed tools, unknown forced tools, and dangling tool_call_id values return
a 400 response.
When the model returns a supported tool-call format, non-streaming responses
include message.tool_calls and finish_reason: "tool_calls". Streaming
responses include delta.tool_calls chunks.
EAGLE#
from experimental.server import LLM, SamplingParams
llm = LLM(
engine_dir="/path/to/base/engine",
eagle_engine_dir="/path/to/eagle/engines",
draft_top_k=10,
draft_step=6,
verify_tree_size=60,
)
outputs = llm.generate(
["Explain quantum computing."],
SamplingParams(max_tokens=256),
)
Endpoints#
Method |
Path |
Description |
|---|---|---|
|
|
Health check |
|
|
List models |
|
|
Chat completions with optional SSE streaming |
Notes#
Standard chat templates are applied in the C++ runtime. Tool-aware requests are formatted in Python with the model’s Hugging Face chat template.
Thinking output is returned in
reasoning; final answer text is returned incontent.Supported finish reasons are
stop,length,cancelled, anderror.