Experimental High-Level Python API and Server#

The experimental Python API wraps export, engine build, engine loading, generation, streaming, and OpenAI-compatible serving.

Status: Experimental. API may change between releases.

Prerequisites#

Complete the Installation Guide with the C++ runtime, Python bindings, and server dependencies enabled before proceeding. The examples below assume experimental.server and tensorrt_edgellm are importable from the active Python environment.

If the active environment was installed with base export dependencies only, install the server dependencies before building Python bindings or launching the server:

cd /path/to/TensorRT-Edge-LLM
pip install -r requirements-server.txt

Python API#

From a HuggingFace checkpoint:

from experimental.server import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-1.7B")
outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(outputs[0].text)

From existing ONNX or engine directories:

from experimental.server import LLM

llm = LLM(onnx_dir="/path/to/llm_onnx")
llm = LLM(engine_dir="/path/to/llm_engine")

Streaming:

from experimental.server import LLM, SamplingParams

llm = LLM(engine_dir="/path/to/llm_engine")

for delta in llm.generate_stream(
    [{"role": "user", "content": "Tell me a story."}],
    SamplingParams(max_tokens=256),
):
    print(delta.text, end="", flush=True)

OpenAI-Compatible Server#

python -m experimental.server \
  --model Qwen/Qwen3-1.7B \
  --port 8000

Query:

curl -sN http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128}'

Streaming query:

curl -sN http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128, "stream": true}'

Tool-aware query:

curl -sN http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto",
    "max_tokens": 128
  }'

To continue an agentic loop, include the previous assistant tool_calls and the matching tool response messages in the next request.

Tool response follow-up:

{
  "messages": [
    {"role": "user", "content": "What is the weather in Paris?"},
    {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\":\"Paris\"}"
        }
      }]
    },
    {
      "role": "tool",
      "tool_call_id": "call_1",
      "content": "{\"temperature\":22,\"unit\":\"celsius\"}"
    }
  ],
  "tools": [{
    "type": "function",
    "function": {"name": "get_weather", "parameters": {"type": "object"}}
  }]
}

Common Inputs#

LLM requires exactly one source:

Source

Meaning

model

HuggingFace model ID or local checkpoint; export, build, then load

onnx_dir

Existing ONNX directory; build then load

engine_dir

Existing engine directory; load only

For VLMs, also pass visual_onnx_dir or visual_engine_dir.

Sampling Parameters#

Parameter

Default

Description

temperature

0.7

Sampling temperature

top_p

0.9

Nucleus sampling threshold

top_k

50

Top-K sampling

max_tokens

2048

Maximum generated tokens

enable_thinking

False

Enables Qwen-style thinking output

disable_spec_decode

False

Disables EAGLE for one request

Tool Calls#

The OpenAI-compatible server accepts tools, tool_choice, assistant.tool_calls, and tool messages. Tool-aware requests are formatted with the model’s Hugging Face chat template before they are sent to the runtime.

tool_choice supports auto, none, required, and forced function choices. Malformed tools, unknown forced tools, and dangling tool_call_id values return a 400 response.

When the model returns a supported tool-call format, non-streaming responses include message.tool_calls and finish_reason: "tool_calls". Streaming responses include delta.tool_calls chunks.

EAGLE#

from experimental.server import LLM, SamplingParams

llm = LLM(
    eagle_engine_dir="/path/to/eagle/engines",
    draft_top_k=10,
    draft_step=6,
    verify_tree_size=60,
)

outputs = llm.generate(
    ["Explain quantum computing."],
    SamplingParams(max_tokens=256),
)

Endpoints#

Method

Path

Description

GET

/health

Health check

GET

/v1/models

List models

POST

/v1/chat/completions

Chat completions with optional SSE streaming

Notes#

  • Standard chat templates are applied in the C++ runtime. Tool-aware requests are formatted in Python with the model’s Hugging Face chat template.

  • Thinking output is returned in reasoning; final answer text is returned in content.

  • Supported finish reasons are stop, length, cancelled, and error.