Python API Reference#
This section provides documentation for the TensorRT Edge-LLM Python package.
Python workflows use tensorrt_edgellm.quantization for checkpoint
quantization, tensorrt_edgellm for ONNX export, and experimental.server for
the experimental high-level API and OpenAI-compatible server.
Experimental Server#
vLLM-style inference server for TensorRT Edge-LLM.
Public API:
from experimental.server import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen3-1.7B")
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].text)
# Or start an OpenAI-compatible server:
llm.serve(port=8000)
- class experimental.server.LLM(
- model: str = '',
- *,
- onnx_dir: str = '',
- visual_onnx_dir: str = '',
- engine_dir: str = '',
- visual_engine_dir: str = '',
- max_input_len: int = 4096,
- max_batch_size: int = 1,
- max_kv_cache_capacity: int = 8192,
- eagle_engine_dir: str = '',
- draft_top_k: int = 10,
- draft_step: int = 6,
- verify_tree_size: int = 60,
Bases:
objectvLLM-style entry point for TensorRT Edge-LLM inference.
Three initialization modes (exactly one of
model,onnx_dir, orengine_dirmust be provided):HuggingFace checkpoint — exports ONNX, builds engine, loads:
llm = LLM(model="Qwen/Qwen3-1.7B")
ONNX directory — builds engine from ONNX, loads:
llm = LLM(onnx_dir="/path/to/onnx")
Pre-built engine — loads directly:
llm = LLM(engine_dir="/path/to/engine") llm = LLM(engine_dir="...", visual_engine_dir="...")
See
experimental.server.engine_layoutfor the expected directory layouts.- __init__(
- model: str = '',
- *,
- onnx_dir: str = '',
- visual_onnx_dir: str = '',
- engine_dir: str = '',
- visual_engine_dir: str = '',
- max_input_len: int = 4096,
- max_batch_size: int = 1,
- max_kv_cache_capacity: int = 8192,
- eagle_engine_dir: str = '',
- draft_top_k: int = 10,
- draft_step: int = 6,
- verify_tree_size: int = 60,
- generate(
- prompts: str | List[str] | List[List[Dict[str, Any]]],
- sampling_params: SamplingParams | None = None,
- *,
- tools: Sequence[Dict[str, Any]] | None = None,
- tool_choice: str | Dict[str, Any] | None = None,
Generate completions for the given prompts.
- Parameters:
prompts – A single prompt string, a list of prompt strings, or a list of OpenAI-style message lists.
sampling_params – Sampling configuration. Defaults to
SamplingParams().tools – Optional OpenAI-compatible tool definitions.
tool_choice – Optional OpenAI-compatible tool choice.
- Returns:
List of
CompletionOutputobjects, one per prompt.
- chat(
- messages: List[Dict[str, Any]],
- sampling_params: SamplingParams | None = None,
- *,
- tools: Sequence[Dict[str, Any]] | None = None,
- tool_choice: str | Dict[str, Any] | None = None,
Single-turn chat completion (convenience wrapper).
- Parameters:
messages – OpenAI-style message list.
sampling_params – Sampling configuration.
tools – Optional OpenAI-compatible tool definitions.
tool_choice – Optional OpenAI-compatible tool choice.
- Returns:
A single
CompletionOutput.
- generate_stream(
- messages: List[Dict[str, Any]],
- sampling_params: SamplingParams | None = None,
- *,
- tools: Sequence[Dict[str, Any]] | None = None,
- tool_choice: str | Dict[str, Any] | None = None,
Stream generation deltas for a single message list.
Runs
handleRequestin a background thread with aStreamChannelattached, yieldingStreamDeltaobjects as tokens are produced.
- serve(host: str = '0.0.0.0', port: int = 8000) None[source]#
Start an OpenAI-compatible HTTP server.
- Parameters:
host – Bind address.
port – Bind port.
- property model_dir: str#
Path to the resolved model checkpoint.
- property engine_dir: str#
Path to the TensorRT engine directory.
- property has_draft_model: bool#
Whether Eagle speculative decoding is active.
- class experimental.server.SamplingParams(
- temperature: float = 0.7,
- top_p: float = 0.9,
- top_k: int = 50,
- max_tokens: int = 2048,
- enable_thinking: bool = False,
- disable_spec_decode: bool = False,
- stop: List[str] = <factory>,
Bases:
objectSampling parameters (mirrors vLLM’s SamplingParams).
- temperature: float = 0.7#
- top_p: float = 0.9#
- top_k: int = 50#
- max_tokens: int = 2048#
- enable_thinking: bool = False#
- disable_spec_decode: bool = False#
- stop: List[str]#
- __init__(
- temperature: float = 0.7,
- top_p: float = 0.9,
- top_k: int = 50,
- max_tokens: int = 2048,
- enable_thinking: bool = False,
- disable_spec_decode: bool = False,
- stop: List[str] = <factory>,
- class experimental.server.CompletionOutput(
- text: str = '',
- token_ids: List[int] = <factory>,
- finish_reason: str | None = None,
- tool_calls: Dict[str,
- ~typing.Any]]=<factory>,
- reasoning: str | None = None,
Bases:
objectOutput of a single generation request.
- text: str = ''#
- token_ids: List[int]#
- finish_reason: str | None = None#
- tool_calls: List[Dict[str, Any]]#
- reasoning: str | None = None#
- __init__(
- text: str = '',
- token_ids: List[int] = <factory>,
- finish_reason: str | None = None,
- tool_calls: Dict[str,
- ~typing.Any]]=<factory>,
- reasoning: str | None = None,
- class experimental.server.StreamDelta(
- text: str = '',
- token_ids: List[int] = <factory>,
- finished: bool = False,
- finish_reason: str | None = None,
Bases:
objectSingle delta from a streaming generation.
- text: str = ''#
- token_ids: List[int]#
- finished: bool = False#
- finish_reason: str | None = None#
- __init__(
- text: str = '',
- token_ids: List[int] = <factory>,
- finished: bool = False,
- finish_reason: str | None = None,