Python API Reference#

This section provides documentation for the TensorRT Edge-LLM Python package.

Python workflows use tensorrt_edgellm.quantization for checkpoint quantization, tensorrt_edgellm for ONNX export, and experimental.server for the experimental high-level API and OpenAI-compatible server.

Experimental Server#

vLLM-style inference server for TensorRT Edge-LLM.

Public API:

from experimental.server import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-1.7B")
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].text)

# Or start an OpenAI-compatible server:
llm.serve(port=8000)

class experimental.server.LLM( model: str = '', *, onnx_dir: str = '', visual_onnx_dir: str = '', engine_dir: str = '', visual_engine_dir: str = '', max_input_len: int = 4096, max_batch_size: int = 1, max_kv_cache_capacity: int = 8192, eagle_engine_dir: str = '', draft_top_k: int = 10, draft_step: int = 6, verify_tree_size: int = 60, )[source]#

Bases: object

vLLM-style entry point for TensorRT Edge-LLM inference.

Three initialization modes (exactly one of model, onnx_dir, or engine_dir must be provided):

HuggingFace checkpoint — exports ONNX, builds engine, loads:
```
llm = LLM(model="Qwen/Qwen3-1.7B")
```
ONNX directory — builds engine from ONNX, loads:
```
llm = LLM(onnx_dir="/path/to/onnx")
```

Pre-built engine — loads directly:

llm = LLM(engine_dir="/path/to/engine")
llm = LLM(engine_dir="...", visual_engine_dir="...")

See experimental.server.engine_layout for the expected directory layouts.

__init__( model: str = '', *, onnx_dir: str = '', visual_onnx_dir: str = '', engine_dir: str = '', visual_engine_dir: str = '', max_input_len: int = 4096, max_batch_size: int = 1, max_kv_cache_capacity: int = 8192, eagle_engine_dir: str = '', draft_top_k: int = 10, draft_step: int = 6, verify_tree_size: int = 60, )[source]#

Generate completions for the given prompts.

Parameters:

prompts – A single prompt string, a list of prompt strings, or a list of OpenAI-style message lists.
sampling_params – Sampling configuration. Defaults to SamplingParams().
tools – Optional OpenAI-compatible tool definitions.
tool_choice – Optional OpenAI-compatible tool choice.

Returns:

List of CompletionOutput objects, one per prompt.

chat( messages: List[Dict[str, Any]], sampling_params: SamplingParams | None = None, *, tools: Sequence[Dict[str, Any]] | None = None, tool_choice: str | Dict[str, Any] | None = None, ) → CompletionOutput[source]#

Single-turn chat completion (convenience wrapper).

Parameters:

messages – OpenAI-style message list.
sampling_params – Sampling configuration.
tools – Optional OpenAI-compatible tool definitions.
tool_choice – Optional OpenAI-compatible tool choice.

Returns:

A single CompletionOutput.

generate_stream( messages: List[Dict[str, Any]], sampling_params: SamplingParams | None = None, *, tools: Sequence[Dict[str, Any]] | None = None, tool_choice: str | Dict[str, Any] | None = None, ) → Generator[StreamDelta, None, None][source]#

Stream generation deltas for a single message list.

Runs handleRequest in a background thread with a StreamChannel attached, yielding StreamDelta objects as tokens are produced.

serve(host: str = '0.0.0.0', port: int = 8000) → None[source]#

Start an OpenAI-compatible HTTP server.

Parameters:

host – Bind address.
port – Bind port.

property model_dir: str#: Path to the resolved model checkpoint.

property engine_dir: str#: Path to the TensorRT engine directory.

property has_draft_model: bool#: Whether Eagle speculative decoding is active.

class experimental.server.SamplingParams( temperature: float = 0.7, top_p: float = 0.9, top_k: int = 50, max_tokens: int = 2048, enable_thinking: bool = False, disable_spec_decode: bool = False, num_logprobs: int = 0, stop: ~typing.List[str] = <factory>, logit_bias: ~typing.Dict[int, float] = <factory>, )[source]#

Bases: object

Sampling parameters (mirrors vLLM’s SamplingParams).

temperature: float = 0.7#

top_p: float = 0.9#

top_k: int = 50#

max_tokens: int = 2048#

enable_thinking: bool = False#

disable_spec_decode: bool = False#

num_logprobs: int = 0#

stop: List[str]#

logit_bias: Dict[int, float]#

__init__( temperature: float = 0.7, top_p: float = 0.9, top_k: int = 50, max_tokens: int = 2048, enable_thinking: bool = False, disable_spec_decode: bool = False, num_logprobs: int = 0, stop: ~typing.List[str] = <factory>, logit_bias: ~typing.Dict[int, float] = <factory>, ) → None#

class experimental.server.CompletionOutput( text: str = '', token_ids: ~typing.List[int] = <factory>, finish_reason: str | None = None, logprobs: ~typing.List[~typing.List[~experimental.server.engine.LogprobEntry]] = <factory>, tool_calls: ~typing.List[~typing.Dict[str, ~typing.Any]] = <factory>, reasoning: str | None = None, )[source]#

Bases: object

Output of a single generation request.

text: str = ''#

token_ids: List[int]#

finish_reason: str | None = None#

logprobs: List[List[LogprobEntry]]#

tool_calls: List[Dict[str, Any]]#

reasoning: str | None = None#

__init__( text: str = '', token_ids: ~typing.List[int] = <factory>, finish_reason: str | None = None, logprobs: ~typing.List[~typing.List[~experimental.server.engine.LogprobEntry]] = <factory>, tool_calls: ~typing.List[~typing.Dict[str, ~typing.Any]] = <factory>, reasoning: str | None = None, ) → None#

class experimental.server.StreamDelta( text: str = '', token_ids: ~typing.List[int] = <factory>, finished: bool = False, finish_reason: str | None = None, logprobs: ~typing.List[~typing.List[~experimental.server.engine.LogprobEntry]] = <factory>, )[source]#

Bases: object

Single delta from a streaming generation.

text: str = ''#

token_ids: List[int]#

finished: bool = False#

finish_reason: str | None = None#

logprobs: List[List[LogprobEntry]]#

__init__( text: str = '', token_ids: ~typing.List[int] = <factory>, finished: bool = False, finish_reason: str | None = None, logprobs: ~typing.List[~typing.List[~experimental.server.engine.LogprobEntry]] = <factory>, ) → None#

Python API Reference#

Experimental Server#

Quantization#

Checkpoint Exporter#