Python API Reference#
This section provides documentation for the TensorRT Edge-LLM Python package.
New Python workflows use experimental.quantization for checkpoint
quantization, llm_loader for ONNX export, and experimental.server for
the experimental high-level API and OpenAI-compatible server.
Experimental Server#
vLLM-style inference server for TensorRT Edge-LLM.
Public API:
from experimental.server import LLM, SamplingParams
llm = LLM(model="Qwen/Qwen3-1.7B")
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].text)
# Or start an OpenAI-compatible server:
llm.serve(port=8000)
- class experimental.server.LLM(
- model: str = '',
- *,
- onnx_dir: str = '',
- visual_onnx_dir: str = '',
- engine_dir: str = '',
- visual_engine_dir: str = '',
- max_input_len: int = 4096,
- max_batch_size: int = 1,
- max_kv_cache_capacity: int = 8192,
- use_trt_native_ops: bool = False,
- eagle_engine_dir: str = '',
- draft_top_k: int = 10,
- draft_step: int = 6,
- verify_tree_size: int = 60,
Bases:
objectvLLM-style entry point for TensorRT Edge-LLM inference.
Three initialization modes (exactly one of
model,onnx_dir, orengine_dirmust be provided):HuggingFace checkpoint — exports ONNX, builds engine, loads:
llm = LLM(model="Qwen/Qwen3-1.7B")
ONNX directory — builds engine from ONNX, loads:
llm = LLM(onnx_dir="/path/to/onnx")
Pre-built engine — loads directly:
llm = LLM(engine_dir="/path/to/engine") llm = LLM(engine_dir="...", visual_engine_dir="...")
See
experimental.server.engine_layoutfor the expected directory layouts.- __init__(
- model: str = '',
- *,
- onnx_dir: str = '',
- visual_onnx_dir: str = '',
- engine_dir: str = '',
- visual_engine_dir: str = '',
- max_input_len: int = 4096,
- max_batch_size: int = 1,
- max_kv_cache_capacity: int = 8192,
- use_trt_native_ops: bool = False,
- eagle_engine_dir: str = '',
- draft_top_k: int = 10,
- draft_step: int = 6,
- verify_tree_size: int = 60,
- generate(
- prompts: str | List[str] | List[List[Dict[str, Any]]],
- sampling_params: SamplingParams | None = None,
Generate completions for the given prompts.
- Parameters:
prompts – A single prompt string, a list of prompt strings, or a list of OpenAI-style message lists.
sampling_params – Sampling configuration. Defaults to
SamplingParams().
- Returns:
List of
CompletionOutputobjects, one per prompt.
- chat(
- messages: List[Dict[str, Any]],
- sampling_params: SamplingParams | None = None,
Single-turn chat completion (convenience wrapper).
- Parameters:
messages – OpenAI-style message list.
sampling_params – Sampling configuration.
- Returns:
A single
CompletionOutput.
- generate_stream(
- messages: List[Dict[str, Any]],
- sampling_params: SamplingParams | None = None,
Stream generation deltas for a single message list.
Runs
handleRequestin a background thread with aStreamChannelattached, yieldingStreamDeltaobjects as tokens are produced.
- serve(host: str = '0.0.0.0', port: int = 8000) None[source]#
Start an OpenAI-compatible HTTP server.
- Parameters:
host – Bind address.
port – Bind port.
- property model_dir: str#
Path to the resolved model checkpoint.
- property engine_dir: str#
Path to the TensorRT engine directory.
- property has_draft_model: bool#
Whether Eagle speculative decoding is active.
- class experimental.server.SamplingParams(
- temperature: float = 0.7,
- top_p: float = 0.9,
- top_k: int = 50,
- max_tokens: int = 2048,
- enable_thinking: bool = False,
- disable_spec_decode: bool = False,
Bases:
objectSampling parameters (mirrors vLLM’s SamplingParams).
- temperature: float = 0.7#
- top_p: float = 0.9#
- top_k: int = 50#
- max_tokens: int = 2048#
- enable_thinking: bool = False#
- disable_spec_decode: bool = False#
- __init__(
- temperature: float = 0.7,
- top_p: float = 0.9,
- top_k: int = 50,
- max_tokens: int = 2048,
- enable_thinking: bool = False,
- disable_spec_decode: bool = False,
- class experimental.server.CompletionOutput(
- text: str = '',
- token_ids: List[int] = <factory>,
- finish_reason: str | None = None,
Bases:
objectOutput of a single generation request.
- text: str = ''#
- token_ids: List[int]#
- finish_reason: str | None = None#
- __init__(
- text: str = '',
- token_ids: List[int] = <factory>,
- finish_reason: str | None = None,
- class experimental.server.StreamDelta(
- text: str = '',
- token_ids: List[int] = <factory>,
- finished: bool = False,
- finish_reason: str | None = None,
Bases:
objectSingle delta from a streaming generation.
- text: str = ''#
- token_ids: List[int]#
- finished: bool = False#
- finish_reason: str | None = None#
- __init__(
- text: str = '',
- token_ids: List[int] = <factory>,
- finished: bool = False,
- finish_reason: str | None = None,
Experimental Quantization#
Standalone quantization for TensorRT Edge-LLM.
Decoupled from the ONNX exporter — runs in a clean venv with only torch, transformers, and modelopt.
python -m experimental.quantization.cli –help
LLM Loader#
Deprecated Export Package#
The tensorrt_edgellm package contains deprecated Python export utilities that
remain available in 0.7.1 for compatibility. New model enablement should target
the experimental quantization and llm_loader workflow above.