Python API Reference#

This section provides documentation for the TensorRT Edge-LLM Python package.

New Python workflows use experimental.quantization for checkpoint quantization, llm_loader for ONNX export, and experimental.server for the experimental high-level API and OpenAI-compatible server.

Experimental Server#

vLLM-style inference server for TensorRT Edge-LLM.

Public API:

from experimental.server import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-1.7B")
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=64))
print(outputs[0].text)

# Or start an OpenAI-compatible server:
llm.serve(port=8000)
class experimental.server.LLM(
model: str = '',
*,
onnx_dir: str = '',
visual_onnx_dir: str = '',
engine_dir: str = '',
visual_engine_dir: str = '',
max_input_len: int = 4096,
max_batch_size: int = 1,
max_kv_cache_capacity: int = 8192,
use_trt_native_ops: bool = False,
eagle_engine_dir: str = '',
draft_top_k: int = 10,
draft_step: int = 6,
verify_tree_size: int = 60,
)[source]#

Bases: object

vLLM-style entry point for TensorRT Edge-LLM inference.

Three initialization modes (exactly one of model, onnx_dir, or engine_dir must be provided):

  1. HuggingFace checkpoint — exports ONNX, builds engine, loads:

    llm = LLM(model="Qwen/Qwen3-1.7B")
    
  2. ONNX directory — builds engine from ONNX, loads:

    llm = LLM(onnx_dir="/path/to/onnx")
    
  3. Pre-built engine — loads directly:

    llm = LLM(engine_dir="/path/to/engine")
    llm = LLM(engine_dir="...", visual_engine_dir="...")
    

See experimental.server.engine_layout for the expected directory layouts.

__init__(
model: str = '',
*,
onnx_dir: str = '',
visual_onnx_dir: str = '',
engine_dir: str = '',
visual_engine_dir: str = '',
max_input_len: int = 4096,
max_batch_size: int = 1,
max_kv_cache_capacity: int = 8192,
use_trt_native_ops: bool = False,
eagle_engine_dir: str = '',
draft_top_k: int = 10,
draft_step: int = 6,
verify_tree_size: int = 60,
)[source]#
generate(
prompts: str | List[str] | List[List[Dict[str, Any]]],
sampling_params: SamplingParams | None = None,
) List[CompletionOutput][source]#

Generate completions for the given prompts.

Parameters:
  • prompts – A single prompt string, a list of prompt strings, or a list of OpenAI-style message lists.

  • sampling_params – Sampling configuration. Defaults to SamplingParams().

Returns:

List of CompletionOutput objects, one per prompt.

chat(
messages: List[Dict[str, Any]],
sampling_params: SamplingParams | None = None,
) CompletionOutput[source]#

Single-turn chat completion (convenience wrapper).

Parameters:
  • messages – OpenAI-style message list.

  • sampling_params – Sampling configuration.

Returns:

A single CompletionOutput.

generate_stream(
messages: List[Dict[str, Any]],
sampling_params: SamplingParams | None = None,
) Generator[StreamDelta, None, None][source]#

Stream generation deltas for a single message list.

Runs handleRequest in a background thread with a StreamChannel attached, yielding StreamDelta objects as tokens are produced.

serve(host: str = '0.0.0.0', port: int = 8000) None[source]#

Start an OpenAI-compatible HTTP server.

Parameters:
  • host – Bind address.

  • port – Bind port.

property model_dir: str#

Path to the resolved model checkpoint.

property engine_dir: str#

Path to the TensorRT engine directory.

property has_draft_model: bool#

Whether Eagle speculative decoding is active.

class experimental.server.SamplingParams(
temperature: float = 0.7,
top_p: float = 0.9,
top_k: int = 50,
max_tokens: int = 2048,
enable_thinking: bool = False,
disable_spec_decode: bool = False,
)[source]#

Bases: object

Sampling parameters (mirrors vLLM’s SamplingParams).

temperature: float = 0.7#
top_p: float = 0.9#
top_k: int = 50#
max_tokens: int = 2048#
enable_thinking: bool = False#
disable_spec_decode: bool = False#
__init__(
temperature: float = 0.7,
top_p: float = 0.9,
top_k: int = 50,
max_tokens: int = 2048,
enable_thinking: bool = False,
disable_spec_decode: bool = False,
) None#
class experimental.server.CompletionOutput(
text: str = '',
token_ids: List[int] = <factory>,
finish_reason: str | None = None,
)[source]#

Bases: object

Output of a single generation request.

text: str = ''#
token_ids: List[int]#
finish_reason: str | None = None#
__init__(
text: str = '',
token_ids: List[int] = <factory>,
finish_reason: str | None = None,
) None#
class experimental.server.StreamDelta(
text: str = '',
token_ids: List[int] = <factory>,
finished: bool = False,
finish_reason: str | None = None,
)[source]#

Bases: object

Single delta from a streaming generation.

text: str = ''#
token_ids: List[int]#
finished: bool = False#
finish_reason: str | None = None#
__init__(
text: str = '',
token_ids: List[int] = <factory>,
finished: bool = False,
finish_reason: str | None = None,
) None#

Experimental Quantization#

Standalone quantization for TensorRT Edge-LLM.

Decoupled from the ONNX exporter — runs in a clean venv with only torch, transformers, and modelopt.

python -m experimental.quantization.cli –help

LLM Loader#

Deprecated Export Package#

The tensorrt_edgellm package contains deprecated Python export utilities that remain available in 0.7.1 for compatibility. New model enablement should target the experimental quantization and llm_loader workflow above.