Python API Reference#

This section provides documentation for the TensorRT Edge-LLM Python package.

The tensorrt_edgellm package provides utilities for quantizing large language models and exporting them to ONNX format for efficient inference on edge devices.

Main Module#

TensorRT Edge-LLM - A Python package for quantizing and exporting LLMs for edge deployment.

This package provides utilities for quantizing large language models using NVIDIA ModelOpt and preparing them for ONNX export and edge deployment. It supports various quantization schemes including FP8, INT4 AWQ, and NVFP4 for efficient inference on edge devices.

Key Features:
  • LLM quantization with calibration support

  • Multiple quantization schemes (FP8, INT4 AWQ, NVFP4)

  • Automatic model type detection

  • HuggingFace model compatibility

  • Quantization configuration management

  • ONNX export for LLM and visual models

  • LoRA pattern insertion and weight processing

Example Usage:
from tensorrt_edgellm import (
    quantize_and_save_llm,
    quantize_and_save_draft,
    export_llm_model,
    export_draft_model,
    visual_export,
    insert_lora_and_save,
    process_lora_weights_and_save,
    reduce_vocab_size
)

# Quantize and save a standard LLM model
quantize_and_save_llm(
    model_dir="path/to/model",
    output_dir="path/to/output",
    quantization="fp8",
    dtype="fp16",
    dataset_dir="cnn_dailymail"
)

# Quantize and save an EAGLE draft model
quantize_and_save_draft(
    base_model_dir="path/to/base_model",
    draft_model_dir="path/to/draft_model",
    output_dir="path/to/output",
    quantization="fp8",
    dtype="fp16",
    dataset_dir="cnn_dailymail"
)

# Export standard LLM to ONNX
export_llm_model(
    model_dir="path/to/model",
    output_dir="path/to/output",
    device="cuda"
)

# Export EAGLE base model to ONNX
export_llm_model(
    model_dir="path/to/model",
    output_dir="path/to/output",
    is_eagle_base=True
)

# Export EAGLE draft model to ONNX
export_draft_model(
    draft_model_dir="path/to/draft_model",
    output_dir="path/to/output",
    base_model_dir="path/to/base_model",
    use_prompt_tuning=False
)

# Export visual model to ONNX
visual_export(
    model_dir="path/to/model",
    output_dir="path/to/output",
    dtype="fp16",
    quantization=None
)

# Insert LoRA patterns into ONNX models
insert_lora_and_save(
    onnx_dir="path/to/onnx_model"
)

# Process LoRA weights
process_lora_weights_and_save(
    input_dir="path/to/adapter",
    output_dir="path/to/output"
)

# Reduce vocabulary (reduce_vocab_size operates on a loaded tokenizer,
# config, and calibration dataset; see its documentation below)
vocab_map = reduce_vocab_size(
    tokenizer=tokenizer,
    config=config,
    dataset=dataset,
    reduced_vocab_size=30000
)
tensorrt_edgellm.quantize_and_save_llm(
model_dir: str,
output_dir: str,
quantization: str | None = None,
dtype: str = 'fp16',
dataset_dir: str = 'cnn_dailymail',
lm_head_quantization: str | None = None,
device: str = 'cuda',
) → None[source]#

Load a model, quantize it if specified, and save the result.

This is the main entry point for quantizing language models. It supports various quantization schemes including FP8, INT4 AWQ, and NVFP4.

Parameters:
  • model_dir – Directory containing the input HuggingFace model

  • output_dir – Directory to save the quantized model

  • quantization – Quantization method to apply (None, “fp8”, “int4_awq”, “nvfp4”, “int8_sq”)

  • dtype – Model data type for loading (“fp16”)

  • dataset_dir – Dataset name or path for calibration data

  • lm_head_quantization – Optional separate quantization for the language model head (only “fp8” and “nvfp4” are currently supported)

  • device – Device to use for model loading and quantization (“cuda”, “cpu”)

Raises:

ValueError – If model loading fails or quantization parameters are invalid
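
For example, a minimal call that also quantizes the language model head (all paths are placeholders):

from tensorrt_edgellm import quantize_and_save_llm

# Quantize the model weights to NVFP4 and the LM head to FP8
quantize_and_save_llm(
    model_dir="path/to/model",
    output_dir="path/to/output",
    quantization="nvfp4",
    dtype="fp16",
    dataset_dir="cnn_dailymail",
    lm_head_quantization="fp8",
    device="cuda"
)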

tensorrt_edgellm.quantize_and_save_draft(
base_model_dir: str,
draft_model_dir: str,
output_dir: str,
quantization: str | None = None,
device: str = 'cuda',
dtype: str = 'fp16',
dataset_dir: str = 'cnn_dailymail',
lm_head_quantization: str | None = None,
) → None[source]#

Load an EAGLE draft model, quantize it if specified, and save the result.

This is the main entry point for quantizing EAGLE draft models. It requires both a base model and draft model directory.

Parameters:
  • base_model_dir – Directory containing the base HuggingFace model

  • draft_model_dir – Directory containing the EAGLE draft model

  • output_dir – Directory to save the quantized model

  • quantization – Quantization method to apply (None, “fp8”, “int4_awq”, “nvfp4”, “int8_sq”)

  • device – Device to use for model loading and quantization (“cuda”, “cpu”)

  • dtype – Model data type for loading (“fp16”)

  • dataset_dir – Dataset name or path for calibration data

  • lm_head_quantization – Optional separate quantization for the language model head (only “fp8” and “nvfp4” are currently supported)

Raises:

ValueError – If model loading fails or quantization parameters are invalid
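
For example, quantizing a draft model with INT4 AWQ while keeping the default calibration dataset (paths are placeholders):

from tensorrt_edgellm import quantize_and_save_draft

# Both the base model and the EAGLE draft model directories are required
quantize_and_save_draft(
    base_model_dir="path/to/base_model",
    draft_model_dir="path/to/draft_model",
    output_dir="path/to/output",
    quantization="int4_awq",
    dtype="fp16",
    dataset_dir="cnn_dailymail"
)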

tensorrt_edgellm.export_draft_model(
draft_model_dir: str,
output_dir: str,
use_prompt_tuning: bool = False,
base_model_dir: str | None = None,
device: str = 'cuda',
) → None[source]#

Export an EAGLE draft model to ONNX format with custom attention plugin.

This is the main entry point for exporting EAGLE draft models to ONNX format. The draft model requires a base model for weight copying.

Parameters:
  • draft_model_dir – Directory containing the EAGLE draft model

  • output_dir – Directory to save the exported ONNX model

  • use_prompt_tuning – Whether the model uses prompt tuning (for VLM models)

  • base_model_dir – Directory containing the base model (for weight copying)

  • device – Device to load the model on (“cpu”, “cuda”, or “cuda:0”, “cuda:1”, etc.)
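
For example, exporting a draft model for a VLM base with prompt tuning enabled (paths are placeholders):

from tensorrt_edgellm import export_draft_model

# base_model_dir supplies the base-model weights copied into the draft export
export_draft_model(
    draft_model_dir="path/to/draft_model",
    output_dir="path/to/output",
    base_model_dir="path/to/base_model",
    use_prompt_tuning=True,
    device="cuda:0"
)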

tensorrt_edgellm.export_llm_model(
model_dir: str,
output_dir: str,
device: str = 'cuda',
is_eagle_base: bool = False,
reduced_vocab_dir: str | None = None,
chat_template_path: str | None = None,
) → None[source]#

Export a language model to ONNX format with custom attention plugin.

This is the main entry point for exporting standard LLM models and EAGLE base models to ONNX format with TensorRT Edge-LLM optimizations.

Parameters:
  • model_dir – Directory containing the HuggingFace model

  • output_dir – Directory to save the exported ONNX model

  • device – Device to load the model on (“cpu”, “cuda”, or “cuda:0”, “cuda:1”, etc.)

  • is_eagle_base – Whether the model is an EAGLE3 base model (vs standard LLM)

  • reduced_vocab_dir – Directory containing vocab_map.safetensors for vocabulary reduction (optional)

  • chat_template_path – Path to chat template JSON file. When provided, this template is validated and used instead of inferring from the model (optional)
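
For example, exporting an EAGLE3 base model with a reduced vocabulary; reduced_vocab_dir must contain vocab_map.safetensors (paths are placeholders):

from tensorrt_edgellm import export_llm_model

# Export an EAGLE3 base model and remap its LM head with the reduced vocabulary
export_llm_model(
    model_dir="path/to/model",
    output_dir="path/to/output",
    device="cuda",
    is_eagle_base=True,
    reduced_vocab_dir="path/to/reduced_vocab"
)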

tensorrt_edgellm.visual_export(
model_dir: str,
output_dir: str,
dtype: str,
quantization: str | None,
dataset_dir: str | None = 'lmms-lab/MMMU',
device: str = 'cuda',
) → str[source]#

Export a visual model using the appropriate wrapper based on the model architecture.

This function loads a multimodal model, extracts its visual component, wraps it in the appropriate model wrapper, applies quantization if requested, and exports it to ONNX format.

Parameters:
  • model_dir – Directory containing the torch model

  • output_dir – Directory to save the exported ONNX model

  • dtype – Data type for export (currently only “fp16” supported)

  • quantization – Quantization type (“fp8” or None)

  • dataset_dir – Dataset name or path for calibration data

  • device – Device to load the model on (default: “cuda”, options: cpu, cuda, cuda:0, cuda:1, etc.)

Returns:

Path to the output directory where the exported model is saved

Return type:

str

Raises:
  • ValueError – If unsupported dtype or quantization is provided

  • ValueError – If unsupported model type is detected
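
For example, exporting the visual component with FP8 quantization; the function returns the output directory path (paths are placeholders):

from tensorrt_edgellm import visual_export

# Extract the visual component, quantize it to FP8, and export it to ONNX
out_dir = visual_export(
    model_dir="path/to/model",
    output_dir="path/to/output",
    dtype="fp16",
    quantization="fp8",
    device="cuda"
)
print(out_dir)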

tensorrt_edgellm.insert_lora_and_save(onnx_dir: str)[source]#

Insert LoRA patterns into ONNX models.

Parameters:
  • onnx_dir (str) – Directory containing the ONNX model (model.onnx and config.json)

  • output_dir (str) – Directory to save the modified ONNX model

  • mode (str) – LoRA insertion mode: ‘dynamic’ (default) or ‘static’

  • lora_weights_dir (str) – Directory containing LoRA weights (required for static mode)
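
A minimal call following the signature above; the ONNX directory must contain model.onnx and config.json:

from tensorrt_edgellm import insert_lora_and_save

# Insert LoRA patterns (dynamic mode by default) into the exported ONNX model
insert_lora_and_save(onnx_dir="path/to/onnx_model")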

tensorrt_edgellm.process_lora_weights_and_save(input_dir: str, output_dir: str)[source]#

Process LoRA adapter weights from input_dir and save the processed files to output_dir.

Parameters:
  • input_dir (str) – Directory containing input adapter files

  • output_dir (str) – Directory where processed files will be saved
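
For example (paths are placeholders):

from tensorrt_edgellm import process_lora_weights_and_save

# Read the adapter files from input_dir and write the processed weights to output_dir
process_lora_weights_and_save(
    input_dir="path/to/adapter",
    output_dir="path/to/output"
)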

tensorrt_edgellm.reduce_vocab_size(
tokenizer: transformers.AutoTokenizer,
config: transformers.AutoConfig,
dataset: datasets.Dataset,
reduced_vocab_size: int,
d2t_tensor: torch.Tensor | None = None,
method: str = 'frequency',
) → torch.Tensor[source]#

Reduce the vocabulary using the selected method.

Parameters:
  • tokenizer – HuggingFace AutoTokenizer instance

  • config – HuggingFace AutoConfig instance

  • dataset – Dataset to analyze for token frequency

  • reduced_vocab_size – Target vocabulary size (must be < config.vocab_size)

  • d2t_tensor – Optional EAGLE d2t tensor for required tokens

  • method – Vocabulary reduction method (‘frequency’ or ‘input_aware’)

Returns:

vocab_map – tensor of shape (reduced_vocab_size,) mapping reduced token IDs to original token IDs (int32)

Return type:

torch.Tensor

Raises:
  • ValueError – If reduced_vocab_size >= config.vocab_size

  • ValueError – If method is not ‘frequency’ or ‘input_aware’
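
Because this function takes already-loaded objects rather than directories, a typical workflow loads the tokenizer, config, and calibration dataset first and then saves the returned map for use with export_llm_model. The sketch below is illustrative; the “vocab_map” key name in the saved safetensors file is an assumption based on the reduced_vocab_dir description above:

from datasets import load_dataset
from safetensors.torch import save_file
from transformers import AutoConfig, AutoTokenizer

from tensorrt_edgellm import reduce_vocab_size

tokenizer = AutoTokenizer.from_pretrained("path/to/model")
config = AutoConfig.from_pretrained("path/to/model")
dataset = load_dataset("cnn_dailymail", "3.0.0", split="train[:1000]")

# Build a frequency-based map from reduced token IDs back to original token IDs
vocab_map = reduce_vocab_size(
    tokenizer=tokenizer,
    config=config,
    dataset=dataset,
    reduced_vocab_size=30000,
    method="frequency"
)

# Save the map so it can be passed to export_llm_model(reduced_vocab_dir=...)
save_file({"vocab_map": vocab_map}, "path/to/reduced_vocab/vocab_map.safetensors")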