Customization Guide#
Customization Architecture#
TensorRT Edge-LLM follows a clear data flow from models through to inference, with customization points at each layer:
```mermaid
%%{init: {'theme':'neutral', 'themeVariables': {'primaryColor':'#76B900','primaryTextColor':'#fff','primaryBorderColor':'#5a8f00','lineColor':'#666','edgeLabelBackground':'#ffffff','labelTextColor':'#000','clusterBkg':'#ffffff','clusterBorder':'#999'}}}%%
graph TB
    subgraph MODELS ["Models"]
        MODEL_ARCH[Model Architecture]
        MODEL_WEIGHTS[Model Weights]
    end
    subgraph EXPORT ["Python Export Pipeline"]
        QUANT[Quantization]
        ONNX_EXPORT[ONNX Export]
        CUSTOM_OPERATORS[Custom Operators]
    end
    subgraph ENGINE ["Engine Builder"]
        BUILD_CONFIG[Build Configuration]
        TRT_PLUGINS[TensorRT Plugins]
    end
    subgraph RUNTIME ["C++ Runtime"]
        TOKENIZATION[Text Processing]
        SAMPLING_CONFIG[Sampling Parameters]
        RUNTIME_CONFIG[Runtime Configuration]
        MULTIMODAL_RUNNER[Multimodal Runner]
    end
    subgraph APP ["Application"]
        EXAMPLES[Examples]
    end
    MODEL_ARCH ~~~ MODEL_WEIGHTS
    QUANT ~~~ ONNX_EXPORT
    ONNX_EXPORT ~~~ CUSTOM_OPERATORS
    BUILD_CONFIG ~~~ TRT_PLUGINS
    TOKENIZATION ~~~ SAMPLING_CONFIG
    SAMPLING_CONFIG ~~~ RUNTIME_CONFIG
    RUNTIME_CONFIG ~~~ MULTIMODAL_RUNNER
    MODELS --> EXPORT
    EXPORT --> ENGINE
    ENGINE --> RUNTIME
    RUNTIME --> APP
    classDef nvNode fill:#76B900,stroke:#5a8f00,stroke-width:1px,color:#fff
    classDef layerBox fill:none,stroke:#76B900,stroke-width:2px
    class MODEL_ARCH,MODEL_WEIGHTS,QUANT,ONNX_EXPORT,CUSTOM_OPERATORS,BUILD_CONFIG,TRT_PLUGINS,TOKENIZATION,SAMPLING_CONFIG,RUNTIME_CONFIG,MULTIMODAL_RUNNER,EXAMPLES nvNode
    class MODELS,EXPORT,ENGINE,RUNTIME,APP layerBox
```
Customization Points by Layer#
| Layer | Component | Customization Options |
|---|---|---|
| Models | Model Architecture | Configuration: Model config JSON files. Inherit/Adapt: Extend `EdgeLLMAttention` and `EdgeLLMDecoderLayer` for new architectures |
| Models | Model Weights | Drop-in: Load fine-tuned HuggingFace models. Configuration: Specify model directory path |
| Python Export | Quantization Strategy | Configuration: Choose FP16/FP8/INT4_AWQ/NVFP4/INT8_SQ, calibration settings. Inherit/Adapt: Customize the quantization config dictionary and calibration dataset |
| Python Export | ONNX Export Logic | Inherit/Adapt: Extend `create_dummy_inputs()` and `export_model_to_onnx()` for new architectures |
| Python Export | Custom Operators | Registration: Custom operators via `torch.library.custom_op` |
| Engine Builder | Build Configuration | Configuration: Batch size, sequence length, precision, LoRA rank, EAGLE settings, VLM mode, image tokens. Inherit/Adapt: Set up optimization profiles for custom models |
| Engine Builder | Custom Operations | Plugin: Implement TensorRT plugins for custom operations |
| C++ Runtime | Text Processing | Configuration: Load different tokenizer vocab files. Inherit/Adapt: Extend `PreTokenizer`/`TokenEncoder` for new schemes |
| C++ Runtime | Sampling Parameters | Configuration: Temperature, top-k, top-p values in input JSON. Inherit/Adapt: Extend the sampling functions in `sampling.cu` |
| C++ Runtime | Multimodal Runner | Inherit/Adapt: Subclass `MultimodalRunner` for new VLMs |
| C++ Runtime | Runtime Behavior | Configuration: EAGLE parameters, warmup iterations, CUDA graphs, system prompt caching, profiling |
| Application | Custom Applications | Use examples as templates (see Examples) |
Customization Methods Summary#
| Method | Description | Best For |
|---|---|---|
| Inherit/Adapt | Extend base classes with custom implementations | Model architectures, multimodal runner, `PreTokenizer` |
| Configuration | Modify behavior via config files and parameters | Model configs, build settings, runtime parameters, LoRA |
| Drop-in Replacement | Substitute entire components with the same interface | Fine-tuned model weights |
| Registration | Register custom components in existing systems | ONNX custom operators, graph patterns |
| Plugin | Implement TensorRT plugins for custom operations | Custom TensorRT layers, hardware-specific ops |
Layer 1: Models#
1. Model Architecture Customization#
The architecture layer provides base classes that wrap HuggingFace models and adapt them for TensorRT Edge-LLM’s optimized inference pipeline.
- `EdgeLLMModel`: The core model class that wraps a HuggingFace model and replaces its decoder layers with optimized variants.
- `EdgeLLMDecoderLayer`: Optimized decoder layer that wraps HuggingFace attention and MLP modules.
- `EdgeLLMAttention`: Multi-headed attention using the custom attention plugin.
TensorRT Edge-LLM automatically detects and supports Llama-style and Qwen-style architectures:
- Llama-style models: Use `LlamaAttention` and `LlamaMLP`
- Qwen-style models: Use `Qwen2Attention` and `Qwen2MLP`
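Architecture detection keys off the HuggingFace config's `model_type` field. A quick way to check which path a checkpoint will take is a minimal sketch using the standard `transformers` API (the model name below is only an example):

```python
from transformers import AutoConfig

# model_type selects between the Llama-style and Qwen-style wrapper paths.
config = AutoConfig.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print(config.model_type)  # e.g. "llama" or "qwen2"
```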
For other architectures, you may need to extend `EdgeLLMAttention` to wrap your attention mechanism and modify `EdgeLLMDecoderLayer.__init__()` to handle your model type:

```python
if "your_model" in config.model_type:
    attention_module = YourModelAttention(config, index)
    self.mlp = YourModelMLP(config)
    self.self_attn = EdgeLLMAttention(attention_module, ...)
```
2. Model Weights Customization#
TensorRT Edge-LLM supports drop-in replacement of model weights. Any fine-tuned HuggingFace model can be used.
Compatible Fine-Tuning Methods:

- Full fine-tuning
- PEFT adapters (merged; see the sketch below)
- GPTQ pre-quantized models
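For example, a PEFT adapter can be merged into its base model and saved as a plain HuggingFace checkpoint that the export pipeline consumes like any other. A minimal sketch using the standard `transformers` and `peft` APIs (the paths are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model and apply a LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("path/to/base-model", torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")

# Fold the adapter weights into the base weights, then save a plain
# HuggingFace checkpoint that can be exported like any other model.
merged = model.merge_and_unload()
merged.save_pretrained("path/to/merged-model")
```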
Layer 2: Python Export Pipeline#
1. Quantization Strategy Customization#
Quantization Configuration Customization#
The quantization system is built on NVIDIA’s Model Optimizer and provides multiple quantization strategies.
Available methods (defined in `tensorrt_edgellm/quantization/llm_quantization.py`):
| Method | Config | Use Case | Platform |
|---|---|---|---|
| FP16 | N/A (default) | Baseline accuracy | All |
| FP8 | `fp8` | Balanced speed/accuracy | SM89+ (Ada Lovelace+) |
| INT4 AWQ | `int4_awq` | Maximum compression | All |
| INT8 SmoothQuant | `int8_sq` | Activation quantization | All |
| NVFP4 | `nvfp4` | NVIDIA 4-bit FP | SM100+ (Blackwell+) |
Custom Quantization Configuration:
You can customize quantization behavior by modifying the config dictionary. For details about the config dictionary, refer to Quantization Formats.
```python
from tensorrt_edgellm.quantization.llm_quantization import get_llm_quant_config

# Get base config
quant_cfg = get_llm_quant_config("fp8", lm_head_quantization=None)

# Customize: Exclude specific layers from quantization
quant_cfg["quant_cfg"]["*embed_tokens*"] = {"enable": False}
quant_cfg["quant_cfg"]["*lm_head*"] = {"enable": False}

# Customize: Change calibration settings
quant_cfg["algorithm"] = "max"  # Use max calibration instead of entropy
```
Calibration Data Customization#
The calibration process uses the `cnn_dailymail` dataset by default. You can substitute your own calibration dataset:
```python
import torch
from torch.utils.data import DataLoader, Dataset


class CustomCalibDataset(Dataset):
    """Custom calibration dataset for domain-specific quantization."""

    def __init__(self, tokenizer, texts, max_length=512):
        self.tokenizer = tokenizer
        self.texts = texts
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        encoded = self.tokenizer(
            self.texts[idx],
            max_length=self.max_length,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )
        return encoded["input_ids"].squeeze(0)


# Load domain-specific texts (e.g., medical, legal)
domain_texts = load_domain_specific_corpus()
calib_dataset = CustomCalibDataset(tokenizer, domain_texts)
calib_dataloader = DataLoader(calib_dataset, batch_size=16, shuffle=False)

# Use custom calibration data for quantization
quantized_model = quantize_model(model, quant_cfg, calib_dataloader)
```
2. ONNX Export Logic Customization#
Main export functions (defined in `tensorrt_edgellm/onnx_export/llm_export.py`):

- `export_llm_model()`: Export standard LLM or EAGLE base models
- `visual_export()`: Export visual encoders for VLMs
- `export_draft_model()`: Export EAGLE3 draft models
LLM Model Export#
The `export_llm_model()` function handles LLM and EAGLE base model export. To add support for new model architectures, you may need to customize:

- Inputs: Modify `create_dummy_inputs()` to add model-specific inputs (e.g., `deepstack_visual_embeds` for Qwen3-VL)
- Dynamic Axes: Update `export_model_to_onnx()` to define dynamic dimensions for new inputs/outputs (sketched below)
- Input/Output Names: Ensure ONNX input and output names align with the C++ runtime expectations
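For instance, each new input's variable dimensions must be named in the exporter's dynamic-axes mapping. A minimal sketch of the underlying `torch.onnx.export` mechanism (the tensor names, axis labels, and the `model`/`example_inputs` placeholders are illustrative, not the exact ones used by `export_model_to_onnx()`):

```python
import torch

# Map each input/output name to its variable-size axes. The names must
# match what the C++ runtime binds at inference time.
dynamic_axes = {
    "input_ids": {0: "batch_size", 1: "seq_len"},
    "deepstack_visual_embeds": {1: "num_image_tokens"},
    "logits": {0: "batch_size"},
}

torch.onnx.export(
    model,              # the wrapped decoder model (placeholder)
    example_inputs,     # dummy inputs, e.g. from create_dummy_inputs() (placeholder)
    "model.onnx",
    input_names=["input_ids", "deepstack_visual_embeds"],
    output_names=["logits"],
    dynamic_axes=dynamic_axes,
)
```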
Multimodal Model Export#
The `visual_export()` function exports the visual encoder of VLMs to ONNX. Supported models are defined in `visual_models`.
HuggingFace visual models often contain ONNX-incompatible logic (flash attention, shape-dependent initialization, complex post-processing). To add a new multimodal encoder, you may need to extend the original class with a wrapper that:
- Externalizes shape-dependent computations: Pass precomputed tensors (e.g., `rotary_pos_emb`, `attention_mask`) as explicit inputs
- Replaces unsupported operations: Use standard PyTorch ops instead of flash attention (see the sketch after this list)
- Delegates complex preprocessing/postprocessing to the C++ runtime: Handle operations like dynamic indexing at runtime
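As an illustration of the second point, a flash-attention block can be replaced with the same math written in exporter-friendly ops. A minimal, self-contained sketch (the class and parameter names are hypothetical; `F.scaled_dot_product_attention` is a standard PyTorch op that the ONNX exporter can decompose):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExportFriendlyAttention(nn.Module):
    """Hypothetical drop-in for a flash-attention block: same math,
    expressed with standard ops and an explicit, precomputed mask."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # The mask arrives as an explicit input rather than being derived
        # from data-dependent sequence lengths inside the graph.
        b, n, c = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)  # each: [b, heads, n, head_dim]
        out = F.scaled_dot_product_attention(q, k, v, attn_mask=attention_mask)
        out = out.transpose(1, 2).reshape(b, n, c)
        return self.proj(out)
```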
See the existing `visual_models` implementations for reference.
3. Custom Operators#
Custom operators extend PyTorch with operations that map to TensorRT plugins. TensorRT Edge-LLM includes several built-in custom operators:
- Attention Plugin (`tensorrt_edgellm/llm_models/layers/attention_plugin.py`)
- INT4 GEMM Plugin (`tensorrt_edgellm/llm_models/layers/int4_gemm_plugin.py`)
- GatherND Plugin (`tensorrt_edgellm/llm_models/layers/gather_nd.py`)
You can define and register your own custom operators following the existing examples:

```python
from typing import Tuple

import torch


@torch.library.custom_op("trt::attention_plugin", mutates_args=())
def attention_plugin(
    qkv: torch.Tensor,
    past_key_value: torch.Tensor,
    context_lengths: torch.Tensor,
    rope_rotary_cos_sin: torch.Tensor,
    num_q_heads: int,
    num_kv_heads: int,
    # ... more parameters ...
) -> Tuple[torch.Tensor, torch.Tensor]:
    """Optimized attention with KV caching and RoPE."""
    pass


# Registration
from tensorrt_edgellm.llm_models.layers.attention_plugin import \
    register_attention_plugin_onnx_symbolic_functions

register_attention_plugin_onnx_symbolic_functions()
```
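When defining your own operator, you will typically also register a fake (meta) implementation so shape and dtype propagation works under `torch.export`/dynamo-based tracing. A minimal sketch using standard `torch.library` APIs (the operator name and math are placeholders, not part of TensorRT Edge-LLM):

```python
import torch


@torch.library.custom_op("mylib::fused_rmsnorm", mutates_args=())
def fused_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # Eager reference implementation, used when running outside TensorRT.
    return torch.nn.functional.rms_norm(x, (x.shape[-1],), weight, eps)


@fused_rmsnorm.register_fake
def _(x, weight, eps):
    # Only shapes/dtypes matter here; this lets the exporter trace through
    # the op without executing it.
    return torch.empty_like(x)
```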
Layer 3: Engine Builder#
1. LLM Builder Customization#
The `LLMBuilder` class builds TensorRT engines from ONNX models. The CLI tool `llm_build` provides a command-line interface to the builder.
The builder automatically creates two optimization profiles for different inference phases:
| Profile | Phase | Input Shape | Purpose |
|---|---|---|---|
| 0 | Context (Prefill) | `[batch_size, seq_len]` (variable `seq_len`) | Process initial prompts with variable length |
| 1 | Generation (Decode) | `[batch_size, 1]` | Autoregressive token generation |
To support new model architectures, you need to extend the optimization profile setup methods in `builder.cpp` (a sketch of the pattern follows the list):

- `setupCommonProfiles()`: Context lengths, RoPE embeddings, KV cache (shared by all models)
- `setupVanillaProfiles()`: Standard LLM `input_ids` and `last_token_ids`
- `setupEagleProfiles()`: EAGLE-specific hidden states, attention mask, position IDs
- `setupVLMProfiles()`: Image embeddings for VLMs
- `setupLoraProfiles()`: LoRA weight matrices with dynamic rank
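For reference, these methods follow the standard TensorRT C++ pattern for optimization profiles. A minimal sketch (the tensor name, shapes, and helper signature are illustrative, not the actual code in `builder.cpp`):

```cpp
#include "NvInfer.h"

// Illustrative helper: registers min/opt/max shapes for one dynamic input.
void setupMyModelProfile(nvinfer1::IBuilder& builder,
                         nvinfer1::IBuilderConfig& config,
                         int maxBatchSize, int maxSeqLen)
{
    nvinfer1::IOptimizationProfile* profile = builder.createOptimizationProfile();

    // Context phase: input_ids is [batch, seq_len] with variable seq_len.
    profile->setDimensions("input_ids", nvinfer1::OptProfileSelector::kMIN,
                           nvinfer1::Dims2{1, 1});
    profile->setDimensions("input_ids", nvinfer1::OptProfileSelector::kOPT,
                           nvinfer1::Dims2{1, maxSeqLen / 2});
    profile->setDimensions("input_ids", nvinfer1::OptProfileSelector::kMAX,
                           nvinfer1::Dims2{maxBatchSize, maxSeqLen});

    config.addOptimizationProfile(profile);
}
```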
2. Multimodal Builder Customization#
The `VisualBuilder` class builds TensorRT engines for visual encoders. The CLI tool `visual_build` provides a command-line interface to the builder.
To support new visual encoders, you have two options:
1. Extend `VisualBuilder`: Add a new profile setup method (e.g., `setupYourViTProfile()`) in `builder.cpp`. Define optimization profiles with input shapes calculated from `minImageTokens` and `maxImageTokens`.
2. Use `trtexec`: Build the engine directly with TensorRT's command-line tool for simpler models.
Layer 4: C++ Runtime#
1. Text Processing and Tokenization#
The `Tokenizer` class provides text encoding/decoding with HuggingFace-compatible loading. It uses a modular architecture:

- `PreTokenizer` (`preTokenizer.h`): Regex-based text splitting before encoding
- `TokenEncoder` (`tokenEncoder.h`): BPE encoding algorithm with vocabulary management
The tokenizer automatically loads from the engine directory (using `tokenizer.json` and `tokenizer_config.json`). To support new tokenization schemes, extend `TokenEncoder` with additional algorithms (e.g., SentencePiece, WordPiece).
2. Sampling Parameters#
The `SamplingParams` structure controls token generation randomness. Supported parameters:

| Parameter | Range | Effect |
|---|---|---|
| `temperature` | 0.0 - 2.0 | 0.0: Deterministic (greedy). 1.0: Standard sampling. >1.0: More random/creative |
| `topK` | 0 - vocab_size | 0: Disabled. 1: Greedy. 50: Sample from top 50 tokens |
| `topP` | 0.0 - 1.0 | 1.0: Disabled. 0.9: Nucleus sampling (top 90% probability mass) |
To add more sampling parameters (e.g., `repetition_penalty`, `logits_bias`) or custom sampling algorithms (e.g., beam search), extend the sampling functions in `sampling.cu`.
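For example, a repetition penalty could be applied to the logits before top-k/top-p selection. A minimal CUDA sketch of the standard formulation (the kernel name and buffer layout are hypothetical, and it assumes `prevTokens` holds deduplicated token IDs):

```cpp
// Dampen the logits of tokens that already appeared in the output, so the
// model is less likely to repeat them (penalty > 1.0 discourages repeats).
__global__ void applyRepetitionPenalty(float* logits, int const* prevTokens,
                                       int numPrevTokens, float penalty)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= numPrevTokens)
    {
        return;
    }

    int token = prevTokens[idx];
    float logit = logits[token];
    // Divide positive logits, multiply negative ones, per the usual scheme.
    logits[token] = logit > 0.f ? logit / penalty : logit * penalty;
}
```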
3. Multimodal Runner#
The `MultimodalRunner` base class provides the interface for visual encoder processing in VLMs. The factory method `MultimodalRunner::create()` automatically instantiates the appropriate runner based on model type.
Existing Runners:
- `QwenViTRunner`: Qwen2-VL, Qwen2.5-VL, Qwen3-VL
- `InternViTRunner`: InternVL3
- `Phi4MMViTRunner`: Phi-4-multimodal
To add support for a new VLM, create a new runner class that inherits from `MultimodalRunner` and implements:

- `validateAndFillConfig()`: Parse model config and extract dimensions
- `allocateBuffer()`: Allocate input/output tensors and bind to the TensorRT engine
- `preprocess()`: Handle image loading, preprocessing (resize, normalize), and RoPE computation
- `infer()`: Execute the visual encoder TensorRT engine

Then register the new runner in the `MultimodalRunner::create()` factory method. A skeleton of this pattern follows.
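A hypothetical skeleton (the method names follow the list above, but the parameters and return types are simplified and may not match the actual base-class interface exactly):

```cpp
#include <cuda_runtime.h>
#include <string>

// Hypothetical runner for a new visual encoder.
class MyViTRunner : public MultimodalRunner
{
public:
    explicit MyViTRunner(std::string const& visualEngineDir);

    bool validateAndFillConfig() override;         // parse config, extract dimensions
    bool allocateBuffer() override;                // allocate and bind I/O tensors
    bool preprocess(cudaStream_t stream) override; // resize/normalize, RoPE (image args omitted)
    bool infer(cudaStream_t stream) override;      // run the visual encoder engine
};
```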
4. Runtime Features#
The `LLMInferenceRuntime` provides high-level inference APIs with several optimization features:

- CUDA Graph Capture: Reduces kernel launch overhead by capturing and replaying execution sequences via `captureDecodingCUDAGraph()`
- System Prompt Caching: Caches KV states for frequently used system prompts via `genAndSaveSystemPromptKVCache()` to reduce first-token latency
- LoRA Switching: Dynamically switches between LoRA adapters at runtime without an engine rebuild
- EAGLE3 Speculative Decoding: Accelerates generation using a draft-then-verify approach via `llmInferenceSpecDecodeRuntime` with tree-based draft proposal and verification
To extend runtime capabilities, modify `LLMEngineRunner` for core execution logic or `LLMInferenceRuntime` for high-level request handling.
Layer 5: Application#
The `llm_inference` example provides a reference implementation for building custom applications.
Building Custom Applications#
Use the example as a template and adapt for your use case:
- Custom Input/Output Format: Parse your own request format and convert to `LLMGenerationRequest`
- Custom Multimedia Input: The example reads images from files; adapt for your source (video frames, network streams, memory buffers, camera feeds)
```cpp
// my_custom_app.cpp
#include "runtime/llmInferenceRuntime.h"
#include "tokenizer/tokenizer.h"
#include "common/trtUtils.h"

#include <iostream>

int main(int argc, char** argv)
{
    // 1. Parse command line arguments
    // ... argument parsing ...

    // 2. Initialize CUDA
    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    // 3. Load plugin library
    auto pluginHandles = trt_edgellm::loadEdgellmPluginLib();

    // 4. Create runtime
    auto runtime = trt_edgellm::rt::LLMInferenceRuntime::create(
        engineDir,
        multimodalEngineDir, // Empty string for text-only
        loraWeightsMap,
        stream);

    // 5. Optional: Warmup and CUDA graph capture
    if (warmupIterations > 0)
    {
        // Run warmup requests
        runtime->captureDecodingCUDAGraph(stream);
    }

    // 6. Process requests
    for (auto const& input : inputs)
    {
        trt_edgellm::rt::LLMGenerationRequest request;
        trt_edgellm::rt::LLMGenerationResponse response;

        // Fill request from input
        request.userPrompt = input.prompt;
        request.systemPrompt = input.systemPrompt;
        request.temperature = input.temperature;
        request.topK = input.topK;
        request.topP = input.topP;
        request.maxGenerateLength = input.maxLength;

        // Handle request
        if (!runtime->handleRequest(request, response, stream))
        {
            LOG_ERROR("Failed to handle request");
            continue;
        }

        // Process response
        std::cout << "Generated: " << response.generatedText << std::endl;
    }

    // 7. Cleanup
    CUDA_CHECK(cudaStreamDestroy(stream));
    return 0;
}
```
Best Practices#
**1. Error Handling:**

- Check return values from all API calls
- Use try-catch for exception-prone operations
- Log errors with context for debugging

**2. Resource Management:**

- Create a CUDA stream per thread or request
- Properly destroy streams and free GPU memory
- Use RAII patterns for automatic cleanup (see the sketch below)
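As an illustration of the RAII point, a stream handle can be tied to object lifetime so cleanup happens automatically. A minimal sketch (not part of the TensorRT Edge-LLM API):

```cpp
#include <cuda_runtime.h>
#include <stdexcept>

// Owns a CUDA stream: created on construction, destroyed when the wrapper
// goes out of scope, even if an exception unwinds the stack.
class CudaStream
{
public:
    CudaStream()
    {
        if (cudaStreamCreate(&mStream) != cudaSuccess)
        {
            throw std::runtime_error("cudaStreamCreate failed");
        }
    }
    ~CudaStream()
    {
        cudaStreamDestroy(mStream); // Destructors must not throw.
    }
    CudaStream(CudaStream const&) = delete;            // Non-copyable:
    CudaStream& operator=(CudaStream const&) = delete; // the handle has one owner.

    cudaStream_t get() const { return mStream; }

private:
    cudaStream_t mStream{};
};
```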
**3. Performance Optimization:**

- Warm up before benchmark measurements
- Capture CUDA graphs after warmup
- Cache system prompts when applicable
- Use appropriate batch sizes for your hardware

**4. Testing:**

- Test with various input lengths
- Validate output quality against a baseline
- Profile to identify bottlenecks
- Test edge cases (empty inputs, very long sequences)

**5. Deployment:**

- Package required files (engine, tokenizer, plugins)
- Set environment variables (`EDGELLM_PLUGIN_PATH`)
- Document dependencies and system requirements
- Provide example inputs and expected outputs