Customization Guide
New model and feature enablement should target the checkpoint-based workflow:
tensorrt_edgellm/quantization for checkpoint quantization and tensorrt_edgellm
for ONNX export. This guide describes the main extension points used by that
workflow.
For user commands, see Quantization and Checkpoint Exporter Design.
Export Customization Points

| Area | Files | What to update |
|---|---|---|
| Text model | | Add a native |
| Registration | | Register custom text classes with |
| Checkpoint parsing | | Add model-specific config fields, tensor remapping, fused-weight splitting, or quantization metadata handling. |
| Weight loading | | Load safetensors, remap checkpoint keys, and repack quantized tensors into the layout consumed by the exported graph. |
| Encoder export | | Register visual, audio, or action encoder builders and config extraction rules. |
| Component orchestration | | Add model-type classification for LLM-only, VLM, audio, TTS, Omni, VLA, EAGLE, or MTP checkpoints. |
| Custom ops | | Add torch stubs, ONNX translations, and schemas for TensorRT Edge-LLM custom ops. |
| Runtime/plugins | | Add runtime support only when the exported graph or model I/O needs new behavior. |

Update Supported Models and add focused export tests when adding a new family.
Quantization Customization Points
The standalone quantization package writes unified HuggingFace-style checkpoints. It does not export ONNX or build TensorRT engines.

| Area | Files | What to update |
|---|---|---|
| Quantization recipes | | Add ModelOpt recipe presets, exclusions, or component-specific overrides. |
| Model loading and calibration | | Add model loading fallbacks, calibration data handling, or pre-save checkpoint fixups. |
| EAGLE draft quantization | | Update draft-model calibration and checkpoint writing. |
| CLI surface | | Expose a new supported option after the implementation and tests exist. |

Supported methods are documented in Quantization. GPTQ checkpoints are loaded as pre-quantized checkpoints; this package does not create GPTQ models.
Adding A Text Model

1. Check whether the default CausalLM implementation can load the checkpoint.
2. If the architecture needs custom behavior, add a model implementation under tensorrt_edgellm/models/<family>/.
3. Register the model_type in tensorrt_edgellm/__init__.py.
4. Add any required config promotion, tensor-key remapping, or quantized-weight handling in tensorrt_edgellm/config.py and checkpoint/.
5. Export with tensorrt-edgellm-export <checkpoint> <output_dir> and verify llm_build plus llm_inference.

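
The tensor-key remapping mentioned above can be pictured as an ordered list of rename rules applied to every checkpoint key. The sketch below is a hypothetical illustration in plain Python, not the code under tensorrt_edgellm/: the rule table, the function names `remap_key` and `remap_keys`, and the example key names are all assumptions made for the demonstration.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical rules: (regex matched against the checkpoint key, replacement).
# These patterns are illustrative only and are not taken from TensorRT Edge-LLM.
REMAP_RULES: List[Tuple[str, str]] = [
    (r"^transformer\.h\.(\d+)\.", r"model.layers.\1."),
    (r"\.attn\.c_proj\.", r".self_attn.o_proj."),
    (r"^transformer\.wte\.", r"model.embed_tokens."),
]


def remap_key(key: str) -> str:
    """Apply every rule in order; rules that do not match leave the key unchanged."""
    for pattern, replacement in REMAP_RULES:
        key = re.sub(pattern, replacement, key)
    return key


def remap_keys(state: Dict[str, object]) -> Dict[str, object]:
    """Return a new mapping with remapped keys, refusing silent collisions."""
    remapped: Dict[str, object] = {}
    for old_key, tensor in state.items():
        new_key = remap_key(old_key)
        if new_key in remapped:
            raise ValueError(f"{old_key!r} collides with existing key {new_key!r}")
        remapped[new_key] = tensor
    return remapped


if __name__ == "__main__":
    # Placeholder strings stand in for real tensors.
    checkpoint = {
        "transformer.wte.weight": "embedding tensor",
        "transformer.h.0.attn.c_proj.weight": "projection tensor",
    }
    for name in remap_keys(checkpoint):
        print(name)
```

Raising on a collision instead of overwriting is a deliberate choice in this sketch: two source keys that land on the same target name usually mean a rule is too broad, and a silent overwrite would drop a weight.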
Adding A Multimodal Or Action Component
Use tensorrt_edgellm/scripts/export.py as the component dispatcher.
Each exported component should have a stable subdirectory and a config.json
that the C++ builder can consume.

| Component | Typical output | Builder |
|---|---|---|
| LLM thinker/base | | |
| Visual encoder | | |
| Audio encoder | | |
| TTS code predictor | | |
| Omni Code2Wav | | |
| Alpamayo action expert | | |
| MTP draft | | |

Keep preprocessing and runtime sidecars explicit. For example, Alpamayo adds
trajectory tokens during runtime artifact writing and requires the action
engine’s max_kv_cache_capacity to match the LLM engine build.
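
A capacity constraint like the one above can be checked before deployment. The sketch below is hypothetical: it assumes each engine directory contains a config.json with a top-level `max_kv_cache_capacity` field, which this guide does not confirm, so adjust the file name and key path to wherever your build records the value. The helper names `read_capacity` and `capacities_match` are also invented for the example.

```python
import json
import tempfile
from pathlib import Path

# Assumed top-level field name; the real location may differ.
CAPACITY_KEY = "max_kv_cache_capacity"


def read_capacity(engine_dir: Path) -> int:
    """Read the assumed capacity field from <engine_dir>/config.json."""
    config_path = engine_dir / "config.json"
    with config_path.open(encoding="utf-8") as handle:
        config = json.load(handle)
    if CAPACITY_KEY not in config:
        raise KeyError(f"{config_path} has no {CAPACITY_KEY!r} field")
    return int(config[CAPACITY_KEY])


def capacities_match(llm_dir: Path, action_dir: Path) -> bool:
    """True when both directories report the same capacity."""
    return read_capacity(llm_dir) == read_capacity(action_dir)


if __name__ == "__main__":
    # Build two throwaway directories so the sketch runs without real engines.
    with tempfile.TemporaryDirectory() as root:
        llm_dir = Path(root) / "llm"
        action_dir = Path(root) / "action"
        for directory, capacity in ((llm_dir, 4096), (action_dir, 4096)):
            directory.mkdir()
            payload = json.dumps({CAPACITY_KEY: capacity})
            (directory / "config.json").write_text(payload, encoding="utf-8")
        print(capacities_match(llm_dir, action_dir))
```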
Custom Operators
Custom operators are declared as torch.library.custom_op stubs in
tensorrt_edgellm/models/ops.py, translated in
tensorrt_edgellm/onnx/dynamo_translations.py, and registered as ONNX
schemas in tensorrt_edgellm/onnx/onnx_custom_schemas.py.
Use this path when a PyTorch expression must lower to a TensorRT Edge-LLM plugin, specialized runtime op, or fixed ONNX node pattern. Runtime support must exist before documenting a new custom op as supported.
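
Because each custom op is described in three places (torch stub, ONNX translation, ONNX schema), a name present in one file and missing from another is an easy mistake. The sketch below shows one way to catch that with a plain set comparison. It is a hypothetical helper, not part of TensorRT Edge-LLM: the function name `find_incomplete_ops` and the op names are invented, and how the names are collected from ops.py, dynamo_translations.py, and onnx_custom_schemas.py is left open.

```python
from typing import Dict, Iterable, Set


def find_incomplete_ops(
    stubs: Iterable[str],
    translations: Iterable[str],
    schemas: Iterable[str],
) -> Dict[str, Set[str]]:
    """Map each op name to the registration sites where it is missing."""
    sites = {
        "torch stub": set(stubs),
        "ONNX translation": set(translations),
        "ONNX schema": set(schemas),
    }
    all_ops = set().union(*sites.values())
    incomplete: Dict[str, Set[str]] = {}
    for op in sorted(all_ops):
        absent = {site for site, names in sites.items() if op not in names}
        if absent:
            incomplete[op] = absent
    return incomplete


if __name__ == "__main__":
    # Invented op names; "demo::rope" has no ONNX translation in this example.
    report = find_incomplete_ops(
        stubs=["demo::attention", "demo::rope"],
        translations=["demo::attention"],
        schemas=["demo::attention", "demo::rope"],
    )
    for op, missing in report.items():
        print(op, "is missing:", ", ".join(sorted(missing)))
```

An empty result means every op name appears in all three sets. It says nothing about whether the runtime side exists, which still has to be verified separately.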
Runtime Customization
The C++ runtime still owns engine build, tokenization, sampling, multimodal preprocessing, LoRA adapter loading, EAGLE/MTP execution, and action inference. Common extension points include:

| Area | Files |
|---|---|
| LLM engine profiles | |
| Visual/audio/action builders | |
| Multimodal runners | |
| Action runners | |
| Tokenization | |
| Sampling | |
| Runtime orchestration | |

Prefer adding runtime behavior only after the exported ONNX contract and sidecar files are stable.
Minimal Runtime Request
Use the 0.7 runtime request structure when embedding TensorRT Edge-LLM in a C++
application. Requests are batched as request.requests, each request contains
chat-template messages, and generated text is returned in response.outputTexts.
For text-only engines, pass an empty multimodalEngineDir; set it to the
visual or audio engine directory for multimodal workflows.

#include "runtime/llmInferenceRuntime.h"
#include "runtime/llmRuntimeUtils.h"

#include <cuda_runtime_api.h>

#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>

int main()
{
    // Create the CUDA stream used for construction and for the request.
    cudaStream_t stream{};
    if (cudaStreamCreate(&stream) != cudaSuccess)
    {
        std::cerr << "Failed to create CUDA stream" << std::endl;
        return 1;
    }

    std::string engineDir = "/path/to/engine";
    std::string multimodalEngineDir = ""; // empty string = LLM-only, no multimodal engines
    std::unordered_map<std::string, std::string> loraWeightsMap{}; // no LoRA adapters

    trt_edgellm::rt::LLMInferenceRuntime runtime(
        engineDir, multimodalEngineDir, loraWeightsMap, stream);

    // One batch entry; sampling parameters apply to the whole request.
    trt_edgellm::rt::LLMGenerationRequest request;
    request.requests.resize(1);
    request.temperature = 1.0F;
    request.topK = 50;
    request.topP = 0.8F;
    request.maxGenerateLength = 100;

    // Build a single chat-template message with one text content item.
    trt_edgellm::rt::Message userMsg;
    userMsg.role = "user";
    userMsg.contents.push_back(
        trt_edgellm::rt::Message::MessageContent{
            "text", "What is the capital of France?"});
    request.requests[0].messages.push_back(std::move(userMsg));

    // Generated text comes back in response.outputTexts, one entry per batch item.
    trt_edgellm::rt::LLMGenerationResponse response;
    if (runtime.handleRequest(request, response, stream)
        && !response.outputTexts.empty())
    {
        std::cout << "Generated: " << response.outputTexts[0] << std::endl;
    }

    cudaStreamDestroy(stream);
    return 0;
}