Customization Guide
New model and feature enablement should target the checkpoint-based workflow:
experimental/quantization for checkpoint quantization and experimental/llm_loader
for ONNX export. This guide describes the main extension points used by that
workflow.
For user commands, see Quantization and Checkpoint-Based Model Loader Design.
The deprecated tensorrt_edgellm/ export package remains available in 0.7.0 and 0.7.1 for
compatibility, but new model and feature work should target the experimental
workflow. The deprecated export package is scheduled for removal in 0.8.0 after
full feature parity is available in experimental/quantization and
experimental/llm_loader.
Export Customization Points
| Area | Files | What to update |
|---|---|---|
| Text model | | Add a native |
| Registration | | Register custom text classes with |
| Checkpoint parsing | | Add model-specific config fields, tensor remapping, fused-weight splitting, or quantization metadata handling. |
| Weight loading | | Load safetensors, remap checkpoint keys, and repack quantized tensors into the layout consumed by the exported graph. |
| Encoder export | | Register visual, audio, or action encoder builders and config extraction rules. |
| Component orchestration | | Add model-type classification for LLM-only, VLM, audio, TTS, Omni, VLA, EAGLE, or MTP checkpoints. |
| Custom ops | | Add torch stubs, ONNX translations, and schemas for TensorRT Edge-LLM custom ops. |
| Runtime/plugins | | Add runtime support only when the exported graph or model I/O needs new behavior. |

Update Supported Models and add focused export tests when adding a new family.
Quantization Customization Points
The standalone quantization package writes unified HuggingFace-style checkpoints. It does not export ONNX or build TensorRT engines.
| Area | Files | What to update |
|---|---|---|
| Quantization recipes | | Add ModelOpt recipe presets, exclusions, or component-specific overrides. |
| Model loading and calibration | | Add model loading fallbacks, calibration data handling, or pre-save checkpoint fixups. |
| EAGLE draft quantization | | Update draft-model calibration and checkpoint writing. |
| CLI surface | | Expose a new supported option after the implementation and tests exist. |

Supported methods are documented in Quantization. GPTQ checkpoints are loaded as pre-quantized checkpoints; this package does not create GPTQ models.
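
As a rough illustration of treating a checkpoint as pre-quantized, the sketch below reads the `quantization_config.quant_method` field that Hugging Face style GPTQ checkpoints commonly carry in `config.json`. The helper names are hypothetical, and the way `experimental/llm_loader` actually detects quantization metadata may differ.

```python
import json
from pathlib import Path


def quant_method_from_config(config: dict):
    """Return the declared quantization method, or None if none is declared."""
    quant_config = config.get("quantization_config")
    if not isinstance(quant_config, dict):
        return None
    return quant_config.get("quant_method")


def is_prequantized_gptq(checkpoint_dir) -> bool:
    """Hypothetical helper: True when config.json declares a GPTQ checkpoint."""
    config_path = Path(checkpoint_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return quant_method_from_config(config) == "gptq"
```

A check like this only reports what the checkpoint declares; it does not validate the packed weight layout.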
Adding A Text Model
1. Check whether the default CausalLM implementation can load the checkpoint.
2. If the architecture needs custom behavior, add a model implementation under experimental/llm_loader/models/<family>/.
3. Register the model_type in experimental/llm_loader/__init__.py.
4. Add any required config promotion, tensor-key remapping, or quantized-weight handling in experimental/llm_loader/config.py and checkpoint/.
5. Export with python -m llm_loader.export_all_cli <checkpoint> <output_dir> and verify llm_build plus llm_inference.
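
The tensor-key remapping step can be pictured with a small, self-contained sketch. The rename rules, key names, and function names below are hypothetical and are not taken from `experimental/llm_loader`; they only show the general shape of a rule-based remap with a collision check.

```python
import re

# Hypothetical rename rules: (compiled pattern, replacement).
# Real rules depend on the checkpoint family being added.
REMAP_RULES = [
    (re.compile(r"^transformer\.h\.(\d+)\."), r"model.layers.\1."),
    (re.compile(r"^transformer\.wte\."), "model.embed_tokens."),
]


def remap_key(key: str) -> str:
    """Apply the first matching rule to `key`; return it unchanged otherwise."""
    for pattern, replacement in REMAP_RULES:
        new_key, count = pattern.subn(replacement, key, count=1)
        if count:
            return new_key
    return key


def remap_checkpoint_keys(tensors: dict) -> dict:
    """Return a new mapping with remapped keys, rejecting collisions."""
    remapped = {}
    for key, value in tensors.items():
        new_key = remap_key(key)
        if new_key in remapped:
            raise ValueError(f"key collision after remapping: {new_key}")
        remapped[new_key] = value
    return remapped
```

Failing loudly on collisions is a deliberate choice here: two source tensors silently landing on one target key would otherwise surface much later as a wrong-output bug.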
Adding A Multimodal Or Action Component
Use experimental/llm_loader/export_all_cli.py as the component dispatcher.
Each exported component should have a stable subdirectory and a config.json
that the C++ builder can consume.
| Component | Typical output | Builder |
|---|---|---|
| LLM thinker/base | | |
| Visual encoder | | |
| Audio encoder | | |
| TTS code predictor | | |
| Omni Code2Wav | | |
| Alpamayo action expert | | |
| MTP draft | | |

Keep preprocessing and runtime sidecars explicit. For example, Alpamayo adds
trajectory tokens during runtime artifact writing and requires the action
engine’s max_kv_cache_capacity to match the LLM engine build.
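
The requirement that each exported component has its own subdirectory with a `config.json` can be checked before handing the output to the C++ builder. The sketch below is a minimal example under two assumptions: every subdirectory of the export output is a component, and a parseable `config.json` is the only thing being verified. The function name is hypothetical.

```python
import json
from pathlib import Path


def find_invalid_components(output_dir) -> dict:
    """Return {subdirectory name: reason} for components without a usable config.json."""
    problems = {}
    for sub in sorted(Path(output_dir).iterdir()):
        if not sub.is_dir():
            continue  # ignore loose files next to the component directories
        config_path = sub / "config.json"
        if not config_path.is_file():
            problems[sub.name] = "missing config.json"
            continue
        try:
            json.loads(config_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems[sub.name] = f"invalid JSON: {exc.msg}"
    return problems
```

An empty result means every component directory has a parseable `config.json`; it says nothing about whether the fields inside match what a given builder expects.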
Custom Operators
Custom operators are declared as torch.library.custom_op stubs in
experimental/llm_loader/models/ops.py, translated in
experimental/llm_loader/onnx/dynamo_translations.py, and registered as ONNX
schemas in experimental/llm_loader/onnx/onnx_custom_schemas.py.
Use this path when a PyTorch expression must lower to a TensorRT Edge-LLM plugin, specialized runtime op, or fixed ONNX node pattern. Runtime support must exist before documenting a new custom op as supported.
Runtime Customization
The C++ runtime still owns engine build, tokenization, sampling, multimodal preprocessing, LoRA adapter loading, EAGLE/MTP execution, and action inference. Common extension points include:
| Area | Files |
|---|---|
| LLM engine profiles | |
| Visual/audio/action builders | |
| Multimodal runners | |
| Action runners | |
| Tokenization | |
| Sampling | |
| Runtime orchestration | |

Prefer adding runtime behavior only after the exported ONNX contract and sidecar files are stable.
Minimal Runtime Request
Use the 0.7 runtime request structure when embedding TensorRT Edge-LLM in a C++
application. Requests are batched as request.requests, each request contains
chat-template messages, and generated text is returned in response.outputTexts.
For text-only engines, pass an empty multimodalEngineDir; set it to the
visual or audio engine directory for multimodal workflows.
```cpp
#include "runtime/llmInferenceRuntime.h"
#include "runtime/llmRuntimeUtils.h"

#include <cuda_runtime_api.h>

#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>

int main()
{
    // Create the CUDA stream used for runtime construction and inference.
    cudaStream_t stream{};
    if (cudaStreamCreate(&stream) != cudaSuccess)
    {
        std::cerr << "Failed to create CUDA stream" << std::endl;
        return 1;
    }

    std::string engineDir = "/path/to/engine";
    std::string multimodalEngineDir = ""; // empty string = LLM-only, no multimodal engines
    std::unordered_map<std::string, std::string> loraWeightsMap{};

    trt_edgellm::rt::LLMInferenceRuntime runtime(
        engineDir, multimodalEngineDir, loraWeightsMap, stream);

    // One batch entry; sampling parameters are set on the batched request.
    trt_edgellm::rt::LLMGenerationRequest request;
    request.requests.resize(1);
    request.temperature = 1.0F;
    request.topK = 50;
    request.topP = 0.8F;
    request.maxGenerateLength = 100;

    // Build a single user message in chat-template form.
    trt_edgellm::rt::Message userMsg;
    userMsg.role = "user";
    userMsg.contents.push_back(
        trt_edgellm::rt::Message::MessageContent{
            "text", "What is the capital of France?"});
    request.requests[0].messages.push_back(std::move(userMsg));

    // Run generation and print the first output text on success.
    int exitCode = 0;
    trt_edgellm::rt::LLMGenerationResponse response;
    if (runtime.handleRequest(request, response, stream)
        && !response.outputTexts.empty())
    {
        std::cout << "Generated: " << response.outputTexts[0] << std::endl;
    }
    else
    {
        std::cerr << "Generation failed or returned no output" << std::endl;
        exitCode = 1;
    }

    cudaStreamDestroy(stream);
    return exitCode;
}
```