LoRA (Low-Rank Adaptation)#
TensorRT Edge-LLM supports LoRA through the checkpoint-based llm_loader workflow.
Use dynamic LoRA when you need to select adapters at runtime. Use static LoRA
merge when the adapter is always required, such as the Phi-4-Multimodal
vision-lora adapter.
Setup#
export EDGE_LLM_PATH=/path/to/TensorRT-Edge-LLM
export PYTHONPATH=$EDGE_LLM_PATH:$EDGE_LLM_PATH/experimental:$PYTHONPATH
Dynamic Runtime LoRA#
This workflow exports the base checkpoint, inserts LoRA inputs into the ONNX graph, processes HuggingFace adapter weights, then builds an engine with a maximum adapter rank.
# Step 1: Export the base model with llm_loader
python -m llm_loader.export_all_cli \
/path/to/base_model \
/tmp/onnx_output
# Step 2: Insert LoRA support into the exported LLM graph
python -m llm_loader.lora.insert_lora_cli \
--onnx_dir /tmp/onnx_output/llm
# Step 3: Convert each adapter to the runtime sidecar format
python -m llm_loader.lora.process_lora_weights_cli \
--input_dir /path/to/adapter1 \
--output_dir /tmp/onnx_output/llm/lora_weights/adapter1
python -m llm_loader.lora.process_lora_weights_cli \
--input_dir /path/to/adapter2 \
--output_dir /tmp/onnx_output/llm/lora_weights/adapter2
# Step 4: Build the engine with LoRA support
./build/examples/llm/llm_build \
--onnxDir /tmp/onnx_output/llm \
--engineDir engines \
--maxBatchSize 1 \
--maxLoraRank 64
# Step 5: Run inference with adapter selection in input.json
./build/examples/llm/llm_inference \
--engineDir engines \
--inputFile input.json \
--outputFile output.json
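When there are more than a couple of adapters, Step 3 can be generated instead of typed by hand. The following is a minimal sketch, not part of the toolkit: `process_adapter_commands` and the adapter names are hypothetical, and only the `--input_dir` and `--output_dir` flags shown above are used. It prints the commands rather than running them.

```python
import shlex
from pathlib import PurePosixPath


def process_adapter_commands(adapters, onnx_llm_dir):
    """Return one argv list per adapter for process_lora_weights_cli.

    `adapters` maps an adapter name (hypothetical) to its HuggingFace
    adapter directory. The output layout mirrors Step 3 above:
    <onnx_llm_dir>/lora_weights/<name>.
    """
    weights_root = PurePosixPath(onnx_llm_dir) / "lora_weights"
    commands = []
    for name, adapter_dir in sorted(adapters.items()):
        commands.append([
            "python", "-m", "llm_loader.lora.process_lora_weights_cli",
            "--input_dir", str(adapter_dir),
            "--output_dir", str(weights_root / name),
        ])
    return commands


adapters = {"adapter1": "/path/to/adapter1", "adapter2": "/path/to/adapter2"}
for argv in process_adapter_commands(adapters, "/tmp/onnx_output/llm"):
    # Printed only; pass argv to subprocess.run(argv, check=True) to execute.
    print(shlex.join(argv))
```

Keeping each adapter in its own subdirectory under `lora_weights` makes it straightforward to reference the processed files later from `input.json`.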
Static LoRA Merge#
Static merge permanently applies a LoRA adapter to the base HuggingFace checkpoint before optional quantization and ONNX export.
# Step 1: Merge LoRA into the base checkpoint
python -m llm_loader.lora.merge_lora_cli \
--model_dir Phi-4-multimodal-instruct \
--lora_dir Phi-4-multimodal-instruct/vision-lora \
--output_dir merged_model
# Step 2: Optional quantization of the merged checkpoint
python -m experimental.quantization llm \
--model_dir merged_model \
--output_dir quantized_model \
--quantization nvfp4 \
--lm_head_quantization nvfp4
# Step 3: Export the checkpoint with llm_loader
python -m llm_loader.export_all_cli \
quantized_model \
onnx_output
If you do not need weight quantization, export merged_model directly in step 3.
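To make "permanently applies" concrete, the sketch below uses the standard LoRA formulation, in which a merged weight is W + (alpha / r) * B A. The internals of merge_lora_cli are not described here, so treat this as an illustration of the general technique under that assumption; the function names and the toy values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return weight + (alpha / rank) * (lora_b @ lora_a)."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with a rank-1 adapter (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # shape (rank, in_features)
B = [[0.5], [-0.5]]    # shape (out_features, rank)
x = [[3.0], [4.0]]     # column-vector input
ALPHA, RANK = 2.0, 1

merged = merge_lora(W, A, B, alpha=ALPHA, rank=RANK)

# Dynamic-style evaluation: base output plus the scaled low-rank correction.
base_out = matmul(W, x)
lora_out = matmul(B, matmul(A, x))
dynamic = [[b[0] + (ALPHA / RANK) * l[0]] for b, l in zip(base_out, lora_out)]

# Static-style evaluation: a single matmul with the merged weight.
static = matmul(merged, x)
assert static == dynamic
```

Because the correction is folded into the weight, the merged checkpoint needs no adapter inputs at runtime, which is why it can go straight into quantization and export.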
Input Format#
Specify available adapters and select one adapter per request:
{
"available_lora_weights": {
"french": "/path/to/lora_weights/french/processed_adapter_model.safetensors",
"medical": "/path/to/lora_weights/medical/processed_adapter_model.safetensors"
},
"requests": [
{
"messages": [
{"role": "user", "content": "Translate to French: Hello world"}
],
"lora_name": "french"
},
{
"messages": [
{"role": "user", "content": "What is aspirin?"}
],
"lora_name": "medical"
}
]
}
All requests in the same batch must use the same LoRA adapter. To disable LoRA,
omit lora_name or set it to an empty string.
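Since a batch may only use one adapter, it can help to check an input file before running inference. The following is a minimal sketch of such a pre-check, assuming only the JSON layout shown above; `group_requests_by_adapter` is a hypothetical helper, and how the runtime itself forms batches is not covered here.

```python
import json
from collections import defaultdict


def group_requests_by_adapter(config):
    """Group requests by lora_name so each group uses a single adapter.

    Requests with a missing or empty lora_name land under "" (LoRA disabled).
    Raises ValueError if a request names an adapter that is not declared
    in available_lora_weights.
    """
    available = config.get("available_lora_weights", {})
    groups = defaultdict(list)
    for index, request in enumerate(config.get("requests", [])):
        name = request.get("lora_name") or ""
        if name and name not in available:
            raise ValueError(f"request {index} uses undeclared adapter {name!r}")
        groups[name].append(request)
    return dict(groups)


config = json.loads("""
{
  "available_lora_weights": {
    "french": "/path/to/lora_weights/french/processed_adapter_model.safetensors",
    "medical": "/path/to/lora_weights/medical/processed_adapter_model.safetensors"
  },
  "requests": [
    {"messages": [{"role": "user", "content": "Translate to French: Hello world"}],
     "lora_name": "french"},
    {"messages": [{"role": "user", "content": "What is aspirin?"}],
     "lora_name": "medical"},
    {"messages": [{"role": "user", "content": "Hello"}]}
  ]
}
""")

groups = group_requests_by_adapter(config)
for name, requests in groups.items():
    print(name or "(no LoRA)", len(requests))
```

Each group could then be written to its own input file so that no batch mixes adapters.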
Script Reference#
llm_loader.lora.insert_lora_cli#
Inserts LoRA patterns into an exported llm_loader ONNX model and creates
lora_model.onnx in the same directory.
| Argument | Required | Description |
|---|---|---|
| --onnx_dir | Yes | Directory containing the exported llm_loader ONNX model |
llm_loader.lora.process_lora_weights_cli#
Processes HuggingFace LoRA adapter weights for runtime use.
| Argument | Required | Description |
|---|---|---|
| --input_dir | Yes | Directory with the HuggingFace LoRA adapter weights |
| --output_dir | Yes | Output directory for processed weights |
The output contains processed_adapter_model.safetensors and config.json.
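The `available_lora_weights` map in the input file can be derived from these output directories. The sketch below assumes one subdirectory per adapter under a common root, as in the dynamic workflow above; `discover_adapters` is a hypothetical helper, and it relies only on the `processed_adapter_model.safetensors` file name.

```python
import json
import tempfile
from pathlib import Path

PROCESSED_NAME = "processed_adapter_model.safetensors"


def discover_adapters(weights_root):
    """Map each subdirectory name to its processed weights file, if present."""
    found = {}
    for child in sorted(Path(weights_root).iterdir()):
        candidate = child / PROCESSED_NAME
        if child.is_dir() and candidate.is_file():
            found[child.name] = str(candidate)
    return found


# Demo on a throwaway directory that imitates the processed output layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("french", "medical"):
        (root / name).mkdir()
        (root / name / PROCESSED_NAME).touch()
        (root / name / "config.json").write_text("{}")
    (root / "incomplete").mkdir()  # no processed weights, so it is skipped
    adapters = discover_adapters(root)
    print(json.dumps({"available_lora_weights": adapters}, indent=2))
```

Directories without processed weights are skipped, so a half-converted adapter does not end up in the input file.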
llm_loader.lora.merge_lora_cli#
Permanently merges LoRA weights into a base HuggingFace checkpoint.
| Argument | Required | Default | Description |
|---|---|---|---|
| --model_dir | Yes | - | Base model directory |
| --lora_dir | Yes | - | LoRA adapter directory |
| --output_dir | Yes | - | Output directory for merged checkpoint |
|  | No |  | Device used while merging |
|  | No |  | Model dtype used while merging |
Build Parameters#
| Parameter | Description |
|---|---|
| --maxLoraRank | Maximum LoRA rank to support. Set to |
Notes#
Static merge produces a single checkpoint and does not require runtime LoRA flags.
Dynamic LoRA enables adapter switching without rebuilding, but all adapters must have rank less than or equal to --maxLoraRank.
CUDA graphs are captured separately for each LoRA configuration.
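The rank constraint can be checked before building the engine. The sketch below assumes each source adapter directory follows the HuggingFace PEFT layout, with an `adapter_config.json` whose `r` field holds the rank; that layout is an assumption about the adapters, not something this page specifies, and the helper names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def adapter_rank(adapter_dir):
    """Read the LoRA rank (`r`) from a PEFT-style adapter_config.json."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    with open(config_path, encoding="utf-8") as handle:
        return int(json.load(handle)["r"])


def check_ranks(adapter_dirs, max_lora_rank):
    """Return {adapter_dir: rank} for adapters whose rank exceeds max_lora_rank."""
    ranks = {str(path): adapter_rank(path) for path in adapter_dirs}
    return {path: rank for path, rank in ranks.items() if rank > max_lora_rank}


# Demo with two fake adapters of rank 16 and 128, checked against a limit of 64.
with tempfile.TemporaryDirectory() as tmp:
    dirs = []
    for name, rank in (("adapter1", 16), ("adapter2", 128)):
        path = Path(tmp) / name
        path.mkdir()
        (path / "adapter_config.json").write_text(json.dumps({"r": rank}))
        dirs.append(path)
    too_large = check_ranks(dirs, max_lora_rank=64)
    largest = max(adapter_rank(path) for path in dirs)
    print("adapters over the limit:", too_large)
    print("smallest workable --maxLoraRank:", largest)
```

Taking the maximum rank across all adapters gives the smallest `--maxLoraRank` value that satisfies the constraint for the whole set.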