# LoRA (Low-Rank Adaptation)

## Overview

TensorRT Edge-LLM supports two approaches for using LoRA adapters:
| Approach | Scripts | Use Case |
|---|---|---|
| Static Merge | `tensorrt-edgellm-merge-lora` | Permanently merge LoRA into the base model before export |
| Dynamic Runtime | `tensorrt-edgellm-insert-lora`, `tensorrt-edgellm-process-lora` | Switch between multiple adapters at runtime |
## Approach 1: Static LoRA Merge

Permanently merges LoRA weights into the base model. Use this when:

- The LoRA is always required (e.g., Phi-4-multimodal vision-lora)
- You don’t need runtime adapter switching
- You want simpler deployment with a single merged model
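In both approaches the underlying update is the same: a LoRA adapter adds a scaled low-rank term, `W' = W + (lora_alpha / r) * lora_B @ lora_A`, which is why it can either be folded into `W` once (static merge) or kept as a separate branch selected at runtime (dynamic). A minimal numpy sketch with toy dimensions (all names and sizes here are illustrative, not from the tool):

```python
import numpy as np

# Toy dimensions: a 16x16 weight with a rank-4 LoRA update.
d, r, alpha = 16, 4, 8

rng = np.random.default_rng(0)
W = rng.standard_normal((d, d)).astype(np.float32)       # frozen base weight
lora_A = rng.standard_normal((r, d)).astype(np.float32)  # down-projection
lora_B = rng.standard_normal((d, r)).astype(np.float32)  # up-projection

# Static merge: fold the scaled low-rank update into the base weight once.
W_merged = W + (alpha / r) * (lora_B @ lora_A)

# The merged weight reproduces the adapted forward pass exactly,
# so no LoRA-specific work remains at inference time.
x = rng.standard_normal((d,)).astype(np.float32)
assert np.allclose(W_merged @ x,
                   W @ x + (alpha / r) * (lora_B @ (lora_A @ x)),
                   atol=1e-4)
```

The dynamic-runtime approach keeps `lora_A`/`lora_B` as separate inputs instead, trading a little extra compute per token for the ability to swap adapters without touching `W`.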
### Workflow

```bash
# Step 1: Merge LoRA into base model
tensorrt-edgellm-merge-lora \
    --model_dir Phi-4-multimodal-instruct \
    --lora_dir Phi-4-multimodal-instruct/vision-lora \
    --output_dir merged_model

# Step 2: Continue with standard export pipeline
tensorrt-edgellm-quantize-llm \
    --model_dir merged_model \
    --output_dir quantized \
    --quantization fp8

tensorrt-edgellm-export-llm \
    --model_dir quantized \
    --output_dir llm_onnx

# Step 3-4: Build and run as usual (no LoRA flags needed)
```
## Approach 2: Dynamic Runtime LoRA

Enables switching between multiple LoRA adapters at runtime without rebuilding engines. Use this when:

- You have multiple domain-specific adapters
- You need multi-tenant serving with different fine-tuned models
- You want A/B testing between adapter variants

### Workflow
```bash
# Step 1: Export base model
tensorrt-edgellm-export-llm \
    --model_dir Qwen/Qwen2.5-0.5B-Instruct \
    --output_dir llm_onnx

# Step 2: Insert LoRA support into ONNX (creates lora_model.onnx)
tensorrt-edgellm-insert-lora \
    --onnx_dir llm_onnx

# Step 3: Process each LoRA adapter
tensorrt-edgellm-process-lora \
    --input_dir /path/to/adapter1 \
    --output_dir llm_onnx/lora_weights/adapter1

tensorrt-edgellm-process-lora \
    --input_dir /path/to/adapter2 \
    --output_dir llm_onnx/lora_weights/adapter2

# Step 4: Build engine with LoRA support
./build/examples/llm/llm_build \
    --onnxDir llm_onnx \
    --engineDir engines \
    --maxBatchSize 1 \
    --maxLoraRank 64

# Step 5: Run inference with adapter selection (see Input Format below)
./build/examples/llm/llm_inference \
    --engineDir engines \
    --inputFile input.json \
    --outputFile output.json
```
### Input Format for Runtime LoRA

Specify available adapters and select one per request:
```json
{
  "available_lora_weights": {
    "french": "/path/to/engines/lora_weights/french/processed_adapter_model.safetensors",
    "medical": "/path/to/engines/lora_weights/medical/processed_adapter_model.safetensors"
  },
  "requests": [
    {
      "messages": [
        {"role": "user", "content": "Translate to French: Hello world"}
      ],
      "lora_name": "french"
    },
    {
      "messages": [
        {"role": "user", "content": "What is aspirin?"}
      ],
      "lora_name": "medical"
    }
  ]
}
```
Note: All requests in the same batch must use the same LoRA adapter. To disable LoRA, omit `lora_name` or set it to an empty string.
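Because all requests in one batch must share an adapter, a client with a mixed workload can pre-group requests by `lora_name` before writing the input file. A small sketch of one way to do that (the helper name and batching policy are illustrative, not part of the tool):

```python
from itertools import groupby

def batch_by_adapter(requests, max_batch_size=1):
    """Split a mixed request list into batches that each use one adapter."""
    keyfn = lambda r: r.get("lora_name", "")  # "" means LoRA disabled
    batches = []
    for name, group in groupby(sorted(requests, key=keyfn), key=keyfn):
        group = list(group)
        # Respect the engine's --maxBatchSize within each adapter group.
        for i in range(0, len(group), max_batch_size):
            batches.append((name, group[i:i + max_batch_size]))
    return batches

requests = [
    {"messages": [{"role": "user", "content": "Bonjour"}], "lora_name": "french"},
    {"messages": [{"role": "user", "content": "What is aspirin?"}], "lora_name": "medical"},
    {"messages": [{"role": "user", "content": "Merci"}], "lora_name": "french"},
]
for name, batch in batch_by_adapter(requests, max_batch_size=2):
    print(name, len(batch))  # french 2, then medical 1
```

Each `(name, batch)` pair can then be emitted as its own `requests` list in a separate inference call.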
## Script Reference

### tensorrt-edgellm-merge-lora

Permanently merges LoRA weights into a base HuggingFace model.
| Argument | Required | Default | Description |
|---|---|---|---|
| `--model_dir` | Yes | - | Base model directory |
| `--lora_dir` | Yes | - | LoRA checkpoint directory |
| `--output_dir` | Yes | - | Output directory for merged model |
| `--device` | No |  | Device for loading model |
### tensorrt-edgellm-insert-lora

Inserts LoRA patterns into an exported ONNX model, creating `lora_model.onnx`.

| Argument | Required | Description |
|---|---|---|
| `--onnx_dir` | Yes | Directory containing the exported ONNX model |

Output: Creates `lora_model.onnx` in the same directory.
### tensorrt-edgellm-process-lora

Processes HuggingFace LoRA adapter weights for runtime use.

| Argument | Required | Description |
|---|---|---|
| `--input_dir` | Yes | Directory with the HuggingFace LoRA adapter |
| `--output_dir` | Yes | Output directory for processed weights |

Output: Creates `processed_adapter_model.safetensors` and `config.json`.
Processing steps:

- Converts bf16 → fp16
- Applies scaling: `lora_B *= lora_alpha / r`
- Transposes tensors to correct shapes
- Filters out norm and lm_head layers
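The steps above can be sketched in a few lines. This uses numpy as a stand-in (numpy has no bf16 type), and the PEFT-style `...lora_A.weight` / `...lora_B.weight` tensor naming is an assumption:

```python
import numpy as np

def process_adapter(state_dict, lora_alpha, r):
    """Sketch of per-adapter weight processing, not the tool's real code."""
    out = {}
    for name, tensor in state_dict.items():
        # Filter out layers the runtime does not apply LoRA to.
        if "norm" in name or "lm_head" in name:
            continue
        t = tensor.astype(np.float16)           # bf16 -> fp16 (numpy stand-in)
        if "lora_B" in name:
            t = t * np.float16(lora_alpha / r)  # fold scaling into lora_B
        out[name] = t.T                         # transpose to runtime layout
    return out
```

Folding the `lora_alpha / r` scale into `lora_B` ahead of time means the engine only needs two plain MatMuls per adapted layer at inference time.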
## Build Parameters

When building with dynamic LoRA support:

| Parameter | Description |
|---|---|
| `--maxLoraRank` | Maximum LoRA rank to support (e.g., 64). Set to 0 to disable LoRA. |
## Notes

- Static merge is simpler but doesn’t allow runtime switching
- Dynamic LoRA adds slight overhead but enables multi-adapter deployments
- All adapters must have rank ≤ the `--maxLoraRank` specified at build time
- CUDA graphs are captured separately for each LoRA configuration
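Since an adapter with rank larger than `--maxLoraRank` cannot be served, it is worth validating the processed adapters before building. A sketch, assuming each adapter directory's `config.json` carries a PEFT-style `r` field (adjust the key if your config differs):

```python
import json
from pathlib import Path

def check_ranks(lora_weights_dir, max_lora_rank):
    """Return (adapter_name, rank) pairs that exceed the build-time limit."""
    offenders = []
    for cfg_path in Path(lora_weights_dir).glob("*/config.json"):
        rank = json.loads(cfg_path.read_text()).get("r")
        if rank is not None and rank > max_lora_rank:
            offenders.append((cfg_path.parent.name, rank))
    return offenders
```

Running this against `llm_onnx/lora_weights` with the same value you pass to `--maxLoraRank` catches oversized adapters before an engine build.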