LoRA (Low-Rank Adaptation)#

Overview#

TensorRT Edge-LLM supports two approaches for using LoRA adapters:

Approach	Scripts	Use Case
Static Merge	`tensorrt-edgellm-merge-lora`	Permanently merge LoRA into base model before export
Dynamic Runtime	`tensorrt-edgellm-insert-lora` + `tensorrt-edgellm-process-lora`	Switch between multiple adapters at runtime

Approach 1: Static LoRA Merge#

Permanently merges LoRA weights into the base model. Use this when:

The LoRA is always required (e.g., Phi-4-multimodal vision-lora)
You don’t need runtime adapter switching
You want simpler deployment with a single merged model

Workflow#

# Step 1: Merge LoRA into base model
tensorrt-edgellm-merge-lora \
  --model_dir Phi-4-multimodal-instruct \
  --lora_dir Phi-4-multimodal-instruct/vision-lora \
  --output_dir merged_model

# Step 2: Continue with standard export pipeline
tensorrt-edgellm-quantize-llm \
  --model_dir merged_model \
  --output_dir quantized \
  --quantization fp8

tensorrt-edgellm-export-llm \
  --model_dir quantized \
  --output_dir llm_onnx

# Step 3-4: Build and run as usual (no LoRA flags needed)

Approach 2: Dynamic Runtime LoRA#

Enables switching between multiple LoRA adapters at runtime without rebuilding engines. Use this when:

You have multiple domain-specific adapters
You need multi-tenant serving with different fine-tuned models
You want A/B testing between adapter variants

Workflow#

# Step 1: Export base model
tensorrt-edgellm-export-llm \
  --model_dir Qwen/Qwen2.5-0.5B-Instruct \
  --output_dir llm_onnx

# Step 2: Insert LoRA support into ONNX (creates lora_model.onnx)
tensorrt-edgellm-insert-lora \
  --onnx_dir llm_onnx

# Step 3: Process each LoRA adapter
tensorrt-edgellm-process-lora \
  --input_dir /path/to/adapter1 \
  --output_dir llm_onnx/lora_weights/adapter1

tensorrt-edgellm-process-lora \
  --input_dir /path/to/adapter2 \
  --output_dir llm_onnx/lora_weights/adapter2

# Step 4: Build engine with LoRA support
./build/examples/llm/llm_build \
  --onnxDir llm_onnx \
  --engineDir engines \
  --maxBatchSize 1 \
  --maxLoraRank 64

# Step 5: Run inference with adapter selection (see Input Format below)
./build/examples/llm/llm_inference \
  --engineDir engines \
  --inputFile input.json \
  --outputFile output.json

Input Format for Runtime LoRA#

Specify available adapters and select per-request:

{
    "available_lora_weights": {
        "french": "/path/to/engines/lora_weights/french/processed_adapter_model.safetensors",
        "medical": "/path/to/engines/lora_weights/medical/processed_adapter_model.safetensors"
    },
    "requests": [
        {
            "messages": [
                {"role": "user", "content": "Translate to French: Hello world"}
            ],
            "lora_name": "french"
        },
        {
            "messages": [
                {"role": "user", "content": "What is aspirin?"}
            ],
            "lora_name": "medical"
        }
    ]
}

Note: All requests in the same batch must use the same LoRA adapter. To disable LoRA, omit lora_name or set it to empty string.

Script Reference#

`tensorrt-edgellm-merge-lora`#

Permanently merges LoRA weights into a base HuggingFace model.

Argument	Required	Default	Description
`--model_dir`	Yes	-	Base model directory
`--lora_dir`	Yes	-	LoRA checkpoint directory
`--output_dir`	Yes	-	Output directory for merged model
`--device`	No	`cuda`	Device for loading model

`tensorrt-edgellm-insert-lora`#

Inserts LoRA patterns into an exported ONNX model, creating lora_model.onnx.

Argument	Required	Description
`--onnx_dir`	Yes	Directory containing `model.onnx`

Output: Creates lora_model.onnx in the same directory.

`tensorrt-edgellm-process-lora`#

Processes HuggingFace LoRA adapter weights for runtime use.

Argument	Required	Description
`--input_dir`	Yes	Directory with `adapter_config.json` and `adapter_model.safetensors`
`--output_dir`	Yes	Output directory for processed weights

Output: Creates processed_adapter_model.safetensors and config.json.

Processing:

Converts bf16 → fp16
Applies scaling: lora_B *= lora_alpha / r
Transposes tensors to correct shapes
Filters out norm and lm_head layers

Build Parameters#

When building with dynamic LoRA support:

Parameter	Description
`--maxLoraRank`	Maximum LoRA rank to support (e.g., 64). Set to 0 to disable LoRA.

Notes#

Static merge is simpler but doesn’t allow runtime switching
Dynamic LoRA adds slight overhead but enables multi-adapter deployments
All adapters must have rank ≤ --maxLoraRank specified at build time
CUDA graphs are captured separately for each LoRA configuration

LoRA (Low-Rank Adaptation)#

Overview#

Approach 1: Static LoRA Merge#

Workflow#

Approach 2: Dynamic Runtime LoRA#

Workflow#

Input Format for Runtime LoRA#

Script Reference#

tensorrt-edgellm-merge-lora#

tensorrt-edgellm-insert-lora#

tensorrt-edgellm-process-lora#

Build Parameters#

Notes#

`tensorrt-edgellm-merge-lora`#

`tensorrt-edgellm-insert-lora`#

`tensorrt-edgellm-process-lora`#