Alpamayo-R1-10B (Vision-Language-Action)#
Complete workflow for running nvidia/Alpamayo-R1-10B, a Vision-Language-Action (VLA) model that combines a VLM backbone with an action expert.
Currently supported model: nvidia/Alpamayo-R1-10B
Prerequisites: Complete the Installation Guide for both x86 host and edge device before proceeding. The Alpamayo 1 checkpoint may require Hugging Face login and license access before downloading.
Architecture Overview#
Alpamayo-R1-10B runs as a chained VLM + action pipeline:
VLM backbone — exported through
llm_loaderasonnx/llmandonnx/visual; it processes image, text, and trajectory-history tokens and generates language output.Action expert — exported through
llm_loaderasonnx/action; it consumes the VLM KV cache and produces future action waypoints.
Both stages are run with a single action_inference invocation. The engines are built separately and loaded together at runtime.
The current TensorRT Edge-LLM input parser accepts image files, text, and
trajectory history points as [x, y, z]. The action output is written as
output_trajectory, a list of (accel, kappa) acceleration and curvature pairs.
Step 1: Export (x86 Host)#
export EDGE_LLM_PATH=/path/to/TensorRT-Edge-LLM
export PYTHONPATH=$EDGE_LLM_PATH:$EDGE_LLM_PATH/experimental:$PYTHONPATH
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Alpamayo-R1-10B
mkdir -p $WORKSPACE_DIR
cd $WORKSPACE_DIR
# Download Alpamayo-R1-10B from HuggingFace
hf auth login
hf download nvidia/Alpamayo-R1-10B --local-dir $MODEL_NAME
# Export language model backbone, visual encoder, and action expert
python -m llm_loader.export_all_cli \
$MODEL_NAME \
$MODEL_NAME/onnx \
--max-kv-cache-capacity 4096
This creates onnx/llm, onnx/visual, and onnx/action.
Note: Only FP16 is supported for Alpamayo export in this release. The
--max-kv-cache-capacityvalue must match the--maxKVCacheCapacityused when building the LLM engine in Step 3.
Step 2: Transfer to Device#
scp -r $MODEL_NAME/onnx \
<device_user>@<device_ip>:~/tensorrt-edgellm-workspace/$MODEL_NAME/
Step 3: Build Engines (Thor Device)#
export WORKSPACE_DIR=$HOME/tensorrt-edgellm-workspace
export MODEL_NAME=Alpamayo-R1-10B
cd ~/TensorRT-Edge-LLM
# Build language model engine
./build/examples/llm/llm_build \
--onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/llm \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
--maxInputLen 3424 \
--maxKVCacheCapacity 4096 \
--maxBatchSize 6
# Build visual encoder engine
./build/examples/multimodal/visual_build \
--onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/visual \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
--minImageTokens 160 \
--maxImageTokens 18432 \
--maxImageTokensPerImage 192
# Build action expert engine
./build/examples/multimodal/action_build \
--onnxDir $WORKSPACE_DIR/$MODEL_NAME/onnx/action \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
--maxBatchSize 6
visual_build writes $WORKSPACE_DIR/$MODEL_NAME/engines/visual/, and
action_build writes $WORKSPACE_DIR/$MODEL_NAME/engines/action/. Both
subdirectories are loaded through --multimodalEngineDir during inference.
Step 4: Run Inference (Thor Device)#
Create an input file $WORKSPACE_DIR/input_action.json. The trajectory
content item provides past egomotion history as [x, y, z] positions. Replace
the image paths with actual RGB image files. The checkpoint card describes the
primary setting as 4 cameras over 4 timesteps; the runtime parser accepts the
images as an ordered list of image content items.
{
"batch_size": 1,
"temperature": 1.0,
"top_p": 1.0,
"top_k": 50,
"max_generate_length": 128,
"requests": [
{
"messages": [
{
"role": "system",
"content": [
{
"type": "text",
"text": "You are a driving assistant that generates safe and accurate actions."
}
]
},
{
"role": "user",
"content": [
{"type": "image", "image": "/path/to/frame_00.png"},
{"type": "image", "image": "/path/to/frame_01.png"},
{"type": "image", "image": "/path/to/frame_02.png"},
{"type": "image", "image": "/path/to/frame_03.png"},
{"type": "image", "image": "/path/to/frame_04.png"},
{"type": "image", "image": "/path/to/frame_05.png"},
{"type": "image", "image": "/path/to/frame_06.png"},
{"type": "image", "image": "/path/to/frame_07.png"},
{"type": "image", "image": "/path/to/frame_08.png"},
{"type": "image", "image": "/path/to/frame_09.png"},
{"type": "image", "image": "/path/to/frame_10.png"},
{"type": "image", "image": "/path/to/frame_11.png"},
{"type": "image", "image": "/path/to/frame_12.png"},
{"type": "image", "image": "/path/to/frame_13.png"},
{"type": "image", "image": "/path/to/frame_14.png"},
{"type": "image", "image": "/path/to/frame_15.png"},
{
"type": "trajectory",
"trajectory": [
[-13.570470809936523, 0.06137955188751221, -0.021795138716697693],
[-12.617349624633789, 0.05502257123589516, -0.021485785022377968],
[-11.672354698181152, 0.04983367770910263, -0.019761288538575172],
[-10.733928680419922, 0.04469458386301994, -0.01891801320016384],
[-9.802403450012207, 0.03998061269521713, -0.019592544063925743],
[-8.87903118133545, 0.034502264112234116, -0.016824787482619286],
[-7.963467121124268, 0.03007129393517971, -0.011910118162631989],
[-7.054953098297119, 0.02519317716360092, -0.008388019166886806],
[-6.15440559387207, 0.020703254267573357, -0.003812047652900219],
[-5.261260032653809, 0.016889382153749466, 0.0007052596774883568],
[-4.374241828918457, 0.012481293641030788, 0.004680476617068052],
[-3.4911155700683594, 0.00914590060710907, 0.0029235705733299255],
[-2.611931085586548, 0.005678099580109119, 7.79956899350509e-05],
[-1.736931562423706, 0.003552841255441308, -0.0018224255181849003],
[-0.8664319515228271, 0.001968142343685031, -0.001738768769428134],
[0.0, 0.0, 0.0]
]
},
{
"type": "text",
"text": "output the chain-of-thought reasoning of the driving process, then output the future trajectory."
}
]
}
]
}
]
}
Run inference:
cd ~/TensorRT-Edge-LLM
./build/examples/multimodal/action_inference \
--engineDir $WORKSPACE_DIR/$MODEL_NAME/engines/llm \
--multimodalEngineDir $WORKSPACE_DIR/$MODEL_NAME/engines \
--inputFile $WORKSPACE_DIR/input_action.json \
--outputFile $WORKSPACE_DIR/output_action.json
Check output_action.json for the response. Each entry includes output_text (VLM language output) and output_trajectory (predicted (accel, kappa) acceleration and curvature pairs).