xr-render-demo — architecture#
This page describes the architecture of the xr-render-demo sample. For the
user-facing
quickstart, refer to the main README.
For inference-server mechanics shared with other samples, refer to
docs/ai-services.md.
Process stack#
The orchestrator (xr_render_demo, stdlib-only via xr-ai-launcher) starts
its processes concurrently. There is no startup ordering — every process
must tolerate peers that are not yet ready. run_stack is fail-fast: any
exit terminates the whole stack.
Role |
Directory |
Command |
Port |
|---|---|---|---|
hub |
|
|
8080 (https + wss /rtc proxy); LiveKit 7880 stays on 127.0.0.1 |
cloudxr |
|
|
48322 (WSS proxy) |
stt |
|
|
8103 |
tts |
|
|
8105 |
vlm |
|
|
8100 |
llm |
|
|
8106 |
agent-llm |
|
|
8107 |
vlm-mcp |
|
|
8240 |
video-mcp |
|
|
8210 |
render-mcp |
|
|
8220 |
oxr-mcp |
|
|
8230 |
vec-mcp |
|
|
8250 |
worker |
|
|
— |
Before starting the stack, the orchestrator runs two setup steps:
Web vendor bundle — builds the CloudXR + LiveKit ESM bundle via
client-samples/web-xr-build/build.sh(skipped if already present; requiresnpm).LOVR binary — auto-downloads LOVR v0.18.0 AppImage to
deps/lovr/if not present and sets$LOVR_BIN. Resolution order:$LOVR_BINenv var →lovr_bin:inrender_mcp.yaml→ cached AppImage → fresh download.
GPU pinning for the XR side#
gpu_index (int) in yaml/cloudxr_runtime.yaml selects the physical GPU
that the CloudXR compositor pins to. The cloudxr-runtime wrapper translates
the index to a PCI bus address via nvidia-smi and sets three selectors
(CUDA_VISIBLE_DEVICES, VK_LOADER_DEVICE_SELECT, DRI_PRIME) on its own
environment before spawning the native service. All three are required: the
compositor runs on Vulkan and needs the matching CUDA device for interop,
so on a multi-GPU host Vulkan and CUDA can otherwise land on different
physical GPUs.
The same three selectors are appended to cloudxr.env (under
~/.cloudxr/run/). render-mcp sources that file when it spawns LOVR, so
LOVR inherits the pin; oxr-mcp picks it up the same way.
If nvidia-smi is missing, fails, reports no GPUs, or does not list the
requested index, the wrapper logs a warning and skips pinning rather than
failing startup.
The corresponding model-side fields live under
agent-samples/model-servers/yaml/<profile>/. Set them to different GPUs so
the XR compositor and the agentic LLM do not share a card.
Worker configuration#
The worker reads two YAML files:
yaml/xr_render_demo_worker.yaml— MCP base URLs and VAD tunables.yaml/models.yaml(path set bymodels_yaml:in the worker YAML) — model endpoint declarations consumed byxr-ai-models. Each entry maps a logical name (llm,agent_llm,stt,tts,vlm) to akind: preset:<name>and abase_url. Edit this file to change which model runs where without touching the worker code.
The LLM servers#
Both are vLLM execvp shims — a small Python wrapper that reads YAML configuration,
sets HF_HOME and token environment variables, then os.execvps into vllm serve. The
Python process is replaced by vLLM; vLLM owns the HTTP API, weight loading,
and tool calling from that point on.
Llama-3.1-Nemotron-Nano-8B-v1 — port 8106 — fast reactive brain#
vllm serve with --tool-call-parser llama3_json --enable-auto-tool-choice.
enforce_eager defaults to false. Used for three cheap, latency-sensitive
calls — none of which actually use tool calling:
Quick-ack — fires in parallel with the agentic loop the moment an utterance lands. Returns
{"ack": "On it!", "think": false}— a 3–6 word spoken acknowledgment. Also classifies whether the request needs spatial reasoning (think: true/false), so the 30B model knows before it starts whether to engage its thinking budget. Max 40 tokens, 8s timeout. The ack is always sent on the data channel (agent.progresstopic); it is only also spoken via TTS whenthink=true, since that is when the user will actually be waiting 5–10s and needs to know they were heard.Still-working messages — if the agentic loop exceeds 5s, this model generates a short contextual phrase like “Still finding the right position” on a 7s repeat. Sent to the data channel only — never spoken, to avoid stacking up in the TTS queue behind the real response.
NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 — port 8107 — agentic loop#
vllm serve with --tool-call-parser qwen3_coder and
--reasoning-parser nano_v3 (plugin auto-fetched from the model card into
model_cache). enforce_eager defaults to true — CUDA graph capture +
FlashInfer FP4 MoE autotune silently takes 3–8 minutes on cold start without
it. Requires a Blackwell GPU (B200, RTX PRO 6000, or Jetson Thor) for native
FP4; swap to the BF16 variant for Hopper or Ampere.
This is the model that runs the multi-step tool-calling loop.
VLM — Cosmos-Reason1-7B#
Port 8100 (vlm-server) and port 8240 (vlm-mcp).
Loaded in-process by vlm-server via HuggingFace transformers
(Qwen2.5-VL architecture). <think>…</think> blocks are stripped before
returning. The vlm-mcp is a thin FastMCP wrapper exposing a single
ask_image(question, image_path) tool: it reads the PNG at that path,
base64-encodes it, and POSTs it to vlm-server as an image_url message.
Visual queries from the user are handled by the brain-local
look_at_current_frame(question) tool (see tool routing below), which turns
the camera on automatically, grabs the live frame, and calls vlm-server
directly — bypassing vlm-mcp entirely for the default perception path.
There is a deliberate startup ordering constraint: the worker’s
wait_for_services probe blocks on the VLM’s /health endpoint, which
returns 200 only after weights are fully loaded. This ensures GPU 0 memory
has settled before LOVR starts its Vulkan device, preventing a transient
OOM race.
STT — parakeet-tdt-0.6b-v3#
Port 8103. NeMo ASR in-process. English-only, ~1.5 GB VRAM.
LiveKit mic (int16 PCM) → hub IPC (float32) → XRMediaHubTransport.input()
→ SttProcessor
pre-roll buffer last 10 chunks (~320 ms) kept at all times;
prepended to the utterance buffer on speech onset
so the first word's attack isn't clipped
VAD Silero (ONNX, 512-sample / 32 ms windows,
probability threshold) via shared xr-ai-vad util
accumulates audio while speaking
finalizes when silence ≥ 0.8s AND speech ≥ 0.15s
OR max utterance length (30s) hit
filler filter drops single- and multi-word filler utterances
("um", "uh", "yeah", "okay", "mm-hmm", etc.)
STT call POST multipart/form-data WAV → stt-server :8103
→ TranscriptionFrame pushed downstream
STT calls are serialized — an stt_busy flag prevents a new finalize while
one is in-flight.
TTS — Piper#
Port 8105. rhasspy/piper-voices ONNX. Runs on CPU, ~100 ms per sentence. All
synthesis runs in a thread pool so the asyncio loop is never blocked.
TextFrame (from agentic loop final response, or quick-ack when think=true)
→ TtsProcessor
sentence-batched synthesis
POST text → tts-server :8105 → WAV bytes
RETURN_AUDIO IPC → hub → LiveKit → participant's headphones
allow_interruptions=True in the Pipecat pipeline. A new utterance while TTS
is playing triggers ReturnAudioFlush → hub clears the LiveKit audio queue
for that participant.
Pipecat pipeline#
XRMediaHubTransport.input()
→ SttProcessor (Silero VAD → utterance → parakeet STT
→ TranscriptionFrame)
→ RenderSceneProcessor (quick-ack + agentic loop → TextFrame)
→ TtsProcessor (TextFrame → Piper TTS → return audio)
→ XRMediaHubTransport.output()
Agentic loop#
At worker startup, list_tools() is called on all MCP clients
(render-mcp, oxr-mcp, vlm-mcp, video-mcp, vec-mcp). Results are
converted to OpenAI tool format and held in memory. start_xr and
get_health are excluded from the tool list — the worker calls those
directly, not the LLM.
On each TranscriptionFrame:
Quick-ack fires immediately (Llama-8B :8106, parallel task).
Still-working timer starts (fires at 5s, repeats every 7s, data channel only).
Pre-fetch (concurrent):
get_scene_state+get_head_pose+position_ahead(1.5)— results injected into the user message so the model skips those tool calls and goes straight to the operation.Nemotron-30B :8107 runs with
tools=[…], up to 10 iterations:Model emits
tool_calls→ worker routes and executes → result appended to conversation → next iteration.Tool routing:
look_at_current_frame→ brain-local (intercepts before MCP routing: turns camera on, grabs live frame, callsvlm-serverdirectly); oxr-mcp tools (get_head_pose,position_ahead,position_relative,place_user_relative,place_object_relative,place_inside_by_id,displace_object,displace_objects) →oxr-mcp; vec-mcp tools (between_anchors,world_offset,along_direction,scale_value) →vec-mcp;ask_image→vlm-mcp(with path existence guard); video tools →video-mcp; everything else →render-mcp.Progress message sent on
agent.progresstopic before each tool executes (data channel).If
think=true: reasoning preamble injected into system prompt (RESOLVE object → LOCATE coordinates → COMPUTE new position → EXECUTE). The<think>block stays private; only one short sentence goes to the user. Token budget: 2048 total, 1024 thinking budget.If thinking fills the token budget without a tool call (
finish_reason=length): retry the same iteration withneeds_thinking=False.If the model outputs a bare tool name as text instead of a proper tool call: worker synthesizes a no-arg tool call and continues.
Final response sent on
agent.responsetopic and as aTextFramedownstream to TTS.Turn appended to a rolling 4-turn history buffer — injected as context in future turns so the model understands “fix that”, “undo”, “the one I just added”.
MCP servers#
Server |
Port |
Tools |
|---|---|---|
|
8220 |
|
|
8230 |
|
|
8250 |
|
|
8240 |
|
|
8210 |
|
render-mcp owns the LOVR child process and is the only thing that pushes
ops onto LOVR’s scene socket (msgpack over ZMQ PUSH). oxr-mcp opens a
second headless OpenXR session (XR_MND_HEADLESS) separate from LOVR’s
rendering session — both coexist without contention; the session opens
lazily on first tool call.
Spatial tool surface#
The tool surface is split across oxr-mcp (pose-aware named-direction
helpers) and vec-mcp (pure-math primitives). The split offloads vector
arithmetic the LLM is bad at while keeping pose-dependent math in one place:
oxr-mcp named-direction helpers take a
directionenum (front,back,left,right,above,below, plusnext_toonplace_object_relative) and always-positivedistance. The LLM never applies signs to user-frame axes.place_user_relative(direction, distance): user-anchored teleport (“above my head”, “to my left 1 m”).place_object_relative(origin_x, origin_y, origin_z, direction, distance): object-anchored teleport.direction="front"means toward the user;"back"means away. Left/right/above/below map literally.displace_object(current_x, current_y, current_z, right, up, forward): user-frame signed-delta on an existing object. Multi-axis (“up and to the left”) in one call.displace_objects(object_ids, current_xs, current_ys, current_zs, right, up, forward): batch user-frame delta over N objects. Returns{"items": [{obj_id, x, y, z}, …]}so the model fans out to Nupdate_primitivecalls with one math call total.place_inside_by_id(movee_id, container_x, container_y, container_z): containment for “put X in Y”. Argument names (movee_idpaired withcontainer_*) force the model to pick the right noun’s coords; the return shape feeds straight intoupdate_primitive.
vec-mcp pure-math primitives are pose-independent:
between_anchors(a_x, a_y, a_z, b_x, b_y, b_z): component-wise midpoint.world_offset(origin_x, origin_y, origin_z, dx, dy, dz): axis-aligned world-Y-up shift.along_direction(origin_x, origin_y, origin_z, target_x, target_y, target_z, distance): origin moveddistancetoward target. Used for “closer to or further from”, which the user-frame helpers can’t model. scale_value(current, factor): scalar multiplication for sizes.
Prompt structure#
The system prompt at worker/prompts/system.txt is worked-example heavy.
It opens with pronoun and reference resolution, then routes placement
utterances through sequential checks before the LLM picks a tool:
FIRST CHECK:
"between"/"middle"/"halfway"→ route tobetween_anchors; stop considering other placement tools.SECOND CHECK: anchor is the user (
"me"/"my") → route toplace_user_relative;place_object_relativewithorigin=user_posreturns the wrong side of the user.THIRD CHECK: proximity to a named object (
"closer to <obj>","toward <obj>") → route toalong_direction. The user’s facing direction is unrelated to where the target object sits, sodisplace_objectis wrong here.
Every rule that’s not obviously self-explanatory has a paired WORKED
EXAMPLE (concrete coords + tool call) and, for the highest-leakage
failure modes, a WORKED ANTI-EXAMPLE. The two-step contract is
hammered: every move emits one math-tool call followed by exactly one
add_primitive/update_primitive call carrying all three of x,
y, z from the math result.
XR session lifecycle#
CloudXR returns XR_ERROR_FORM_FACTOR_UNAVAILABLE from xrGetSystem until
a streaming client connects. LOVR cannot start before then.
1. User opens https://<host>:8080, grants mic + XR permissions
2. User clicks "Launch XR"
3. Client sends `xr.session.started` data message → hub IPC → worker
4. Worker calls render-mcp `start_xr`
→ render-mcp spawns LOVR + waits for CloudXR in a background task
5. Worker polls `get_health` every 500 ms (up to 120s)
lovr_started: true → send `render.ready` to client → XR session unlocked
spawn_error: "..." → log + abort
6. On reconnect / refresh: `xr.session.started` arrives again
→ `_xr_started` is already True → skip spawn, send `render.ready`
immediately
Eval harness#
Offline regression suite for the agentic loop, run against the live model
stack (no LLM/MCP mocks; render-mcp tools are fake-succeeded so the live
LOVR scene is not mutated). Refer to
agent-samples/xr-render-demo/eval/README.md
for the case format and the watch-mode loop. Run with:
agent-samples/xr-render-demo/eval/eval.py
Prompt/eval overlap audit#
The harness audits the system prompt’s worked-example blocks against every
case fixture at startup and warns if they share specifics: verbatim user
utterances (≥12 chars), scene coordinates rendered as (x.xx, y.yy, z.zz),
recent_moves coords, or any reserved colour or shape word that appears in
both a case fixture and a worked-example block. This guards against the eval
cases overfitting to the prompt’s worked examples.