AI inference servers#
Read this when calling or operating an inference server. For the orchestrator pattern that wires servers into a sample, refer to Adding a new sample.
Multiple reusable HTTP servers are available as launchable peers of
server-runtime/. All expose an OpenAI-compatible REST API so agent workers
can call them with any OpenAI SDK client or plain httpx or requests.
Reference services cover vision-language reasoning, speech recognition,
text-to-speech, and large language models. Three LLM backends ship
side-by-side under ai-services/llm/ — pick one per sample based on the
tool-calling, reasoning, and hardware trade-offs documented below.
Server |
Command |
Port |
Model |
Backend |
|---|---|---|---|---|
|
|
8100 |
Cosmos-Reason1-7B |
vLLM (pip or docker) |
|
|
8103 |
parakeet-tdt-0.6b-v3 |
NeMo ASR in-process |
|
|
8104 |
magpie_tts_multilingual_357m |
NeMo TTS in-process |
|
|
8105 |
rhasspy/piper-voices (ONNX) |
piper-tts in-process |
|
|
8106 |
Llama-3.1-Nemotron-Nano-8B-v1 |
vLLM (pip or docker) |
|
|
8107 |
NVIDIA-Nemotron-3-Nano-30B-A3B-{NVFP4,FP8} |
vLLM (pip or docker) |
|
|
8108 |
Nemotron-3-Nano-Omni-30B-A3B-Reasoning (NVFP4, FP8, or BF16, GPU-selected) |
vLLM (pip or docker) — multimodal (text + video) |
|
|
8200 |
— |
JSONL + FastMCP |
|
|
8210 |
— |
FastMCP → XR-Media-Hub |
|
|
8240 |
— |
FastMCP → vlm-server ( |
All model weights land in models/ at the repository root (not checked into version control, shared across
all servers). Each YAML configures model_cache — resolved relative to the
YAML file.
Adding a server to a sample#
1 — Add the process to the orchestrator:
PROCESSES = [
Process("hub", "../../server-runtime", "xr_media_hub"),
Process("vlm", "../../ai-services/vlm-server", "vlm_server"), # ← add as needed
# Pick ONE LLM backend per sample — they bind different default ports
# (8106 / 8107) so running more than one at once is allowed but
# usually unnecessary.
Process("llm", "../../ai-services/llm/llama_nemotron", "llama_nemotron_llm_server"),
# Process("llm", "../../ai-services/llm/nemotron3_nano", "nemotron3_nano_llm_server"),
Process("stt", "../../ai-services/stt-server", "stt_server"),
# Pick one TTS server
Process("tts", "../../ai-services/tts/piper", "piper_tts_server"),
# Process("tts", "../../ai-services/tts/magpie", "magpie_tts_server"),
Process("worker", "worker", "my_agent_worker"),
]
The agent samples in this repository (simple-vlm-example and xr-render-demo)
default to Piper TTS — it runs on CPU with ~100 ms/sentence latency and avoids
the NeMo dep tree. Magpie is still a supported NVIDIA TTS option with better
voice quality and multilingual support when GPU is available; swap the
Process row and YAML.
2 — Copy the reference YAML to your sample’s yaml/ directory:
mkdir -p yaml
cp ../../ai-services/vlm-server/vlm_server.yaml ./yaml/vlm_server.yaml
# Pick ONE LLM YAML — copy the one matching the Process you picked above.
cp ../../ai-services/llm/llama_nemotron/llama_nemotron_llm_server.yaml ./yaml/llama_nemotron_llm_server.yaml
# cp ../../ai-services/llm/nemotron3_nano/nemotron3_nano_llm_server.yaml ./yaml/nemotron3_nano_llm_server.yaml
cp ../../ai-services/stt-server/stt_server.yaml ./yaml/stt_server.yaml
cp ../../ai-services/tts/piper/piper_tts_server.yaml ./yaml/piper_tts_server.yaml
# Or for Magpie (multilingual, GPU, ~2-5 s/sentence):
cp ../../ai-services/tts/magpie/magpie_tts_server.yaml ./yaml/magpie_tts_server.yaml
# MCP servers:
cp ../../agent-mcp-servers/transcript-mcp/transcript_mcp_server.yaml ./yaml/transcript_mcp_server.yaml
cp ../../agent-mcp-servers/video-mcp/video_mcp_server.yaml ./yaml/video_mcp_server.yaml
Edit the YAML as needed (model, port, device, etc.). The launcher auto-discovers
yaml/<command>.yaml in the sample root and passes it as --config.
Calling these from a worker#
Workers do not hand-roll httpx clients against these endpoints. They
depend on agent-sdk/xr-ai-models,
load a per-sample yaml/models.yaml, and construct service clients via
make_llm, make_vlm, make_stt, and make_tts. The SDK encapsulates the
OpenAI-compatible wire format and the per-model quirks (reasoning-field
aliasing, chat_template_kwargs, served-model-name strings) so callers
never branch on backend.
from xr_ai_models import load_models_config, make_llm, ChatMessage
config = load_models_config("yaml/models.yaml")
async with make_llm(config, "agent_llm") as llm:
resp = await llm.chat(
[ChatMessage(role="user", content="hello")],
max_tokens=128,
enable_thinking=True,
)
print(resp.content, resp.reasoning)
A matching models.yaml for the four built-in service backends:
agent_llm:
kind: preset:nemotron3_nano
base_url: http://localhost:8107
vlm:
kind: preset:cosmos_vlm
base_url: http://localhost:8100
stt:
kind: preset:parakeet_stt
base_url: http://localhost:8103
tts:
kind: preset:piper_tts
base_url: http://localhost:8105
Swapping a backend is a kind: + base_url: edit in YAML; worker code does
not change. Full protocol surface, the preset table, and the explicit
(no-preset) specification are in
agent-sdk/xr-ai-models/README.md.
Hosting models on NVIDIA NIM#
The LLM and VLM can run on NVIDIA NIM instead of
local vLLM — NIM exposes the same OpenAI-compatible /v1/chat/completions
API, so this is a models.yaml change with no worker code edits. STT and TTS
stay local: hosted NIM speech (Riva) is not OpenAI /v1/audio-compatible.
A NIM model entry differs from a local one in three fields:
vlm:
kind: openai_compat
category: vlm
base_url: https://integrate.api.nvidia.com # client appends /v1/...
model_name: nvidia/cosmos-reason1-7b # confirm slug at build.nvidia.com
api_key_env: NGC_API_KEY # → Authorization: Bearer
health_check: false # hosted NIM has no /health
capabilities: { vision: true, streaming: true }
api_key_env: NGC_API_KEYsends the key as a bearer token. The key is a managed credential —run_stackinjects a savedNGC_API_KEYinto every subprocess (refer todocs/credentials.md); or export it.health_check: falseis required for hosted endpoints — they have no local/healthroute, so the worker readiness gate must not probe them. (Default istruefor local servers.)model_nameis the hosted model id from build.nvidia.com.
Each sample ships a ready-made yaml/models.nim.yaml overlay, selected by a
single key — no main.py edits. To switch a sample to NIM:
Set
model_backend: nimin the sample’s*_worker.yaml(defaultlocal). The worker then loadsmodels.nim.yaml, and the orchestrator (which reads the same key) skips the local model server(s) NIM replaces — for xr-render-demo it also pointsvlm-mcpatyaml/vlm_mcp_server.nim.yaml.Provide
NGC_API_KEY— in NIM mode the orchestrator prompts for it once if it isn’t already saved or exported.For xr-render-demo, run the demo without the local
llm,agent-llm, andvlmmodel-servers (they’relaunch_mode="reuse", so just don’t start them in the model-servers stack).
Set model_backend: local to switch back.
Self-hosted NIM containers work the same way: point base_url at the
container (e.g. http://localhost:8000) and set health_check: true if it
exposes /v1/health.
vLLM model persistence#
The persistent vLLM-backed servers (vlm_server, llama_nemotron_llm_server,
nemotron3_nano_llm_server) survive stack restarts by design.
nemotron_omni_llm_server is foreground (dies with the wrapper). Each
persistent wrapper script checks its health endpoint before spawning vLLM:
Already running → touch the ready file immediately, then idle. Stack is ready in seconds; no model reload.
Not running → spawn vLLM normally, wait for
/health, touch ready file.
In pip mode, vLLM is spawned with start_new_session=True so the launcher’s
killpg() does not reach it on shutdown. In docker mode, the container is
launched detached (docker run -d --name xr-ai-vllm-<service>) so it
similarly outlives the wrapper. Either way the wrapper exits cleanly and
vLLM keeps running.
Stopping the persisted servers — run from the sample directory:
uv run xr_render_demo --stop
This hits each model server’s /health endpoint, then either runs
docker stop <container_name> (docker-mode servers) or finds the listening
PID via ss or lsof and sends SIGTERM (pip-mode), escalating to
docker kill or SIGKILL after 20 s. It is safe to run while the stack is
down — processes and containers that are not running are silently skipped.
The target ports and container names match the defaults in the per-profile YAML files.
Choosing the vLLM runtime (pip vs Docker)#
All four vLLM-backed servers (vlm_server, llama_nemotron_llm_server,
nemotron3_nano_llm_server, nemotron_omni_llm_server) accept a
vllm_backend: key in their YAML to pick how vLLM is hosted:
|
Runtime |
Default |
Use when |
|---|---|---|---|
|
|
yes |
Standard development; fastest iteration; works offline once weights are cached. |
|
|
no |
Trying NVIDIA’s optimized vLLM container; pinning a specific NGC release; reproducing a deployment image. |
Both modes honor identical configuration keys — same model, same port, same vLLM
flags. The dispatcher lives in utils/xr-ai-vllm/. Switching is one YAML edit:
vllm_backend: docker
vllm_image: nvcr.io/nvidia/vllm:26.04-py3
vllm_image: defaults to nvcr.io/nvidia/vllm:26.04-py3; override to pin
another tag, an internal mirror, or a custom build.
docker mode — prerequisites#
Docker Engine with the user in the
dockergroup (docker versionmust succeed withoutsudo).NVIDIA Container Toolkit so
--gpusworks: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.htmlNGC pull access for
nvcr.io/nvidia/vllm. The wrapper auto-runsdocker login nvcr.ioifNGC_API_KEYis in the environment (loaded byload_credentials()from~/.config/xr-ai/credentials.jsonperdocs/credentials.md). Otherwise, log in manually once:docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY
Existing ~/.docker/config.json entries take priority and are not overwritten.
docker mode — runtime details#
Container is launched with
--network host --ipc host --gpus …(matches gives vLLM the shared-memory region its workers expect).The host
model_cacheis bind-mounted at the same path inside the container andHF_HOMEis set to it, so weights cached by pip mode are reused by docker mode and vice versa.Container name is deterministic per service:
xr-ai-vllm-vlm-server,xr-ai-vllm-llama-nemotron-llm-server,xr-ai-vllm-nemotron3-nano-llm-server,xr-ai-vllm-nemotron-omni-llm-server.Persistence parity:
vlm_server,llama_nemotron_llm_server, andnemotron3_nano_llm_serverrun detached (docker run -d --rm --name …) so the container survives stack restarts, mirroring their pip-modestart_new_session=Truebehavior.nemotron_omni_llm_serverruns foreground (container exits with the wrapper) — same as its pip-mode semantics.
Cleanup#
uv run xr_render_demo --stop works for both modes. The cleanup path probes
/health first; for docker mode it then runs docker stop <container_name>
(escalating to docker kill after 20 s); for pip mode it falls back to the
port → PID → SIGTERM/SIGKILL path. Same UX for both.
Per-server notes#
vlm-server is a thin launcher around
vllm servefor Cosmos-Reason1-7B (or any Qwen2.5-VL-compatible VLM). vLLM handles weight loading, image decoding, and the OpenAI-compatible HTTP API. Hosting backend is selectable per YAML — refer to Choosing the vLLM runtime above.llm/llama_nemotron is a thin wrapper around
vllm serveforLlama-3.1-Nemotron-Nano-8B-v1. vLLM handles native Llama-3.1 tool calling via thellama3_jsonparser —tools=[...]in the request is rendered via the model’s chat template and the resulting tool calls come back in OpenAI wire format (finish_reason: "tool_calls"). Per-turn reasoning toggle via"detailed thinking on"or"detailed thinking off"in a system or user message; reasoning preamble is not stripped server-side. Hosting backend is selectable per YAML (refer to Choosing the vLLM runtime). Refer toai-services/llm/llama_nemotron/README.mdfor the full HTTP contract and tuning knobs.llm/nemotron3_nano is a thin wrapper around
vllm serveforNVIDIA-Nemotron-3-Nano-30B-A3B-{NVFP4,FP8}(auto-selected by GPU compute capability). vLLM handles tool calling (qwen3_coderparser), reasoning extraction (nano_v3parser — auto-fetched intomodel_cache), and FlashInfer FP4 MoE kernels. Requires a Blackwell-class GPU (B200 or RTX PRO 6000) for native FP4; swap to the FP8 or BF16 variants for Hopper and Ampere.enforce_eager: trueby default to avoid the silent 3–8 min CUDA graph and FlashInfer autotune on cold start. Hosting backend is selectable per YAML (refer to Choosing the vLLM runtime). Refer toai-services/llm/nemotron3_nano/README.mdfor the vLLM flags it forwards and Blackwell prerequisites.llm/nemotron_omni is a vLLM-backed multimodal LLM serving
Nemotron-3-Nano-Omni-30B-A3B-Reasoning(text + video input) at port 8108. The YAML auto-selects between three model variants by detected GPU compute capability: NVFP4 on Blackwell (SM100+), FP8 on Ada and Hopper, BF16 forced viause_bf16: truefor highest quality at the largest VRAM cost. Same OpenAI-compatible HTTP contract as the other LLM servers — swap the port to swap backends. Hosting backend is selectable per YAML (refer to Choosing the vLLM runtime); runs foreground in both pip and docker modes (no cross-restart persistence).stt-server loads parakeet-tdt-0.6b-v3 via NeMo ASR in-process. English-only; the
languageandtemperatureform fields are accepted but ignored.tts/magpie loads magpie_tts_multilingual_357m via NeMo TTS in-process.
tts/piper serves any rhasspy/piper-voices ONNX voice; ~100 ms/sentence on CPU. All inference runs in a thread pool so the asyncio loop is never blocked.
transcript-mcp-server is pure FastMCP at
/mcpon port 8200. Records are keyed by free-formsource_id(live participant identity or an internal source name like"agent-vlm"). Tools:query_transcripts,add_transcript(worker ingest),list_sources,get_transcript_stats. Transcripts persist as JSONL alongside a.identitysidecar so list and query round-trip raw IDs cleanly even when sanitized filenames collide.video-mcp-server is pure FastMCP at
/mcpon port 8210. Connects to the hub as aProcessorEndpoint(Subscribe.VIDEO) for live frames. Tools exposed depend on whetherrecordings_diris set in the YAML:Always:
list_live_participants.Recording disabled:
get_latest_frame(live IPC frame, no recording needed).Recording enabled (
recordings_dirset,video_recording.enabled: trueinxr_media_hub.yamlwith a matchingout_dir):get_frame_from_time,list_recorded_participants,get_video_stats,query_video(historical chunk lookup via NVDEC).
Ports are configurable — avoid conflicts with LiveKit (7880–7882) and hub (8080, 8090).
Sample YAMLs for each service ship in their own service directory. Copy them to your sample root and adjust
model_cache(../../modelsresolves toxr-ai/models/from anyagent-samples/<name>/directory).