Use a Local Inference Server#

NemoClaw can route inference to a model server running on your machine instead of a cloud API. This page covers Ollama, compatible-endpoint paths for other servers, and two experimental options for vLLM and NVIDIA NIM.

All approaches use the same inference.local routing model. The agent inside the sandbox never connects to your model server directly. OpenShell intercepts inference traffic and forwards it to the local endpoint you configure.

Prerequisites#

  • NemoClaw installed. Refer to the Quickstart if you have not installed yet.

  • A local model server running, or Ollama installed. The NemoClaw onboard wizard can also start Ollama for you.

Ollama#

Ollama is the default local inference option. The onboard wizard detects Ollama automatically when it is installed or running on the host.

If Ollama is not running, NemoClaw starts it for you. On macOS, the wizard also offers to install Ollama through Homebrew if it is not present.

Run the onboard wizard.

$ nemoclaw onboard

Select Local Ollama from the provider list. NemoClaw lists installed models or offers starter models if none are installed. It pulls the selected model, loads it into memory, and validates it before continuing.

Linux with Docker#

On Linux hosts that run NemoClaw with Docker, the sandbox reaches Ollama through http://host.openshell.internal:11434, not the host shell’s localhost socket. If Ollama is already running, make sure it listens on 0.0.0.0:11434 instead of 127.0.0.1:11434.

$ OLLAMA_HOST=0.0.0.0:11434 ollama serve

If Ollama only binds loopback, NemoClaw can detect it on the host, but the sandbox-side validation step fails because containers cannot reach it.

Non-Interactive Setup#

$ NEMOCLAW_PROVIDER=ollama \
  NEMOCLAW_MODEL=qwen2.5:14b \
  nemoclaw onboard --non-interactive

If NEMOCLAW_MODEL is not set, NemoClaw selects a default model based on available memory.

Variable

Purpose

NEMOCLAW_PROVIDER

Set to ollama.

NEMOCLAW_MODEL

Ollama model tag to use. Optional.

OpenAI-Compatible Server#

This option works with any server that implements /v1/chat/completions, including vLLM, TensorRT-LLM, llama.cpp, LocalAI, and others.

Start your model server. The examples below use vLLM, but any OpenAI-compatible server works.

$ vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000

Run the onboard wizard.

$ nemoclaw onboard

When the wizard asks you to choose an inference provider, select Other OpenAI-compatible endpoint. Enter the base URL of your local server, for example http://localhost:8000/v1.

The wizard prompts for an API key. If your server does not require authentication, enter any non-empty string (for example, dummy).

NemoClaw validates the endpoint by sending a test inference request before continuing.

Non-Interactive Setup#

Set the following environment variables for scripted or CI/CD deployments.

$ NEMOCLAW_PROVIDER=custom \
  NEMOCLAW_ENDPOINT_URL=http://localhost:8000/v1 \
  NEMOCLAW_MODEL=meta-llama/Llama-3.1-8B-Instruct \
  COMPATIBLE_API_KEY=dummy \
  nemoclaw onboard --non-interactive

Variable

Purpose

NEMOCLAW_PROVIDER

Set to custom for an OpenAI-compatible endpoint.

NEMOCLAW_ENDPOINT_URL

Base URL of the local server.

NEMOCLAW_MODEL

Model ID as reported by the server.

COMPATIBLE_API_KEY

API key for the endpoint. Use any non-empty value if authentication is not required.

Anthropic-Compatible Server#

If your local server implements the Anthropic Messages API (/v1/messages), choose Other Anthropic-compatible endpoint during onboarding instead.

$ nemoclaw onboard

For non-interactive setup, use NEMOCLAW_PROVIDER=anthropicCompatible and set COMPATIBLE_ANTHROPIC_API_KEY.

$ NEMOCLAW_PROVIDER=anthropicCompatible \
  NEMOCLAW_ENDPOINT_URL=http://localhost:8080 \
  NEMOCLAW_MODEL=my-model \
  COMPATIBLE_ANTHROPIC_API_KEY=dummy \
  nemoclaw onboard --non-interactive

vLLM Auto-Detection (Experimental)#

When vLLM is already running on localhost:8000, NemoClaw can detect it automatically and query the /v1/models endpoint to determine the loaded model.

Set the experimental flag and run onboard.

$ NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard

Select Local vLLM [experimental] from the provider list. NemoClaw detects the running model and validates the endpoint.

Note

NemoClaw forces the chat/completions API path for vLLM. The vLLM /v1/responses endpoint does not run the --tool-call-parser, so tool calls arrive as raw text.

Non-Interactive Setup#

$ NEMOCLAW_EXPERIMENTAL=1 \
  NEMOCLAW_PROVIDER=vllm \
  nemoclaw onboard --non-interactive

NemoClaw auto-detects the model from the running vLLM instance. To override the model, set NEMOCLAW_MODEL.

NVIDIA NIM (Experimental)#

NemoClaw can pull, start, and manage a NIM container on hosts with a NIM-capable NVIDIA GPU.

Set the experimental flag and run onboard.

$ NEMOCLAW_EXPERIMENTAL=1 nemoclaw onboard

Select Local NVIDIA NIM [experimental] from the provider list. NemoClaw filters available models by GPU VRAM, pulls the NIM container image, starts it, and waits for it to become healthy before continuing.

Note

NIM uses vLLM internally. The same chat/completions API path restriction applies.

Non-Interactive Setup#

$ NEMOCLAW_EXPERIMENTAL=1 \
  NEMOCLAW_PROVIDER=nim \
  nemoclaw onboard --non-interactive

To select a specific model, set NEMOCLAW_MODEL.

Verify the Configuration#

After onboarding completes, confirm the active provider and model.

$ nemoclaw <name> status

The output shows the provider label (for example, “Local vLLM” or “Other OpenAI-compatible endpoint”) and the active model.

Switch Models at Runtime#

You can change the model without re-running onboard. Refer to Switch Inference Models for the full procedure.

For compatible endpoints, the command is:

$ openshell inference set --provider compatible-endpoint --model <model-name>

If the provider itself needs to change (for example, switching from vLLM to a cloud API), rerun nemoclaw onboard.

Next Steps#