Deployment options
Use this page to compare how you run NeMo Retriever — including when to use NVIDIA-hosted NIMs versus self-hosting on your own infrastructure.
Compare deployment options
Use the sections below to pick documentation and deployment options that match your goal.
I want to run locally or embed the library
- Pre-Requisites & Support Matrix
- Use the Python API or Use the CLI: install and run the nemo_retriever package in your environment
I want a Kubernetes / Helm deployment
- Pre-Requisites & Support Matrix
- NeMo Retriever Helm chart (supported): Deploy (Helm chart); sources in nemo_retriever/helm on GitHub
- Published Library Helm charts (supported): cluster install and upgrade procedures are covered in the NeMo Retriever Library; use alongside the NeMo Retriever chart README for your release
- Environment variables and Troubleshoot as needed
Default NIMs in the published NeMo Retriever Library Helm chart (26.03): page_elements, table_structure, ocr, and vlm_embed (llama-nemotron-embed-vl-1b-v2:1.12.0). Nemotron Parse, Nemotron 3 Nano Omni, and the VL reranker are optional and disabled by default—enable them only when needed. See Pre-Requisites & Support Matrix — Default Helm NIMs.
Docker Compose (unsupported, developer-only): Docker Compose for local development — not a substitute for Helm or the published Library charts.
I want examples and notebooks
I need API details and keys
- Get your API key
- API reference — PDF pre-splitting if applicable
I am tuning performance or cost
When to use NVIDIA-hosted NIMs
NVIDIA-hosted NIMs run inference on NVIDIA-managed infrastructure. You call models with API keys (refer to Get your API key) without operating GPU nodes yourself.
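As a rough illustration of what calling a model with an API key involves at the HTTP level, the sketch below builds an authenticated request without sending it. The endpoint URL, the payload shape, and the `NVIDIA_API_KEY` environment variable name are placeholders assumed for this example and do not come from this page; take the real endpoint and request schema from the API reference for the model you call.

```python
import json
import os
import urllib.request

# Hypothetical placeholder endpoint; replace with the URL from the API reference.
ENDPOINT = "https://example.invalid/v1/embeddings"


def build_request(api_key: str, texts: list[str]) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    # The payload shape is an assumption for illustration only.
    body = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Assumed variable name; read the key from the environment, never hard-code it.
api_key = os.environ.get("NVIDIA_API_KEY", "<your-api-key>")
request = build_request(api_key, ["What is a NIM?"])
print(request.get_method(), request.full_url)
```

Reading the key from the environment keeps it out of source control, which matters because the key is the only credential that guards hosted usage.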
Consider hosted NIMs when:
- You want the fastest path to try models and iterate without installing drivers, containers, or the NIM Operator on your own clusters.
- Latency to NVIDIA endpoints works for your region and use case.
- Your compliance and data policies allow document or query content in the hosted service (confirm with your security review).
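One way to check the latency criterion above is to time a handful of representative calls from the region where your application runs. The helper below is a generic sketch: `median_latency_ms` and `fake_request` are hypothetical names, and the sleep stands in for a real request that you would supply yourself.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(call: Callable[[], object], samples: int = 20) -> float:
    """Return the median wall-clock time of `call` in milliseconds."""
    if samples < 1:
        raise ValueError("samples must be at least 1")
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        call()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in workload; replace with a function that sends one real request.
def fake_request() -> None:
    time.sleep(0.01)


print(f"median latency: {median_latency_ms(fake_request, samples=5):.1f} ms")
```

The median is used rather than the mean so that a single slow call, such as a cold connection, does not dominate the result.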
Also refer to: NVIDIA NIM catalog
When to self-host NIMs
Self-hosted NIMs run on your GPUs or air-gapped hardware, typically with Kubernetes and the NIM Operator.
Consider self-hosting when:
- You need an air gap, strict data residency, or customer data must not leave your network.
- You run at large scale where dedicated capacity can cost less than hosted API usage.
- You must meet latency or locality requirements that hosted regions cannot satisfy.
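The cost criterion above comes down to a break-even calculation: the request volume at which a fixed monthly self-hosting cost equals pay-per-use hosted spend. The sketch below uses made-up prices purely to show the arithmetic. It assumes hosted cost scales linearly with request count and that self-hosted cost is fixed, which ignores operations effort and capacity headroom.

```python
def breakeven_requests_per_month(
    hosted_cost_per_1k_requests: float,
    self_hosted_fixed_cost_per_month: float,
) -> float:
    """Monthly request volume at which both options cost the same."""
    if hosted_cost_per_1k_requests <= 0:
        raise ValueError("hosted cost must be positive")
    return self_hosted_fixed_cost_per_month / hosted_cost_per_1k_requests * 1000


# Hypothetical prices, chosen only to make the arithmetic easy to follow.
volume = breakeven_requests_per_month(
    hosted_cost_per_1k_requests=0.50,
    self_hosted_fixed_cost_per_month=4000.0,
)
print(f"Break-even at about {volume:,.0f} requests per month")
# prints "Break-even at about 8,000,000 requests per month"
```

Below the break-even volume the hosted option is cheaper under these assumptions; above it, dedicated capacity starts to pay off.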
GPU sharing. The NIM Operator supports time-slicing and MIG so multiple NIM workloads can share GPUs. A NIM used with NeMo Retriever Library does not always need a full dedicated GPU when the operator and GPU profile are set correctly. For scheduling and GPU partitioning, refer to the NIM Operator documentation.
Related
- Deploy (Helm chart) (nemo_retriever/helm on GitHub)
- NeMo Retriever Library: prerequisites / deployment (supported Helm handoff)
- Pre-Requisites & Support Matrix
- Docker Compose (unsupported): docker.md — local developer tooling only