
Deployment options

Use this page to compare the ways you can run NeMo Retriever, including when to use NVIDIA-hosted NIMs versus self-hosting on your own infrastructure.

Compare deployment options

Use the sections below to pick documentation and deployment options that match your goal.

I want to run locally or embed the library

  1. Pre-Requisites & Support Matrix
  2. Use the Python API or Use the CLI — install and run the nemo_retriever package in your environment

I want a Kubernetes / Helm deployment

  1. Pre-Requisites & Support Matrix
  2. NeMo Retriever Helm chart (supported): Deploy (Helm chart) — sources in nemo_retriever/helm on GitHub
  3. Published Library Helm charts (supported): cluster install and upgrade procedures are covered in the NeMo Retriever Library documentation; use them alongside the NeMo Retriever chart README for your release
  4. Environment variables and Troubleshoot as needed

Default NIMs in the published NeMo Retriever Library Helm chart (26.03): page_elements, table_structure, ocr, and vlm_embed (llama-nemotron-embed-vl-1b-v2:1.12.0). Nemotron Parse, Nemotron 3 Nano Omni, and the VL reranker are optional and disabled by default—enable them only when needed. See Pre-Requisites & Support Matrix — Default Helm NIMs.

Docker Compose (unsupported, developer-only): Docker Compose for local development. It is not a substitute for Helm or the published Library charts.

I want examples and notebooks

  1. Jupyter Notebooks
  2. Integrate with LangChain, LlamaIndex, Haystack

I need API details and keys

  1. Get your API key
  2. API reference — PDF pre-splitting if applicable

I am tuning performance or cost

  1. Evaluation and performance
  2. Throughput is dataset-dependent
  3. Evaluate on your data

When to use NVIDIA-hosted NIMs

NVIDIA-hosted NIMs run inference on NVIDIA-managed infrastructure. You call models with API keys (refer to Get your API key) without operating GPU nodes yourself.
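As a rough illustration of the key-based access described above, the following sketch builds an authenticated request to a hosted embedding endpoint. It is not the NeMo Retriever client. The endpoint URL, the model identifier, the input_type field, and the NVIDIA_API_KEY environment variable are all assumptions made for illustration; confirm the actual values in the NVIDIA NIM catalog entry for the model you use.

```python
import json
import os
import urllib.request

# Assumptions for illustration only: confirm the endpoint and model id
# in the NVIDIA NIM catalog entry for the model you actually use.
ENDPOINT = "https://integrate.api.nvidia.com/v1/embeddings"
MODEL = "your-org/your-embedding-model"  # hypothetical placeholder


def build_embedding_request(texts, api_key, endpoint=ENDPOINT, model=MODEL):
    """Build (but do not send) a POST request that carries the API key."""
    payload = {
        "model": model,
        "input": texts,
        # Assumption: some retrieval embedding models expect this field.
        "input_type": "query",
    }
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


def embed(texts, api_key=None):
    """Send the request and return one embedding vector per input text.

    Assumes an OpenAI-style response body: {"data": [{"embedding": [...]}, ...]}.
    """
    # Assumption: the key is exported as NVIDIA_API_KEY in your shell.
    api_key = api_key or os.environ["NVIDIA_API_KEY"]
    request = build_embedding_request(texts, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return [item["embedding"] for item in body["data"]]
```

Calling embed(["what is a NIM?"]) sends document or query text to the hosted service, so run it only after your security review allows that content to leave your network.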

Consider hosted NIMs when:

  • You want the fastest path to try models and iterate without installing drivers, containers, or the NIM Operator on your own clusters.
  • Latency to NVIDIA endpoints works for your region and use case.
  • Your compliance and data policies allow document or query content in the hosted service (confirm with your security review).

Also refer to: NVIDIA NIM catalog

When to self-host NIMs

Self-hosted NIMs run on your GPUs or air-gapped hardware, typically with Kubernetes and the NIM Operator.

Consider self-hosting when:

  • You need an air gap, strict data residency, or customer data must not leave your network.
  • You run at large scale where dedicated capacity can cost less than hosted API usage.
  • You must meet latency or locality requirements that hosted regions cannot satisfy.
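The cost point above can be checked with simple break-even arithmetic. The sketch below is a minimal model with hypothetical prices, not NVIDIA pricing: it assumes a fixed monthly cost for dedicated GPU capacity, a per-1,000-page price for hosted usage, and an optional per-1,000-page variable cost for self-hosting (power, storage, operations).

```python
def breakeven_pages_per_month(
    dedicated_cost_per_month,
    hosted_cost_per_1k_pages,
    self_hosted_variable_cost_per_1k_pages=0.0,
):
    """Return the monthly page volume above which self-hosting is cheaper.

    All inputs are hypothetical figures you supply from your own quotes.
    Returns None when hosted usage is never more expensive per page.
    """
    saving_per_1k = hosted_cost_per_1k_pages - self_hosted_variable_cost_per_1k_pages
    if saving_per_1k <= 0:
        return None  # no volume at which dedicated capacity pays off
    return dedicated_cost_per_month / saving_per_1k * 1000


# Hypothetical numbers: 6000 per month for a GPU node, 2.00 per 1,000 pages hosted.
print(breakeven_pages_per_month(6000, 2.0))       # 3000000.0 pages per month
print(breakeven_pages_per_month(6000, 2.0, 0.5))  # 4000000.0 pages per month
```

Below the break-even volume the hosted option is cheaper under these assumptions. Because throughput is dataset-dependent, measure pages per GPU-hour on your own data before trusting any such estimate.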

GPU sharing. The NIM Operator supports time-slicing and MIG so multiple NIM workloads can share GPUs. A NIM used with NeMo Retriever Library does not always need a full dedicated GPU when the operator and GPU profile are set correctly. For scheduling and GPU partitioning, refer to the NIM Operator documentation.

Related