Quickstart: retriever CLI
Quick start
Local Docker Compose workflows are developer tooling only and are not a supported deployment path; see
docker.md (the GitHub link targets HEAD of the default branch; pin it to your release tag when you are not on main).
For supported deployment of NeMo Retriever / NIM containers, use nemo_retriever/helm and the NeMo Retriever Library Helm install guides.
Ingest a PDF
retriever pipeline run ./data/multimodal_test.pdf \
--input-type pdf \
--method pdfium \
--extract-text --extract-tables --extract-charts \
--store-images-uri ./processed_docs/images \
--save-intermediate ./processed_docs
For a lightweight PDF-only workflow:
retriever ingest ./data/multimodal_test.pdf
retriever query "What is in this document?"
Route stages to self-hosted or hosted NIM endpoints by passing only the URLs you want to override:
export NVIDIA_API_KEY=nvapi-...
retriever ingest ./data/multimodal_test.pdf \
--page-elements-invoke-url https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-page-elements-v3 \
--ocr-invoke-url https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-ocr-v1 \
--ocr-version v1 \
--graphic-elements-invoke-url https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-graphic-elements-v1 \
--table-structure-invoke-url https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-table-structure-v1 \
--embed-invoke-url https://integrate.api.nvidia.com/v1/embeddings \
--embed-model-name nvidia/llama-nemotron-embed-1b-v2
retriever query "What is in this document?" \
--embed-invoke-url https://integrate.api.nvidia.com/v1/embeddings \
--embed-model-name nvidia/llama-nemotron-embed-1b-v2 \
--reranker-invoke-url https://ai.api.nvidia.com/v1/retrieval/nvidia/llama-nemotron-rerank-vl-1b-v2/reranking
NVIDIA_API_KEY is required only when those URLs point at hosted
build.nvidia.com endpoints. NGC_API_KEY is used separately when pulling or
running self-hosted NIM containers.
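The rule above (a key is needed only for hosted URLs) can be checked before launching a run. The following is a minimal sketch, not part of the retriever CLI: the helper names `is_hosted` and `missing_api_key` are hypothetical, and the hostname list is an assumption taken from the example URLs in this section.

```python
import os
from urllib.parse import urlparse

# Assumption: hosted endpoints are recognized by these hostnames, copied from
# the example URLs in this quickstart. Adjust the tuple if yours differ.
HOSTED_HOSTS = ("ai.api.nvidia.com", "integrate.api.nvidia.com")


def is_hosted(url: str) -> bool:
    """Return True if the URL's hostname is one of the assumed hosted hosts."""
    host = urlparse(url).hostname or ""
    return host in HOSTED_HOSTS


def missing_api_key(urls, env=None) -> bool:
    """Return True if any URL is hosted and NVIDIA_API_KEY is unset or empty."""
    env = os.environ if env is None else env
    needs_key = any(is_hosted(u) for u in urls)
    return needs_key and not env.get("NVIDIA_API_KEY")
```

Calling `missing_api_key([...])` with the URLs you plan to pass to `retriever ingest` gives an early warning instead of a failed request partway through a run. Self-hosted URLs such as `http://localhost:8000/...` never require the key under this check.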
What you get
- Extracted text, tables, and charts as rows in LanceDB at ./lancedb (default table name nv-ingest).
- Per-document Parquet under ./processed_docs/ (--save-intermediate).
- Image assets under ./processed_docs/images/ (--store-images-uri).
- Progress and stage logs on stderr.
Inspect the results
ls ./processed_docs
ls ./processed_docs/images
ls ./lancedb
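import lancedb
import pyarrow.parquet as pq

# Load the per-document Parquet output written by --save-intermediate.
df = pq.read_table("./processed_docs").to_pandas()
print(df.head())

# Open the LanceDB table that holds the ingested rows (default name: nv-ingest).
db = lancedb.connect("./lancedb")
tbl = db.open_table("nv-ingest")
print(tbl.to_pandas().head())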
Or query via the Retriever Python client (nemo_retriever/README.md):
from nemo_retriever.retriever import Retriever
retriever = Retriever(lancedb_uri="lancedb", lancedb_table="nv-ingest", top_k=5)
hits = retriever.query(
"Given their activities, which animal is responsible for the typos?"
)
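Because ./processed_docs also holds the images/ subdirectory, reading the whole directory as one Parquet dataset may trip over non-Parquet files. A more defensive sketch is shown here; it assumes the intermediate files use a `.parquet` extension and share compatible columns, and the helper names `find_parquet_files` and `load_intermediate` are hypothetical.

```python
from pathlib import Path


def find_parquet_files(root):
    """Return sorted Parquet file paths under root, skipping other files."""
    return sorted(p for p in Path(root).rglob("*.parquet") if p.is_file())


def load_intermediate(root="./processed_docs"):
    """Read every Parquet file under root into one pandas DataFrame."""
    # Imported lazily so file discovery works without these packages installed.
    import pandas as pd
    import pyarrow.parquet as pq

    files = find_parquet_files(root)
    if not files:
        raise FileNotFoundError(f"no .parquet files found under {root}")
    # Read each file on its own, then stack the rows into a single frame.
    frames = [pq.read_table(str(f)).to_pandas() for f in files]
    return pd.concat(frames, ignore_index=True)
```

`load_intermediate().head()` then gives the same kind of preview as the snippet above, while image assets are ignored.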
Larger datasets
- Batch ingest: retriever ingest ./data/pdf_corpus --run-mode batch.
- Tune throughput with --pdf-extract-workers, --pdf-extract-batch-size, --page-elements-workers, --page-elements-batch-size, --ocr-workers, --ocr-batch-size, --embed-workers, and --embed-batch-size.
- For CI or debugging: --run-mode inprocess skips Ray startup.
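When scripting larger runs, it can help to assemble the command from a small set of tuning values. The sketch below only builds an argument list and does not run anything. The flag names come from the list above; the function `build_ingest_command`, the numeric values, and the assumption that each tuning flag takes a single value are illustrative.

```python
import shlex

# Flag names are taken from the list above; values passed in are illustrative.
TUNING_FLAGS = (
    "pdf-extract-workers", "pdf-extract-batch-size",
    "page-elements-workers", "page-elements-batch-size",
    "ocr-workers", "ocr-batch-size",
    "embed-workers", "embed-batch-size",
)


def build_ingest_command(input_path, run_mode="batch", **tuning):
    """Build an argv list for `retriever ingest` with optional tuning flags."""
    argv = ["retriever", "ingest", input_path, "--run-mode", run_mode]
    for name, value in tuning.items():
        # Map a Python keyword such as ocr_workers to the flag --ocr-workers.
        flag = name.replace("_", "-")
        if flag not in TUNING_FLAGS:
            raise ValueError(f"unknown tuning flag: --{flag}")
        argv += [f"--{flag}", str(value)]
    return argv


# Hypothetical values; pick numbers that suit your hardware.
cmd = build_ingest_command("./data/pdf_corpus", ocr_workers=4, embed_batch_size=64)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Rejecting unknown keywords catches typos in flag names before the CLI is invoked.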