Quickstart: retriever CLI
Use retriever ingest and retriever query for product-facing workflows. retriever pipeline run is for development and compatibility only.
Quick start
Ingest a PDF locally
retriever ingest ./data/multimodal_test.pdf \
--method pdfium \
--extract-text --extract-tables --extract-charts \
--use-table-structure \
--embed-model-name nvidia/llama-nemotron-embed-1b-v2
Then query the default LanceDB table:
retriever query "What is in this document?" \
--embed-model-name nvidia/llama-nemotron-embed-1b-v2
By default, local ingest writes to lancedb/nemo-retriever and retriever query
reads from the same table.
The plain retriever query examples below apply to local and batch ingest output
written to LanceDB. Use retriever query service to query a Retriever service.
Ingest a larger corpus with batch mode
retriever ingest batch ./data/pdf_corpus \
--profile fast-text \
--pdf-extract-workers 4 \
--embed-workers 2
Batch mode exposes Ray runtime and batch tuning flags such as --ray-address,
--pdf-extract-workers, --ocr-workers, and --embed-workers.
Ingest through a Retriever service
retriever ingest service ./data/pdf_corpus \
--service-url http://localhost:7670 \
--service-concurrency 8
Use --service-api-token or NEMO_RETRIEVER_API_TOKEN when the service requires
a bearer token. Service ingest does not expose --lancedb-uri; the service
configures its vector database. Query the service with:
retriever query service "What is in this corpus?" \
--service-url http://localhost:7670
Route ingest to hosted or self-hosted NIM endpoints
export NVIDIA_API_KEY=nvapi-...
retriever ingest ./data/multimodal_test.pdf \
--page-elements-invoke-url https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-page-elements-v3 \
--ocr-invoke-url https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-ocr-v1 \
--table-structure-invoke-url https://ai.api.nvidia.com/v1/cv/nvidia/nemotron-table-structure-v1 \
--embed-invoke-url https://integrate.api.nvidia.com/v1/embeddings \
--embed-model-name nvidia/llama-nemotron-embed-1b-v2
NVIDIA_API_KEY is required only when those URLs point at hosted
build.nvidia.com endpoints. NGC_API_KEY is used separately when pulling or
running self-hosted NIM containers.
For NVIDIA inference hub rerank models that expose the Cohere-style rerank
route, pass the full /v1/rerank URL and the model name shown in the hub
snippet:
export NGC_INFERENCE_API_KEY=...
retriever query "What is in this document?" \
--embed-invoke-url https://integrate.api.nvidia.com/v1/embeddings \
--embed-model-name nvidia/llama-nemotron-embed-1b-v2 \
--reranker-invoke-url https://inference-api.nvidia.com/v1/rerank \
--reranker-model-name nvidia/nvidia/llama-3.2-nv-rerankqa-1b-v2 \
--reranker-api-key-env NGC_INFERENCE_API_KEY
Query result controls
Both retriever query and retriever query service return compact JSON hits
with source, page_number, and text. Use --candidate-k, --page-dedup,
and --content-types to control how results are selected after vector
retrieval:
retriever query "annual revenue by region" \
--top-k 5 \
--candidate-k 40 \
--content-types table
--top-k is the final number of results to return after filtering and
deduplication. --candidate-k is the number of raw results to retrieve from
LanceDB or the Retriever service before filtering, page deduplication, and
final truncation. If omitted, the candidate pool is the same size as
--top-k. Set --candidate-k larger than --top-k when page deduplication
or content-type filtering might remove too many of the nearest retrieved rows.
--candidate-k must always be greater than or equal to --top-k.
Page deduplication and content-type filtering are applied after vector
retrieval, preserving retriever ranking order and truncating the final output to
--top-k. When querying a local table ingested with an explicit embedding
model, pass the same --embed-model-name to retriever query.
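The interaction between --top-k, --candidate-k, and page deduplication can be sketched in a few lines of Python. This is an illustrative model of the behavior described above, not the CLI's actual implementation: the function name select_hits is hypothetical, and using (source, page_number) as the deduplication key is an assumption.

```python
def select_hits(candidates, top_k, candidate_k=None, page_dedup=False):
    """Sketch of post-retrieval selection; not the retriever CLI's real code."""
    # If --candidate-k is omitted, the pool is the same size as --top-k.
    candidate_k = top_k if candidate_k is None else candidate_k
    if candidate_k < top_k:
        raise ValueError("candidate_k must be greater than or equal to top_k")

    pool = candidates[:candidate_k]  # raw rows, already in ranking order
    selected = []
    seen_pages = set()
    for hit in pool:
        # Assumption: a "page" is identified by its source plus page number.
        key = (hit["source"], hit["page_number"])
        if page_dedup and key in seen_pages:
            continue  # keep only the best-ranked row per page
        seen_pages.add(key)
        selected.append(hit)
    return selected[:top_k]  # final truncation to --top-k


ranked = [
    {"source": "report.pdf", "page_number": 3, "text": "Revenue table"},
    {"source": "report.pdf", "page_number": 3, "text": "Revenue chart caption"},
    {"source": "report.pdf", "page_number": 7, "text": "Regional summary"},
    {"source": "notes.pdf", "page_number": 1, "text": "Meeting notes"},
]

# Pool of 2: both rows share a page, so deduplication leaves a single hit.
narrow = select_hits(ranked, top_k=2, page_dedup=True)
# Pool of 4: enough spare candidates remain to fill both result slots.
wide = select_hits(ranked, top_k=2, candidate_k=4, page_dedup=True)
print(len(narrow), len(wide))
```

In this toy data the two nearest rows come from the same page, which is the situation where a larger candidate pool keeps the final result list full.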
--content-types accepts comma-separated content types such as text, table,
chart, image, and infographic. images is accepted as an alias for
captioned image rows emitted by ingest. This option filters by content-type
metadata only; it does not filter by source, page, or other metadata
predicates. Hits with missing or unknown content-type metadata are excluded
while --content-types is active. In service mode, results must include
content-type metadata to match this filter. Default display values in the JSON
output are not used for content-type matching.
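A minimal Python sketch of the matching rules above. The metadata key content_type, the helper names, and the mapping of the images alias onto image are all assumptions made for illustration; the real field names and alias handling may differ.

```python
# Assumption: "images" resolves to the same rows as "image".
ALIASES = {"images": "image"}


def parse_content_types(option):
    """Turn a comma-separated --content-types value into a set of names."""
    requested = set()
    for raw in option.split(","):
        name = raw.strip()
        if name:
            requested.add(ALIASES.get(name, name))
    return requested


def filter_by_content_type(hits, option):
    """Keep hits whose content-type metadata matches; order is preserved."""
    wanted = parse_content_types(option)
    # Hits with missing metadata yield None here and are therefore excluded,
    # as are hits whose type was not requested.
    return [hit for hit in hits if hit.get("content_type") in wanted]


hits = [
    {"source": "a.pdf", "page_number": 1, "text": "Intro", "content_type": "text"},
    {"source": "a.pdf", "page_number": 2, "text": "Revenue by region", "content_type": "table"},
    {"source": "a.pdf", "page_number": 3, "text": "Photo caption", "content_type": "image"},
    {"source": "a.pdf", "page_number": 4, "text": "No metadata"},
]

kept = filter_by_content_type(hits, "table,images")
print([hit["page_number"] for hit in kept])
```

The filter looks only at content-type metadata, so the hit on page 4 is dropped even though its text might be relevant.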
Agentic retrieval
--agentic swaps the single dense pass for an LLM-driven ReAct loop: the agent
issues several retrieval sub-queries, fuses the candidates, and selects a final
ranking. It searches the same LanceDB table built by retriever ingest, so it is
a drop-in alternative to standard retrieval. Add --agentic and name the chat
model that drives the agent with --agentic-llm-model (required):
retriever query "how does the ingestion pipeline handle tables?" \
--agentic \
--agentic-llm-model nvidia/llama-3.3-nemotron-super-49b-v1.5
# remote agent + embedding endpoints, fewer reasoning rounds
retriever query "summarize the deployment options" \
--agentic \
--agentic-llm-model nvidia/llama-3.3-nemotron-super-49b-v1.5 \
--agentic-invoke-url http://localhost:9000/v1/chat/completions \
--embed-invoke-url http://localhost:8000/v1 \
--agentic-react-max-steps 5
Unlike the dense path (which returns text-enriched hits), agentic mode returns
the agent's ranked document IDs as JSON, each annotated with the source that
produced it (final_results, rrf, or selection_agent). It reuses the same
--top-k, --lancedb-uri, --table-name, --embed-invoke-url, and
--embed-model-name options as standard retrieval.
How it works. Each agentic query runs Query → ReActAgentOperator → (RRF
fusion) → SelectionAgentOperator → ranked results:
ReActAgentOperator runs the per-query ReAct loop; every retrieve tool call delegates to the standard Retriever, so the agent searches the same vector DB and embedding config as dense retrieval.
RRFAggregatorOperator fuses candidates from the loop's multiple searches with reciprocal rank fusion.
SelectionAgentOperator runs a final LLM selection pass over the fused set and emits the ranked document IDs.
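For readers unfamiliar with reciprocal rank fusion, the fusion step can be sketched as follows. This is the textbook RRF formula, not the source of RRFAggregatorOperator: the function name, the constant k=60 (the value conventionally used in the literature), and the alphabetical tie-break are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is a conventional default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Three hypothetical sub-query results from one ReAct loop.
searches = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_d"],
    ["doc_b", "doc_a"],
]
fused = reciprocal_rank_fusion(searches)
print(fused)
```

Documents that appear near the top of several sub-query results (doc_b here) rise to the top of the fused list, even if no single search ranked them first every time.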
Agentic-only knobs (apply only with --agentic):
--agentic-invoke-url: OpenAI-compatible chat-completions endpoint for the agent LLM; defaults to the operators' built-in endpoint when omitted.
--agentic-reasoning-effort (default high): reasoning_effort forwarded on agentic LLM calls.
--agentic-backend-top-k (default 20): candidates pulled from the vector DB per retrieval call.
--agentic-react-max-steps (default 50): maximum ReAct loop iterations.
--agentic-text-truncation (default 0): max characters of each candidate shown to the agent; 0 disables truncation.
--agentic-temperature (default 0.0): sampling temperature for agentic LLM calls (0.0 = greedy).
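When scripting agentic queries, it can help to assemble the command line in one place so the required model and the agentic-only knobs stay together. The helper below is a hypothetical sketch: build_agentic_query is not part of the retriever package, and it only builds and prints the argument list using flags documented above.

```python
import shlex


def build_agentic_query(question, llm_model, invoke_url=None,
                        react_max_steps=None, temperature=None):
    """Build an argv list for an agentic query (hypothetical helper)."""
    if not llm_model:
        # --agentic-llm-model is required whenever --agentic is used.
        raise ValueError("--agentic-llm-model is required with --agentic")

    argv = ["retriever", "query", question,
            "--agentic", "--agentic-llm-model", llm_model]
    optional = {
        "--agentic-invoke-url": invoke_url,
        "--agentic-react-max-steps": react_max_steps,
        "--agentic-temperature": temperature,
    }
    for flag, value in optional.items():
        if value is not None:  # omit a flag to keep the CLI default
            argv += [flag, str(value)]
    return argv


argv = build_agentic_query(
    "summarize the deployment options",
    "nvidia/llama-3.3-nemotron-super-49b-v1.5",
    react_max_steps=5,
)
# Print a shell-safe command line; pass argv to subprocess.run to execute it.
print(shlex.join(argv))
```

Leaving an argument as None omits its flag, so the CLI's documented defaults apply.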