Workflow: Ingest documents into a searchable VDB collection

This page covers, in one place, how to extract content from documents and turn that content into a searchable vector collection, so you can scroll and search (for example with Ctrl+F) instead of jumping across multiple short workflow stubs.

Ingest and extract

Document ingestion is the step where NeMo Retriever Library reads your files (PDFs, Office documents, images, and other supported formats), runs extraction and optional enrichment, and returns structured content you can embed and index.

Follow these steps:

  1. Choose how you call the library. Use the Python API or CLI from application code, or run a deployment (for example NeMo Retriever Library on GitHub, Deployment options, or Quickstart: Kubernetes (Helm)) and send jobs over the network. Runnable examples appear in Choose how you call the library below.
  2. Use parallel PDF handling. The default ingest path splits large PDFs before Ray processing; refer to API guide — PDF pre-splitting.
  3. Tune extraction for your content. Refer to Multimodal extraction for formats, text and layout, tables, OCR, and related subsections on that page.
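
To make step 2 more concrete, the sketch below shows the general idea behind pre-splitting: dividing a page count into contiguous page ranges that can be processed independently. It is an illustration only, not the library's implementation. The function name page_ranges and the pages_per_chunk value are hypothetical; the actual behavior and settings are described in the API guide.

```python
from __future__ import annotations


def page_ranges(total_pages: int, pages_per_chunk: int) -> list[tuple[int, int]]:
    """Return inclusive, 1-based (first_page, last_page) ranges covering a PDF.

    Hypothetical helper for illustration; not part of NeMo Retriever Library.
    """
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")

    ranges = []
    for first in range(1, total_pages + 1, pages_per_chunk):
        # The final range is shorter when the page count is not a multiple
        # of pages_per_chunk.
        last = min(first + pages_per_chunk - 1, total_pages)
        ranges.append((first, last))
    return ranges


# A 70-page PDF split into chunks of at most 32 pages (assumed chunk size).
print(page_ranges(70, 32))  # [(1, 32), (33, 64), (65, 70)]
```

Each range could then be handed to a separate worker, which is why splitting a large PDF up front helps parallel processing.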

Pipeline concepts and stage overview appear in Key concepts. Default chunking behavior is summarized under Chunking.

create_ingestor(...) returns a GraphIngestor, which chains .extract(), .embed(), and .vdb_upload() into one graph. The Python example below stops after .embed() so you can inspect chunks first; append .vdb_upload(vdb_op="lancedb", vdb_kwargs={...}) before .ingest() to write directly to LanceDB (refer to Vector databases).

Choose how you call the library

The following examples match the NeMo Retriever Library README. They assume a checkout of the NeMo Retriever repository and the batch run mode with local GPU inference unless you configure remote NIMs.

Ingest a test PDF (Python)

The test PDF contains text, tables, charts, and images. The pipeline below chains .extract() and .embed() only so you can inspect embedded chunks before indexing. To upload in the same run, append .vdb_upload(...) before .ingest() (parameter details in the Python API guide).

from pathlib import Path

from nemo_retriever import create_ingestor

# Path to the test PDF, relative to the repository root.
documents = [str(Path("data/multimodal_test.pdf"))]
ingestor = create_ingestor(run_mode="batch")

# Build the pipeline: extract text, charts, tables, and infographics, then embed.
ingestor = (
    ingestor.files(documents)
    .extract(
        extract_text=True,
        extract_charts=True,
        extract_tables=True,
        extract_infographics=True,
    )
    .embed()
)

# Run the pipeline. Returns a pandas.DataFrame in the batch and inprocess run modes.
result = ingestor.ingest()

Run the example above with the repository root as your working directory (so that data/multimodal_test.pdf resolves), or set documents to the absolute path of the test PDF.
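
If you prefer not to depend on the working directory, one option is to resolve the document paths against a known repository root and fail early when a file is missing. The helper below is a minimal sketch using only the standard library; resolve_documents is a hypothetical name and is not part of NeMo Retriever Library.

```python
from pathlib import Path


def resolve_documents(repo_root, relative_paths):
    """Resolve paths against repo_root and raise if any file is missing.

    Hypothetical helper for illustration; not part of NeMo Retriever Library.
    """
    root = Path(repo_root).expanduser().resolve()
    resolved, missing = [], []
    for rel in relative_paths:
        path = (root / rel).resolve()
        # Collect every missing file so the error reports all of them at once.
        (resolved if path.is_file() else missing).append(str(path))
    if missing:
        raise FileNotFoundError("Missing input files: " + ", ".join(missing))
    return resolved
```

For example, documents = resolve_documents(Path.cwd(), ["data/multimodal_test.pdf"]) returns a list of absolute path strings that can be passed to ingestor.files(documents), or raises FileNotFoundError before any pipeline work starts.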

Next: Semantic retrieval when serving queries (also refer to Evaluate on your data for reranking and quality checks).