Use Multimodal Embedding with NeMo Retriever Extraction
This documentation describes how to use NeMo Retriever extraction with the multimodal embedding model Llama 3.2 NeMo Retriever Multimodal Embedding 1B.
The Llama 3.2 NeMo Retriever Multimodal Embedding 1B model is optimized for multimodal question-answering retrieval. The model can embed documents in the form of an image, text, or a combination of the two, and documents can then be retrieved given a user query in text form. The model supports images that contain text, tables, charts, and infographics.
Note
NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
Configure and Run the Multimodal NIM
Use the following procedure to configure and run the multimodal embedding NIM locally.
- Set the embedding model in your `.env` file. This tells NeMo Retriever extraction to use the Llama 3.2 multimodal model instead of the default text-only embedding model.

  ```
  EMBEDDING_IMAGE=nvcr.io/nvidia/nemo-microservices/llama-3.2-nemoretriever-1b-vlm-embed-v1
  EMBEDDING_TAG=1.7.0
  EMBEDDING_NIM_MODEL_NAME=nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1
  ```
- Start the NeMo Retriever extraction services. The multimodal embedding service is included by default.

  ```
  docker compose --profile retrieval --profile table-structure up
  ```
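Before running ingestion, it can help to confirm that the embedding service is up. The following sketch polls a NIM-style `/v1/health/ready` endpoint using only the standard library. The port (8012 here) is an assumption; check your `docker-compose.yaml` for the actual port mapping.

```python
# Minimal readiness probe for a NIM-style service.
# The URL and port below are assumptions; confirm them against your
# docker-compose configuration before relying on this check.
import urllib.error
import urllib.request


def is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if the service answers its readiness endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

You can call `is_ready("http://localhost:8012")` in a loop with a short sleep to wait for the containers to finish starting.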
After the services are running, you can interact with the extraction pipeline by using Python. The key to leveraging the multimodal model is to configure the `extract` and `embed` methods to process different content types as either text or images.
Example with Default Text-Based Embedding
When you use the multimodal model, by default, all extracted content (text, tables, charts) is treated as plain text. The following example provides a strong baseline for retrieval.
- The `extract` method is configured to pull out text, tables, and charts.
- The `embed` method is called with no arguments.
```python
ingestor = (
    Ingestor()
    .files("./data/*.pdf")
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=False,
    )
    .embed()  # Default behavior embeds all content as text
)

results = ingestor.ingest()
```
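Each call to `ingest()` returns the extracted elements with their embeddings attached. The exact payload shape can vary by version, so the following sketch assumes a common layout: a list with one entry per input file, where each entry is a list of element dicts whose `metadata` carries the embedding. The sample payload is mock data; inspect your own `results` object to confirm the keys.

```python
# Sketch: collect embeddings from ingest() results.
# The nested list-of-dicts layout below is an assumption about the result
# payload shape; verify it against your actual results before use.
def collect_embeddings(results):
    """Flatten per-file results into (document_type, embedding) pairs."""
    pairs = []
    for per_file in results:        # one entry per input file
        for element in per_file:    # one dict per extracted element
            embedding = element.get("metadata", {}).get("embedding")
            if embedding is not None:
                pairs.append((element.get("document_type"), embedding))
    return pairs


# Mock payload illustrating the assumed shape.
mock_results = [
    [
        {"document_type": "text", "metadata": {"embedding": [0.1, 0.2]}},
        {"document_type": "structured", "metadata": {"embedding": [0.3, 0.4]}},
    ]
]
print(collect_embeddings(mock_results))
```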
Example with Embedding Structured Elements as Images
A common pattern for PDFs is to embed standard text as text, and to embed visual elements such as tables and charts as images. The following example enables the multimodal model to capture the spatial and structural information of the visual content.
- The `extract` method is configured to pull out text, tables, and charts.
- The `embed` method is configured with `structured_elements_modality="image"` to embed the extracted tables and charts as images.
```python
ingestor = (
    Ingestor()
    .files("./data/*.pdf")
    .extract(
        extract_text=True,
        extract_tables=True,
        extract_charts=True,
        extract_images=False,
    )
    .embed(
        structured_elements_modality="image",
    )
)

results = ingestor.ingest()
```
Example with Embedding Entire PDF Pages as Images
For documents where the entire page layout is important (such as infographics, complex diagrams, or forms), you can configure NeMo Retriever extraction to treat every page as a single image. The following example extracts and embeds each page as an image.
Note
The `extract_page_as_image` feature is experimental. Its behavior may change in future releases.
- The `extract` method uses the `extract_page_as_image=True` parameter. All other extraction types are set to `False`.
- The `embed` method processes the page images.
```python
ingestor = (
    Ingestor()
    .files("./data/*.pdf")
    .extract(
        extract_text=False,
        extract_tables=False,
        extract_charts=False,
        extract_images=False,
        extract_page_as_image=True,
    )
    .embed(
        image_elements_modality="image",
    )
)

results = ingestor.ingest()
```
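Once documents are embedded, retrieval reduces to comparing a query embedding against the stored document embeddings. The following self-contained sketch ranks mock embedding vectors by cosine similarity; in a real pipeline the query embedding would come from the same embedding model, and the comparison would typically run inside a vector database rather than in plain Python.

```python
# Sketch: rank documents by cosine similarity to a query embedding.
# The vectors below are mock data; real embeddings come from the model.
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(query_embedding, doc_embeddings):
    """Return (doc_id, score) pairs sorted best-first."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, vec))
        for doc_id, vec in doc_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {"page_1": [1.0, 0.0], "page_2": [0.0, 1.0]}
print(rank([0.9, 0.1], docs))
```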