Use Multimodal Embedding with NeMo Retriever Library

Text-only NeMo Retriever embedding NIM

You can still use the NeMo Retriever text embedding NIM (OpenAI-compatible embeddings for passage and query vectors) alongside or instead of the multimodal flows on this page. Product and deployment details are in the NeMo Retriever Text Embedding NIM documentation. In library and CLI pipelines, route embedding to that NIM with your configured embed endpoint and model name (refer to the graph pipeline examples for environment-based remote inference).

This documentation describes how to use NeMo Retriever Library with the multimodal embedding model Llama Nemotron Embed VL 1B v2.

The Llama Nemotron Embed VL 1B v2 model is optimized for multimodal question-answering retrieval. The model can embed documents in the form of an image, text, or a combination of image and text. Documents can then be retrieved given a user query in text form. The model supports images that contain text, tables, charts, and infographics.

Example with Default Text-Based Embedding

When you use the multimodal model, by default, all extracted content (text, tables, charts) is treated as plain text. The following example provides a strong baseline for retrieval.

The embed method is called with no arguments.

For parameter details, refer to the Python API guide.

from nemo_retriever import create_ingestor

ingestor = (
    create_ingestor(run_mode="batch")
    .files("./data/*.pdf")
    .extract()
    .embed()  # Default behavior embeds all content as text
)
results = ingestor.ingest()

Example with Embedding Structured Elements as Text + Images

It is common to process PDFs by embedding standard text as text and embed visual elements such as tables and charts as images. The following example enables the multimodal model to capture the spatial and structural information of the visual content.

The embed method is configured with embed_modality="text_image" to embed the extracted tables and charts as images.
This configuration is more accurate than text only, with a performance cost.

For parameter details, refer to the Python API guide.

from nemo_retriever import create_ingestor

ingestor = (
    create_ingestor(run_mode="batch")
    .files("./data/*.pdf")
    .extract()
    .embed(
        embed_modality="text_image",
    )
)
results = ingestor.ingest()

Example with Embedding Entire PDF Pages as Images

For documents where the entire page layout is important (such as infographics, complex diagrams, or forms), you can configure NeMo Retriever Library to treat every page as a single image. The following example extracts and embeds each page as an image.

The embed method processes the page images.

For parameter details, refer to the Python API guide.

from nemo_retriever import create_ingestor

ingestor = (
    create_ingestor(run_mode="batch")
    .files("./data/*.pdf")
    .extract()
    .embed(
        embed_modality="image",
        embed_granularity="page",
    )
)
results = ingestor.ingest()

Use Multimodal Embedding with NeMo Retriever Library

Example with Default Text-Based Embedding

Example with Embedding Structured Elements as Text + Images

Example with Embedding Entire PDF Pages as Images

Related Topics