Is this page helpful?

Use Custom Metadata to Filter Search Results

You can upload custom metadata for documents during ingestion. By uploading custom metadata you can attach additional information to documents, and use it for filtering results during retrieval operations. For example, you can add author metadata to your documents, and filter by author when you retrieve results. To create filters at query time, use predicates supported by LanceDB SQL against your table schema (custom fields are serialized into the metadata column with your ingested chunks). For a worked example, see the repository notebook linked at the end of this page.

Use this documentation to use custom metadata to filter search results when you work with NeMo Retriever Library.

Limitations

The following are limitation when you use custom metadata:

Metadata fields must be consistent across documents in the same collection.
Complex filter expressions may impact retrieval performance.
If you update your custom metadata, you must ingest your documents again to use the new metadata.

Add Custom Metadata During Ingestion

You can add custom metadata during the document ingestion process. You can specify metadata for each file, and you can specify different metadata for different documents in the same ingestion batch.

Metadata Structure

You specify custom metadata as a dataframe or a file (json, csv, or parquet).

The following example contains metadata fields for category, department, and timestamp. You can create whatever metadata is helpful for your scenario.

import pandas as pd

meta_df = pd.DataFrame(
    {
        "source": ["data/woods_frost.pdf", "data/multimodal_test.pdf"],
        "category": ["Alpha", "Bravo"],
        "department": ["Language", "Engineering"],
        "timestamp": ["2025-05-01T00:00:00", "2025-05-02T00:00:00"]
    }
)

# Convert the dataframe to a csv file, 
# to demonstrate how to ingest a metadata file in a later step.

file_path = "./meta_file.csv"
meta_df.to_csv(file_path)

Example: Add Custom Metadata During Ingestion

The following example adds custom metadata during ingestion. For more information about create_ingestor and run modes, refer to Use the Python API. For more information about the vdb_upload method, refer to Upload Data.

from nemo_retriever import create_ingestor

# Service-backed pipeline: point `base_url` at your running retriever service.
# For local graph execution instead, see [Use the Python API](nemo-retriever-api-reference.md).

hostname = "localhost"
table_name = "nemo_retriever_collection"
lancedb_uri = "./lancedb_data"

ingestor = (
    create_ingestor(run_mode="service", base_url=f"http://{hostname}:7670")
        .files(["data/woods_frost.pdf", "data/multimodal_test.pdf"])
        .extract(
            extract_text=True,
            extract_tables=True,
            extract_charts=True,
            extract_images=True,
            text_depth="page"
        )
        .embed()
        .vdb_upload(
            vdb_op="lancedb",
            uri=lancedb_uri,
            table_name=table_name,
            hybrid=False,
        )
)
results = ingestor.ingest_async().result()

Merge values from meta_df (or file_path) into each document's content_metadata before vdb_upload, or follow the step-by-step pattern in metadata_and_filtered_search.ipynb, so category, department, and timestamp are present on the chunks LanceDB indexes.

Best Practices

The following are the best practices when you work with custom metadata:

Plan metadata structure before ingestion.
Test filter expressions with small datasets first.
Consider performance implications of complex filters.
Validate metadata during ingestion.
Handle missing metadata fields gracefully.
Log invalid filter expressions.

Use Custom Metadata to Filter Results During Retrieval

You can use custom metadata to filter documents during retrieval operations. For predicate pushdown, use LanceDB SQL on an opened table (see the native query sketch below). The lancedb_retrieval helper does not accept a server-side filter: it always returns up to top_k hits from the index, so any list comprehension over those hits is application-side only—raise top_k if your matches might sit outside the first top_k neighbors, or use a native table.search(...).where(...) query instead.

Example filter ideas

Typical keys to filter on include category, department, priority, and timestamp (use comparable ISO-8601 strings for time ranges). Encode predicates in LanceDB SQL against your table columns (often the serialized metadata string), or inspect hit["entity"]["content_metadata"] after search as in the lancedb_retrieval example below.

Example: Use a Filter Expression in Search

After ingestion is complete, and documents are uploaded to LanceDB with metadata, you can narrow results in the database with a where clause, or in Python on the returned hits.

Native LanceDB (SQL pushdown): connect, embed the query yourself (same model as ingestion), then chain .where("<LanceDB SQL predicate>") on table.search(...) so filtering happens before the limit. Exact SQL depends on how metadata is stored; see LanceDB SQL.

import lancedb

# Pseudocode sketch — replace YOUR_VECTOR and YOUR_PREDICATE with real values.
db = lancedb.connect("./lancedb_data")
table = db.open_table("nemo_retriever_collection")
# table.search(YOUR_VECTOR, vector_column_name="vector").where(YOUR_PREDICATE).limit(10).to_list()

lancedb_retrieval + post-filter: the helper only returns top_k rows with no where argument; filtering in Python is for illustration and does not change what the database evaluates.

Use the lancedb_retrieval helper from the same LanceDB module you use with create_ingestor (see Python API).

hostname = "localhost"
table_name = "nemo_retriever_collection"
lancedb_uri = "./lancedb_data"
top_k = 5
model_name = "nvidia/llama-nemotron-embed-vl-1b-v2"

queries = ["this is expensive"]
q_results = []
for que in queries:
    batch = lancedb_retrieval(
        [que],
        table_path=lancedb_uri,
        table_name=table_name,
        embedding_endpoint=f"http://{hostname}:8012/v1",
        top_k=top_k,
        model_name=model_name,
    )
    # Application-side only: fewer than top_k hits if Engineering rows are not in this batch
    filtered = [
        hit
        for hit in batch[0]
        if hit.get("entity", {})
        .get("content_metadata", {})
        .get("department")
        == "Engineering"
    ]
    q_results.append(filtered)

print(f"{q_results}")

For a notebook that uses the CLI to add custom metadata and filter query results, refer to metadata_and_filtered_search.ipynb .