Use the NeMo Retriever Extraction Python API
The NeMo Retriever extraction Python API provides a simple and flexible interface for processing and extracting information from various document types, including PDFs.
Note
NeMo Retriever extraction is also known as NVIDIA Ingest and nv-ingest.
Tip
There is a Jupyter notebook available to help you get started with the Python API. For more information, refer to Python Client Quick Start Guide.
Summary of Key Methods
The main class in the nv-ingest API is Ingestor. The Ingestor class provides an interface for building, managing, and running data ingestion jobs, enabling chainable task additions and job state tracking.
The following table describes the methods of the Ingestor class.
| Method | Description |
|---|---|
| caption | Extract captions from images within the document. |
| embed | Generate embeddings from extracted content. |
| extract | Add an extraction task (text, tables, charts, infographics). |
| files | Add document paths for processing. |
| ingest | Submit jobs and retrieve results synchronously. |
| load | Ensure files are locally accessible (downloads if needed). |
| save_to_disk | Save ingestion results to disk instead of memory. |
| split | Split documents into smaller sections for processing. For more information, refer to Split Documents. |
| vdb_upload | Push extraction results to the Milvus vector database. For more information, refer to Data Upload. |
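Each method in the table returns the Ingestor so that tasks can be chained fluently before the job runs. The builder pattern behind this chainable interface can be sketched in isolation; the MiniIngestor class below is a hypothetical illustration of the pattern only, not the real nv-ingest implementation.

```python
# Hypothetical sketch of the chainable builder pattern that the
# Ingestor class follows. MiniIngestor is illustrative only; it is
# NOT the real nv-ingest implementation.
class MiniIngestor:
    def __init__(self):
        self._documents = []
        self._tasks = []

    def files(self, paths):
        # Record document paths and return self so calls can chain.
        self._documents.extend(paths)
        return self

    def extract(self, **options):
        # Queue an extraction task; the job does not run yet.
        self._tasks.append(("extract", options))
        return self

    def embed(self, **options):
        self._tasks.append(("embed", options))
        return self

    def ingest(self):
        # A real pipeline would submit jobs; here we just report the plan.
        return [(doc, [name for name, _ in self._tasks]) for doc in self._documents]

plan = MiniIngestor().files(["doc1.pdf"]).extract(extract_text=True).embed().ingest()
print(plan)  # [('doc1.pdf', ['extract', 'embed'])]
```

Because every task method returns the builder, the pipeline definition stays separate from execution: nothing runs until ingest is called.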
Track Job Progress
For large document batches, you can enable a progress bar by setting show_progress to True.
Use the following code.
# Return only successes
results = ingestor.ingest(show_progress=True)
print(len(results), "successful documents")
Capture Job Failures
You can capture job failures by setting return_failures to True.
Use the following code.
# Return both successes and failures
results, failures = ingestor.ingest(show_progress=True, return_failures=True)
print(f"{len(results)} successful docs; {len(failures)} failures")
if failures:
    print("Failures:", failures[:1])
When you use the vdb_upload method, uploads are performed after ingestion completes.
The behavior of the upload depends on the value of return_failures:
- False – If any job fails, the ingest method raises a runtime error and does not upload any data (all-or-nothing data upload). This is the default setting.
- True – If any jobs succeed, the results from those jobs are uploaded, and no errors are raised (partial data upload). The ingest method returns a failures object that contains the details for any jobs that failed. You can inspect the failures object and selectively retry or remediate the failed jobs.
The following example uploads data to Milvus and returns any failures.
ingestor = (
    Ingestor(client=client)
    .files(["/path/doc1.pdf", "/path/doc2.pdf"])
    .extract()
    .embed()
    .vdb_upload(collection_name="my_collection", milvus_uri="milvus.db")
)
# Use for large batches where you want successful chunks/pages to be committed, while collecting detailed diagnostics for failures.
results, failures = ingestor.ingest(return_failures=True)
print(f"Uploaded {len(results)} successful docs; {len(failures)} failures")
if failures:
    print("Failures:", failures[:1])
Quick Start: Extracting PDFs
The following example demonstrates how to initialize Ingestor, load a PDF file, and extract its contents.
The extract method enables different types of data to be extracted.
Extract a Single PDF
Use the following code to extract a single PDF file.
from nv_ingest_client.client.interface import Ingestor
# Initialize Ingestor with a local PDF file
ingestor = Ingestor().files("path/to/document.pdf")
# Extract text, tables, and images
result = ingestor.extract().ingest()
print(result)
Extract Multiple PDFs
Use the following code to process multiple PDFs at one time.
ingestor = Ingestor().files(["path/to/doc1.pdf", "path/to/doc2.pdf"])
# Extract content from all PDFs
result = ingestor.extract().ingest()
for doc in result:
    print(doc)
Extract Specific Elements from PDFs
By default, the extract method extracts all supported content types.
You can customize the extraction behavior by using the following code.
ingestor = ingestor.extract(
    extract_text=True,           # Extract text
    text_depth="page",
    extract_tables=False,        # Skip table extraction
    extract_charts=True,         # Extract charts
    extract_infographics=True,   # Extract infographic images
    extract_images=False         # Skip image extraction
)
Extract Non-standard Document Types
Use the following code to extract text from .md, .sh, and .html files.
ingestor = Ingestor().files(["path/to/doc1.md", "path/to/doc2.html"])
ingestor = ingestor.extract(
    extract_text=True,   # Only extract text
    extract_tables=False,
    extract_charts=False,
    extract_infographics=False,
    extract_images=False
)
result = ingestor.ingest()
Extract with Custom Document Type
Use the following code to specify a custom document type for extraction.
ingestor = ingestor.extract(document_type="pdf")
Work with Large Datasets: Save to Disk
By default, NeMo Retriever extraction stores the results from every document in system memory (RAM).
When you process a very large dataset with thousands of documents, you might encounter an Out-of-Memory (OOM) error.
The save_to_disk method configures the extraction pipeline to write the output for each document to a separate JSONL file on disk.
Basic Usage: Save to a Directory
To save results to disk, chain the save_to_disk method to your ingestion task.
When you use save_to_disk, the ingest method returns a list of LazyLoadedList objects, which are memory-efficient proxies that read from the result files on disk.
In the following example, the results are saved to a directory named my_ingest_results.
You are responsible for managing the created files.
ingestor = Ingestor().files("large_dataset/*.pdf")
# Use save_to_disk to configure the ingestor to save results to a specific directory.
# Set cleanup=False to ensure that the directory is not deleted by any automatic process.
ingestor.save_to_disk(output_directory="./my_ingest_results", cleanup=False) # Offload results to disk to prevent OOM errors
# 'results' is a list of LazyLoadedList objects that point to the new jsonl files.
results = ingestor.extract().ingest()
print("Ingestion results saved in ./my_ingest_results")
# You can now iterate over the results or inspect the files directly.
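To get a feel for how a memory-efficient proxy over a JSONL result file can behave, the following self-contained sketch implements a minimal lazy reader. LazyJSONL is hypothetical and only stands in for the LazyLoadedList objects that ingest returns; the real class has its own interface.

```python
import json
import os
import tempfile

# Hypothetical sketch of a lazy JSONL reader, illustrating how a proxy
# such as LazyLoadedList can avoid holding every result in memory at once.
class LazyJSONL:
    def __init__(self, path):
        self.path = path  # only the file path is kept in memory

    def __iter__(self):
        # Records are parsed one line at a time, on demand.
        with open(self.path) as f:
            for line in f:
                yield json.loads(line)

# Write a small sample result file, then read it lazily.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "doc1.jsonl")
    with open(path, "w") as f:
        f.write('{"type": "text", "content": "hello"}\n')
        f.write('{"type": "table", "content": "a,b"}\n')
    records = list(LazyJSONL(path))

print(len(records), records[0]["type"])  # 2 text
```

The key point is that only the file path lives in memory; each record is parsed on demand during iteration, which is what keeps large batches from exhausting RAM.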
Managing Disk Space with Automatic Cleanup
When you use save_to_disk, NeMo Retriever extraction creates intermediate files.
For workflows where these files are temporary, NeMo Retriever extraction provides two automatic cleanup mechanisms.
- Directory Cleanup with Context Manager – While not required for general use, the Ingestor can be used as a context manager (with statement). This enables the automatic cleanup of the entire output directory when save_to_disk(cleanup=True) is set (which is the default).
- File Purge After VDB Upload – The vdb_upload method includes a purge_results_after_upload: bool = True parameter (the default). After a successful VDB upload, this feature deletes the individual .jsonl files that were just uploaded.
You can also configure the output directory by using the NV_INGEST_CLIENT_SAVE_TO_DISK_OUTPUT_DIRECTORY environment variable.
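The variable must be set before results are written. A minimal sketch of setting it from Python at startup follows; the directory name here is an arbitrary example, not a required value.

```python
import os

# Set the default output directory for save_to_disk before building the
# pipeline. The path below is an arbitrary example.
os.environ["NV_INGEST_CLIENT_SAVE_TO_DISK_OUTPUT_DIRECTORY"] = "./my_ingest_results"

print(os.environ["NV_INGEST_CLIENT_SAVE_TO_DISK_OUTPUT_DIRECTORY"])
```

Passing output_directory to save_to_disk directly takes the same role; the environment variable is convenient when the directory is decided by deployment configuration rather than code.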
Example (Fully Automatic Cleanup)
Fully automatic cleanup is the recommended pattern for ingest-and-upload workflows where the intermediate files are no longer needed. The entire process is temporary, and no files are left on disk. The following example includes automatic file purge.
# After the 'with' block finishes,
# the temporary directory and all its contents are automatically deleted.
with (
    Ingestor()
    .files("/path/to/large_dataset/*.pdf")
    .extract()
    .embed()
    .save_to_disk()  # cleanup=True is the default, enables directory deletion on exit
    .vdb_upload()    # purge_results_after_upload=True is the default, deletes files after upload
) as ingestor:
    results = ingestor.ingest()
Example (Preserve Results on Disk)
In scenarios where you need to inspect or use the intermediate jsonl files, you can disable the cleanup features.
The following example disables automatic file purge.
# After the 'with' block finishes,
# the './permanent_results' directory and all jsonl files are preserved for inspection or other uses.
with (
    Ingestor()
    .files("/path/to/large_dataset/*.pdf")
    .extract()
    .embed()
    .save_to_disk(output_directory="./permanent_results", cleanup=False)  # Specify a directory and disable directory-level cleanup
    .vdb_upload(purge_results_after_upload=False)  # Disable automatic file purge after the VDB upload
) as ingestor:
    results = ingestor.ingest()
Extract Captions from Images
The caption method generates image captions by using a vision-language model.
You can use this to describe images extracted from documents.
Note
The default model used by caption is nvidia/llama-3.1-nemotron-nano-vl-8b-v1.
ingestor = ingestor.caption()
To specify a different API endpoint, pass additional parameters to caption.
ingestor = ingestor.caption(
    endpoint_url="https://integrate.api.nvidia.com/v1/chat/completions",
    model_name="nvidia/llama-3.1-nemotron-nano-vl-8b-v1",
    api_key="nvapi-"
)
Extract Embeddings
The embed method in NV-Ingest generates text embeddings for document content.
ingestor = ingestor.embed()
Note
By default, embed uses the llama-3.2-nv-embedqa-1b-v2 model.
To use a different embedding model, such as nv-embedqa-e5-v5, specify a different model_name and endpoint_url.
ingestor = ingestor.embed(
    endpoint_url="https://integrate.api.nvidia.com/v1",
    model_name="nvidia/nv-embedqa-e5-v5",
    api_key="nvapi-"
)
Extract Audio
Use the following code to extract content from MP3 audio files.
from nv_ingest_client.client import Ingestor
ingestor = Ingestor().files("audio_file.mp3")
ingestor = ingestor.extract(
    document_type="mp3",
    extract_text=True,
    extract_tables=False,
    extract_charts=False,
    extract_images=False,
    extract_infographics=False,
).split(
    tokenizer="meta-llama/Llama-3.2-1B",
    chunk_size=150,
    chunk_overlap=0,
    params={"split_source_types": ["mp3"], "hf_access_token": "hf_***"}
)
results = ingestor.ingest()