Audio and video ingestion
Use this page for speech and audio extraction with Parakeet ASR and for video workflows that combine audio with OCR on frames or derived images.
Sections: Speech and audio (Parakeet) · Run Parakeet on the cluster (Helm) · Parakeet with hosted inference (build.nvidia.com) · Video and frame OCR
Speech and audio extraction
This documentation describes two ways to run NeMo Retriever Library with the parakeet-1-1b-ctc-en-us ASR NIM microservice (nvcr.io/nim/nvidia/parakeet-1-1b-ctc-en-us) to extract speech from audio files:
- Run the NIM locally on your cluster with the NeMo Retriever Helm chart
- Use NVIDIA Cloud Functions (NVCF) endpoints for cloud-based inference
Supported file types for speech extraction today:
- mp3, wav
- mp4, mov, mkv, avi: common video containers; the audio track is transcribed (same extensions as in What is NeMo Retriever Library?)
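As a pre-flight step, it can help to drop unsupported files before handing paths to the pipeline. The following is a minimal sketch that uses only the Python standard library; the helper names `classify_media` and `supported_media` are hypothetical and are not part of NeMo Retriever Library.

```python
from pathlib import Path

# Extensions listed above as supported for speech extraction.
AUDIO_EXTENSIONS = {".mp3", ".wav"}
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".avi"}


def classify_media(path):
    """Return "audio", "video", or None for an unsupported extension."""
    suffix = Path(path).suffix.lower()  # compare case-insensitively
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return None


def supported_media(paths):
    """Keep only the paths whose extension is in the supported list."""
    return [str(p) for p in paths if classify_media(p) is not None]
```

The filtered list could then be passed to the `files` method, assuming it accepts a list of paths as well as a glob pattern; confirm this in the Python Quick Start Guide for your release.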
NeMo Retriever Library supports extracting speech from audio for Retrieval Augmented Generation (RAG). Similar to how the multimodal document pipeline uses detection and OCR microservices, NeMo Retriever Library uses the parakeet-1-1b-ctc-en-us ASR NIM to transcribe speech to text, then generates embeddings for the transcript through the NeMo Retriever embedding path.
Before running audio extraction from Python with either self-hosted or hosted Parakeet, install the multimedia extra so the Parakeet ASR client can decode and resample audio:
pip install "nemo-retriever[multimedia]"
# For local GPU inference, include both extras:
pip install "nemo-retriever[local,multimedia]"
Important
Due to limitations in available VRAM controls in the current release, the parakeet-1-1b-ctc-en-us ASR NIM must run on a dedicated additional GPU. For the full list of requirements, refer to the Pre-Requisites & Support Matrix.
This pipeline enables retrieval at the speech segment level when you enable segmenting (see examples below).

Run Parakeet on the cluster (Helm)
Use the following procedure to run the NIM on your own infrastructure. Self-hosted Parakeet runs on Kubernetes via the NeMo Retriever Helm chart.
Important
Pin the Parakeet workload to the dedicated GPU with your Helm values or the NIM Operator (for example, node selectors, resource limits, or device requests appropriate to your cluster).
- Deploy or upgrade NeMo Retriever Library with the Helm chart and enable the ASR / audio components your release requires (Parakeet and related services). Follow Deploy (Helm chart) and Deployment options. Ensure the chart values for your cluster request the ASR NIM.
- After the services are running, interact with the pipeline from Python.
  - The Ingestor object initializes the ingestion process.
  - The files method specifies the input files to process.
  - The extract method runs audio extraction.

      ingestor = (
          Ingestor()
          .files("./data/*.wav")
          .extract(
              document_type="wav",  # Ingestor should detect type automatically in most cases
              extract_method="audio",
              extract_audio_params={
                  "segment_audio": True,
              },
          )
      )

  To generate one extracted element for each sentence-like ASR segment, include extract_audio_params={"segment_audio": True} when calling .extract(...). This option applies when audio extraction runs with a self-hosted Parakeet NIM or using build.nvidia.com hosted inference, but has no effect when using the local Hugging Face Parakeet model.

  Tip
  For more Python examples, refer to Python Quick Start Guide.
Parakeet with hosted inference (build.nvidia.com)
Instead of running the pipeline locally, you can call Parakeet through build.nvidia.com hosted inference.
- On the Parakeet model page on build.nvidia.com, create or copy an API key and note the function ID for hosted access. You need both before making API calls.
- Run inference from Python with the hosted gRPC endpoint and credentials from that page (the example below uses the default hosted gRPC hostname; confirm values in the Get API Key flow for your deployment).
  - The Ingestor object initializes the ingestion process.
  - The files method specifies the input files to process.
  - The extract method runs audio extraction.
  - The document_type parameter is optional because Ingestor should detect the file type automatically in most cases.

      ingestor = (
          Ingestor()
          .files("./data/*.mp3")
          .extract(
              document_type="mp3",
              extract_method="audio",
              extract_audio_params={
                  "grpc_endpoint": "grpc.nvcf.nvidia.com:443",
                  "auth_token": "<API key>",
                  "function_id": "<function ID>",
                  "use_ssl": True,
                  "segment_audio": True,
              },
          )
      )

  Tip
  For more Python examples, refer to Python Quick Start Guide.
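To avoid hard-coding the API key and function ID in source files, the hosted parameters can be assembled from environment variables. The following is a minimal sketch that uses the same extract_audio_params keys as the hosted example above. The helper name `hosted_audio_params` and the environment variable names `NVIDIA_API_KEY` and `PARAKEET_FUNCTION_ID` are assumptions made for illustration; they are not defined by NeMo Retriever Library.

```python
import os

# Hypothetical environment variable names; choose names that fit your setup.
API_KEY_VAR = "NVIDIA_API_KEY"
FUNCTION_ID_VAR = "PARAKEET_FUNCTION_ID"


def hosted_audio_params(env=None, segment_audio=True):
    """Build extract_audio_params for hosted Parakeet from environment values."""
    env = os.environ if env is None else env
    # Fail early with a clear message if a credential is missing or empty.
    missing = [name for name in (API_KEY_VAR, FUNCTION_ID_VAR) if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        # Default hosted gRPC hostname from the example above; confirm it for your deployment.
        "grpc_endpoint": "grpc.nvcf.nvidia.com:443",
        "auth_token": env[API_KEY_VAR],
        "function_id": env[FUNCTION_ID_VAR],
        "use_ssl": True,
        "segment_audio": segment_audio,
    }
```

The returned dictionary can be passed as extract_audio_params in the extract call, for example `.extract(extract_method="audio", extract_audio_params=hosted_audio_params())`.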
Video and frame OCR
For video assets, NeMo Retriever Library can combine audio or speech processing (see Speech and audio extraction above) with visual text extraction when OCR applies to frames or derived images.
For OCR-oriented extract methods on scanned or image-heavy content, see OCR and scanned documents, text and layout extraction, and Nemotron Parse for advanced visual parsing.
Container formats and early-access video types are listed under supported file types and formats (see What is NeMo Retriever Library? for the full list).
For end-to-end RAG stacks that include multimodal ingestion, see the NVIDIA AI Blueprints catalog and related solution pages on NVIDIA Build.
Related topics
- Pre-Requisites & Support Matrix
- Troubleshoot NeMo Retriever extraction
- Use the Python API
- Chunking (includes audio and video segmenting defaults)