
2024-01-03 · 5 minute read

Announcing NVIDIA NeMo Parakeet ASR Models for Pushing the Boundaries of Speech Recognition

NVIDIA NeMo, a leading open-source toolkit for conversational AI, announces the release of Parakeet, a family of state-of-the-art automatic speech recognition (ASR) models (Figure 1), capable of transcribing spoken English with exceptional accuracy. Developed in collaboration with Suno.ai, Parakeet ASR models mark a significant leap forward in speech recognition, paving the way for more natural and efficient human-computer interactions.


Figure 1. HuggingFace Leaderboard as of 01/03/2024.

NVIDIA announces four Parakeet models, differentiated by decoder type (RNN Transducer or Connectionist Temporal Classification) and model size. They boast 0.6-1.1 billion parameters and are capable of tackling diverse audio environments. Trained on only a 64,000-hour dataset encompassing various accents, domains, and noise conditions, the models deliver exceptional word error rate (WER) performance across benchmark datasets, outperforming previous models. The four variants are listed below, followed by a short loading sketch:

  • Parakeet RNNT 1.1B - Best recognition accuracy, modest inference speed. Best used when the most accurate transcriptions are necessary.
  • Parakeet CTC 1.1B - Fast inference, strong recognition accuracy. A great middle ground between accuracy and speed of inference.
  • Parakeet RNNT 0.6B - Strong recognition accuracy and fast inference. Useful for large-scale inference on limited resources.
  • Parakeet CTC 0.6B - Fastest speed, modest recognition accuracy. Useful when transcription speed is the most important.
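
As a minimal sketch, any variant can be loaded by its Hugging Face checkpoint name. The identifiers below are assumed to follow the parakeet-rnnt-1.1b naming pattern used elsewhere in this post; verify the exact names on the Hugging Face Hub.

import nemo.collections.asr as nemo_asr

# Checkpoint names assumed from the parakeet-<decoder>-<size> pattern;
# confirm the exact identifiers on the Hugging Face Hub before use.
checkpoints = [
    "nvidia/parakeet-rnnt-1.1b",  # best accuracy, modest speed
    "nvidia/parakeet-ctc-1.1b",   # accuracy/speed middle ground
    "nvidia/parakeet-rnnt-0.6b",  # strong accuracy, fast inference
    "nvidia/parakeet-ctc-0.6b",   # fastest inference
]
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=checkpoints[0])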

Parakeet models exhibit resilience against non-speech segments, including music and silence, effectively preventing the generation of hallucinated transcripts.

Built using the NVIDIA NeMo toolkit, Parakeet prioritizes user-friendliness and flexibility. With pre-trained checkpoints readily available, integrating the model into your projects is a breeze. Whether looking for immediate inference capabilities or fine-tuning for specific tasks, NeMo provides a robust and intuitive framework to leverage the model's full potential.

Key benefits of Parakeet models:

  • State-of-the-art accuracy: Superior WER performance across diverse audio sources and domains with strong robustness to non-speech segments.
  • Multiple model sizes: 0.6B- and 1.1B-parameter variants, with the larger models offering more robust comprehension of complex speech patterns.
  • Open source and extensible: Built on NVIDIA NeMo, allowing for seamless integration and customization.
  • Pre-trained checkpoints: Ready-to-use models for inference or fine-tuning.
  • Permissive license: Model checkpoints are released under the CC-BY-4.0 license and can be used in any commercial application.

Parakeet is a major step forward in the evolution of conversational AI. Its exceptional accuracy, coupled with the flexibility and ease of use offered by NeMo, empowers developers to create more natural and intuitive voice-powered applications. The possibilities are endless, from enhancing the accuracy of virtual assistants to enabling seamless real-time communication.

The Parakeet family of models achieves state-of-the-art numbers on the HuggingFace Leaderboard. Users can try out parakeet-rnnt-1.1b firsthand in the Gradio demo. To access the model locally and explore the toolkit, visit the NVIDIA NeMo GitHub page.

Architecture Details

Parakeet models are based on the Fast Conformer architecture published at ASRU 2023. Fast Conformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling, a modified convolution kernel size, and an efficient subsampling module. Additionally, it supports inference on very long audio segments (up to 11 hours of speech) on an A100 80GB card using local attention. The model is trained end-to-end with a Transducer (RNNT) or Connectionist Temporal Classification (CTC) decoder. For further details on long-audio inference, refer to the ICASSP 2024 paper “Investigating End-to-End ASR Architectures for Long Form Audio Transcription”.
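
As a quick sanity check, the settings described above can be read off a loaded checkpoint's config. A minimal sketch, assuming the standard NeMo Conformer config keys (exact keys may vary across NeMo versions):

import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-ctc-0.6b")
# Fast Conformer applies 8x depthwise-separable convolutional downsampling
print(asr_model.cfg.encoder.subsampling_factor)    # expected: 8
# The released checkpoints are trained with global (full-context) attention
print(asr_model.cfg.encoder.self_attention_model)  # expected: rel_pos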


Figure 2. The Fast Conformer architecture: downsampling blocks, Conformer encoder blocks with limited context attention (LCA), and a global token (GT).

Usage

NVIDIA NeMo can be installed as a pip package as shown below. Cython and PyTorch (2.0 and above) should be installed before installing the NeMo Toolkit.
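
For example, the prerequisites can be installed first (a sketch; choose the PyTorch build that matches your CUDA version, per the instructions on pytorch.org):

pip install Cython "torch>=2.0"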

Then simply use:

pip install nemo_toolkit['asr']

Once installed, you can evaluate a list of audio files as follows:

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-rnnt-1.1b")
transcript = asr_model.transcribe(["some_audio_file.wav"])
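
A sketch of batch transcription follows. The file names are hypothetical, and batch_size is a standard transcribe() argument; note that, depending on the NeMo version and decoder, transcribe() may return a plain list of strings (CTC) or a tuple of hypothesis lists (RNNT), so inspect the result before unpacking it.

# Hypothetical file names; batch_size controls how many files are decoded at once
results = asr_model.transcribe(["meeting_1.wav", "meeting_2.wav"], batch_size=2)
print(results)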

Long-Form Speech Inference

Once a Fast Conformer model is loaded, you can easily switch its attention type to limited context attention after the model is built. You can also enable audio chunking in the subsampling module to perform inference on huge audio files!

Note

These models were trained with global attention, and switching to local attention will degrade their performance. However, they will still be able to transcribe long audio files reasonably well.

For limited context attention on huge files (up to 11 hours on an A100 80GB), perform the following steps:

# Enable local (limited context) attention
asr_model.change_attention_model("rel_pos_local_attn", [128, 128])

# Enable chunking for the subsampling module
asr_model.change_subsampling_conv_chunking_factor(1)  # 1 = auto select

# Transcribe a huge audio file
asr_model.transcribe(["<path to a huge audio file>.wav"])  # 10+ hours!
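
To return to regular short-form inference afterwards, the attention type can be switched back. A sketch, assuming "rel_pos" with a [-1, -1] context is the full-context default for these checkpoints, as in standard NeMo Conformer configs:

# Restore global (full-context) attention; [-1, -1] means unlimited context
asr_model.change_attention_model("rel_pos", [-1, -1])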

Additional Resources