
Research Notes

NVIDIA NeMo Canary Model Pushes the Frontier of Speech Recognition and Translation

The NVIDIA NeMo team is thrilled to announce Canary, a multilingual model that sets a new standard in speech-to-text recognition and translation. Canary transcribes speech in English, Spanish, German, and French, and also generates text with punctuation and capitalization. Canary supports bi-directional translation between English and the three other supported languages. Canary takes first place on the HuggingFace Open ASR leaderboard with an average word error rate of 6.67%, outperforming all other open-source models by a wide margin.
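
For reference, here is a minimal sketch of loading Canary through NeMo's generic pretrained-model API; the checkpoint name "nvidia/canary-1b" and the default English-transcription behaviour of transcribe() are assumptions modelled on the Parakeet example later on this page, so check the Canary model card for the exact arguments (for example, selecting source and target languages).

import nemo.collections.asr as nemo_asr
# Restore the Canary checkpoint; from_pretrained resolves the concrete model class.
canary_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/canary-1b")
# Default transcription of a local audio file; translation and other languages
# are selected via additional arguments described in the model card.
transcript = canary_model.transcribe(["some_audio_file.wav"])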


Unveiling NVIDIA NeMo's Parakeet-TDT -- Turbocharged ASR with Unrivaled Accuracy

Earlier this month, we announced Parakeet, a cutting-edge collection of state-of-the-art ASR models built with NVIDIA's NeMo toolkit and developed jointly with Suno.ai. Today, we're thrilled to announce the latest addition to the Parakeet family -- Parakeet-TDT. Parakeet-TDT achieves unrivaled accuracy while running 64% faster than our previous best model, making it a great choice for powering speech recognition engines in diverse environments.

The "TDT" in Parakeet-TDT is short for "Token-and-Duration Transducer", a novel sequence modeling architecture developed by NVIDIA and is open-sourced through NVIDIA's NeMo toolkit. Our research on TDT models, presented in a paper at the ICML 2023 conference, showcases the superior speed and recognition accuracy of TDT models compared to conventional Transducers of similar sizes.

To put things in perspective, our Parakeet-TDT model with 1.1 billion parameters outperforms the similar-sized Parakeet-RNNT-1.1b in accuracy, measured as the average performance across the 9 benchmarks on the HuggingFace Leaderboard. Notably, Parakeet-TDT is the first model to achieve an average WER below 7.0 on the leaderboard. Additionally, it achieves an impressive real-time factor (RTF) of 8.8e-3, 64% faster than Parakeet-RNNT-1.1b's RTF of 14.4e-3. Remarkably, Parakeet-TDT's RTF is even 40% faster than that of Parakeet-RNNT-0.6b (RTF 12.3e-3), despite the latter having about half the model size.


Figure 1. HuggingFace Leaderboard as of 01/31/2024.

Use the Parakeet-TDT model in your code

To run speech recognition with Parakeet-TDT, install NVIDIA NeMo as a pip package as shown below. Cython and PyTorch (2.0 or above) should be installed before attempting to install the NeMo toolkit.

pip install nemo_toolkit['asr']

Once NeMo is installed, you can use Parakeet-TDT to recognize your audio files as follows:

import nemo.collections.asr as nemo_asr
# Download the pretrained Parakeet-TDT checkpoint and restore the model.
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-1.1b")
# transcribe() accepts a list of audio file paths and returns their transcripts.
transcript = asr_model.transcribe(["some_audio_file.wav"])

Understanding Token-and-Duration Transducers

Token-and-Duration Transducers (TDT) represent a significant advancement over traditional Transducer models by drastically reducing wasteful computations during the recognition process. To grasp this improvement, let's delve into the workings of a typical Transducer model.


Figure 2. Transducer Model Architecture

Transducer models, as illustrated in Figure 2, consist of an encoder, a decoder, and a joiner. During speech recognition, the encoder processes audio signals, extracting crucial information from each frame. The decoder extracts information from the text predicted so far. The joiner then combines the outputs from the encoder and decoder and predicts a text token for each audio frame. From the joiner's perspective, a frame typically covers 40 to 80 milliseconds of audio signal, while on average people speak about one word per 400 milliseconds. As a result, many frames are not associated with any text output; for those frames, the Transducer predicts a "blank" symbol. A typical sequence of predictions from a Transducer looks something like:

_ _ _ _ NVIDIA _ _ _ _ is _ _ _ a _ _ great _ _ _ _ _ place _ _ _ _ to work _ _ _

where _ represents the blank symbol. To generate the final recognition output, the model deletes all the blanks and produces the output

NVIDIA is a great place to work

As we can see, there are many blank symbols in the original output, which means the Transducer model wasted a lot of time on "blank frames" -- frames for which the model predicts blanks that don't contribute to the final output.
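
As a toy illustration, the short sketch below collapses such a Transducer-style output into the final transcript by dropping the blank symbols; the underscore and the space-separated words are stand-ins for the model's internal blank and subword tokens.

# Toy illustration: collapse a Transducer output by removing blank symbols.
# "_" stands in for the blank token; real models emit subword tokens, not words.
raw_output = "_ _ _ _ NVIDIA _ _ _ _ is _ _ _ a _ _ great _ _ _ _ _ place _ _ _ _ to work _ _ _"
tokens = raw_output.split()
transcript = " ".join(tok for tok in tokens if tok != "_")
print(transcript)  # NVIDIA is a great place to work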


Figure 3. TDT Model Architecture

TDT is designed to mitigate wasted computation by intelligently detecting and skipping blank frames during recognition. As Figure 3 shows, when a TDT model processes a frame, it simultaneously predicts two things:

  1. probability of token PT(v|t, u): the token that should be predicted at the current frame;
  2. probability of duration PD(d|t, u): the number of frames the current token lasts before the model can make the next token prediction.

The TDT model is trained to maximize the number of frames skipped via the duration prediction while maintaining the same recognition accuracy. In the example above, unlike a conventional Transducer that predicts a token for every speech frame, the TDT model can simplify the process as follows:

frame 1:  predict token=_,      duration=4
frame 5:  predict token=NVIDIA, duration=5
frame 10: predict token=is,     duration=4
frame 14: predict token=a,      duration=3
frame 17: predict token=great,  duration=6
frame 23: predict token=place,  duration=5
frame 28: predict token=to,     duration=1
frame 29: predict token=work,   duration=4
frame 33: reached the end of audio, recognition completed.

In this toy example, TDT reduces the number of predictions the model has to make from 33 to 8. In our extensive experiments with TDT models, we see that this optimization indeed leads to a substantial acceleration in recognition speed. Our research has also demonstrated that TDT models exhibit enhanced robustness to noisy speech and token repetitions in the text compared to traditional Transducer models. Note that this blog post simplifies certain aspects of Transducer models in order to better illustrate the design differences between Transducers and TDT; we refer interested readers to our paper for the technical details.
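
To make the frame-skipping behaviour concrete, here is a schematic greedy-decoding loop in the spirit of TDT, mirroring the toy example above. The predict_token_and_duration function is a hypothetical stand-in for the joiner's token head PT(v|t, u) and duration head PD(d|t, u), not NeMo's actual decoding API, and zero durations (several tokens emitted on one frame) are ignored for simplicity.

# Schematic TDT-style greedy decoding over a sequence of frames.
# predict_token_and_duration is a hypothetical stand-in for the joiner's
# token and duration heads; it returns (token, number_of_frames_to_skip).
def greedy_tdt_decode(frames, predict_token_and_duration):
    t = 0
    tokens = []
    while t < len(frames):
        token, duration = predict_token_and_duration(frames[t], tokens)
        if token != "_":          # "_" is the blank symbol
            tokens.append(token)
        t += max(duration, 1)     # jump ahead by the predicted duration
    return " ".join(tokens)

In this simplified view, a conventional Transducer visits every single frame, whereas the loop above jumps straight to the next frame where a prediction is needed.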



Announcing NVIDIA NeMo Parakeet ASR Models for Pushing the Boundaries of Speech Recognition

NVIDIA NeMo, a leading open-source toolkit for conversational AI, announces the release of Parakeet, a family of state-of-the-art automatic speech recognition (ASR) models (Figure 1) capable of transcribing spoken English with exceptional accuracy. Developed in collaboration with Suno.ai, Parakeet ASR models mark a significant leap forward in speech recognition, paving the way for more natural and efficient human-computer interactions.


Figure 1. HuggingFace Leaderboard as of 01/03/2024.

NVIDIA announces four Parakeet models, differentiated by their decoder type (RNN Transducer or Connectionist Temporal Classification) and their size. They boast 0.6-1.1 billion parameters and are capable of tackling diverse audio environments. Trained on only a 64,000-hour dataset encompassing various accents, domains, and noise conditions, the models deliver exceptional word error rate (WER) performance across benchmark datasets, outperforming previous models.

  • Parakeet RNNT 1.1B - Best recognition accuracy, modest inference speed. Best used when the most accurate transcriptions are necessary.
  • Parakeet CTC 1.1B - Fast inference, strong recognition accuracy. A great middle ground between accuracy and speed of inference.
  • Parakeet RNNT 0.6B - Strong recognition accuracy and fast inference. Useful for large-scale inference on limited resources.
  • Parakeet CTC 0.6B - Fastest speed, modest recognition accuracy. Useful when transcription speed is the most important factor.
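
All four checkpoints can be loaded through the same NeMo API shown in the Parakeet-TDT section above. The sketch below assumes the models are published under names matching the list (e.g. "nvidia/parakeet-rnnt-1.1b", "nvidia/parakeet-ctc-0.6b"), so verify the exact identifiers on the model cards.

import nemo.collections.asr as nemo_asr
# Pick the checkpoint that matches your accuracy/speed trade-off, e.g.
# "nvidia/parakeet-rnnt-1.1b", "nvidia/parakeet-ctc-1.1b",
# "nvidia/parakeet-rnnt-0.6b", or "nvidia/parakeet-ctc-0.6b".
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-rnnt-1.1b")
transcript = asr_model.transcribe(["some_audio_file.wav"])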

Training NeMo RNN-T Models Efficiently with Numba FP16 Support

In the field of automatic speech recognition research, the RNN Transducer (RNN-T) is a type of sequence-to-sequence model well known for achieving state-of-the-art transcription accuracy in both offline and real-time (a.k.a. "streaming") speech recognition applications. RNN-T models are also notorious for their high memory requirements. In this blog post we explain why they have this reputation and how NeMo lets you side-step many of these memory issues, including how to make use of Numba's recent addition of FP16 support.


How does forced alignment work?

In this blog post we will explain how you can use an Automatic Speech Recognition (ASR) model to match up the text spoken in an audio file with the time when it is spoken. Once you have this information, you can do downstream tasks such as:

  • creating subtitles such as in the video below or in the Hugging Face space

  • obtaining durations of tokens or words to use in Text To Speech or speaker diarization models

  • splitting long audio files (and their transcripts) into shorter ones. This is especially useful when making datasets for training new ASR models, since audio files that are too long will not fit onto a single GPU during training.


Introducing NeMo Forced Aligner

Today we introduce NeMo Forced Aligner (NFA): a NeMo-based tool for forced alignment.

NFA allows you to obtain token-level, word-level, and segment-level timestamps for the words spoken in an audio file. NFA produces timestamp information in a variety of output file formats, including subtitle files, which you can use to create videos such as the one below:
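
NFA takes as input a NeMo-style manifest that pairs each audio file with the text to align against it. As a minimal sketch (the field names follow NeMo's usual manifest convention and are an assumption here; check the NFA documentation for the exact schema and command-line arguments), such a manifest can be written from Python like this:

import json
# Each line of a NeMo-style manifest is one JSON object describing an utterance.
utterances = [
    {"audio_filepath": "some_audio_file.wav", "text": "nvidia is a great place to work"},
]
with open("manifest.json", "w") as f:
    for utt in utterances:
        f.write(json.dumps(utt) + "\n")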


Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition

The Conformer architecture, introduced by Gulati et al., has become a standard architecture not only for Automatic Speech Recognition, but has also been extended to other tasks such as Spoken Language Understanding and Speech Translation, and is used as a backbone for Self-Supervised Learning on various downstream tasks. While Conformers are highly accurate on each of these tasks and can be extended for use in others, they are also very computationally expensive. This is due to the quadratic complexity of the attention mechanism, which makes it difficult to train and infer on the long sequences these models receive as input because of the granular stride of audio pre-processors (commonly Mel spectrograms, or even the raw audio signal in certain models, with a 10-millisecond stride). Furthermore, the memory requirement of quadratic attention also significantly limits the audio duration that can be handled during inference.
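
To see why the quadratic term matters, the back-of-the-envelope sketch below counts the pairwise attention scores per layer as audio length grows under a 10 ms feature stride; the 4x subsampling factor is only an illustrative assumption about the encoder, not a statement about any particular model's exact design.

# Back-of-the-envelope: pairwise attention scores per layer grow quadratically
# with the sequence length that self-attention operates on.
STRIDE_MS = 10       # typical mel-spectrogram frame stride
SUBSAMPLING = 4      # illustrative encoder subsampling factor (assumption)

for minutes in (1, 5, 15):
    frames = minutes * 60 * 1000 // STRIDE_MS    # pre-processor frames
    seq_len = frames // SUBSAMPLING              # frames reaching self-attention
    print(f"{minutes:>2} min audio -> seq_len={seq_len:>6}, scores per layer ~ {seq_len**2:,}")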


NeMo on the NVIDIA Technical blog in 2023

The following blog posts have been published by the NeMo team on the NVIDIA Technical blog in 2023.

January 2023

Based on work accepted to SLT 2022:


NeMo on the NVIDIA Technical blog in 2022

The following blog posts were published by the NeMo team on the NVIDIA Technical blog in 2022.

August 2022


September 2022

Based on work accepted to Interspeech 2022:


NeMo Blog Posts and Announcements

NVIDIA NeMo is a conversational AI toolkit that supports multiple domains such as Automatic Speech Recognition (ASR), Text To Speech generation (TTS), Speaker Recognition (SR), Speaker Diarization (SDR), Natural Language Processing (NLP), Neural Machine Translation (NMT), and much more. NVIDIA Riva has long been the toolkit that enables efficient deployment of NeMo models. In recent months, NeMo Megatron has added support for training and inference on large language models (up to 1 trillion parameters!).

As NeMo becomes capable of more advanced tasks, such as p-tuning / prompt tuning of NeMo Megatron models, domain adaptation of ASR models using Adapter modules, customizable generative TTS models and much more, we introduce this website as a collection of blog posts and announcements for:

  • Technical deep dives of NeMo's capabilities
  • Presenting state-of-the-art research results
  • Announcing new capabilities and domains of research that our team will work on.

Visit NVIDIA NeMo to get started