2024-02-08 · 5 minute read
NVIDIA NeMo Canary Model Pushes the Frontier of Speech Recognition and Translation
The NVIDIA NeMo team is thrilled to announce Canary, a multilingual model that sets a new standard in speech-to-text recognition and translation. Canary transcribes speech in English, Spanish, German, and French, generating text with punctuation and capitalization, and supports bidirectional translation between English and the three other supported languages. Canary holds first place on the HuggingFace Open ASR leaderboard with an average word error rate of 6.67%, outperforming all other open-source models by a wide margin.
Canary is trained on a combination of public and in-house data, using 85,000 hours of annotated speech to learn speech recognition. To teach Canary translation, we used NVIDIA NeMo machine translation models to generate translations of the original transcripts into all supported languages. Despite using an order of magnitude less data, Canary outperforms the similarly sized Whisper-large-v3 and SeamlessM4T-Medium-v1 on both transcription and translation tasks.
Canary is an encoder-decoder model built on several innovations from the NVIDIA NeMo team. The encoder is Fast-Conformer, an efficient Conformer variant optimized for roughly 3x savings in compute and 4x savings in memory. The encoder processes audio as log-mel spectrogram features, and the decoder, a Transformer decoder, generates output text tokens auto-regressively. The decoder is prompted with special tokens that control whether Canary performs transcription or translation. Canary also incorporates the Concatenated tokenizer, which offers explicit control of the output token space.
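To illustrate the feature-extraction step described above, here is a minimal sketch of turning a waveform into log-spectrogram frames. This is not NeMo's actual featurizer: the mel filterbank is omitted, and the frame/hop sizes are typical placeholder values.

```python
# Illustrative sketch (not NeMo's featurizer): slicing a waveform into
# overlapping windowed frames and taking a log power spectrum per frame.
import numpy as np

def log_power_spectrogram(wav, n_fft=512, hop=160):
    # Slice the waveform into overlapping frames of n_fft samples
    frames = np.stack([wav[i:i + n_fft]
                       for i in range(0, len(wav) - n_fft + 1, hop)])
    window = np.hanning(n_fft)
    # Power spectrum of each windowed frame, then log for dynamic range
    spec = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    return np.log(spec + 1e-10)

sr = 16000
t = np.arange(sr) / sr
wav = np.sin(2 * np.pi * 440.0 * t)   # 1 second of a 440 Hz tone
feats = log_power_spectrogram(wav)
print(feats.shape)                    # (frames, n_fft // 2 + 1) -> (97, 257)
```

In a real pipeline, a mel filterbank is applied to the power spectrum before the log, and the resulting feature matrix is what the Fast-Conformer encoder consumes.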
The model weights are distributed under the research-friendly, non-commercial CC BY-NC 4.0 license, while the code used to train the model is available under the Apache 2.0 license in the NVIDIA NeMo toolkit.
Transcribing with Canary
To use Canary, the NVIDIA NeMo toolkit needs to be installed as a pip package. Cython and PyTorch (2.0 or above) should be installed before installing the NeMo toolkit.
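A typical installation might look like the following. This is a sketch of the usual steps; the `nemo_toolkit[asr]` extras name follows NeMo's documented pip convention, and you should check the NeMo toolkit documentation for the exact instructions for your environment.

```shell
# Prerequisites first, then NeMo with the ASR collection
pip install Cython
pip install torch            # PyTorch 2.0 or above
pip install "nemo_toolkit[asr]"
```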
Once NeMo is installed, you can use Canary to transcribe or translate audio files as follows:
# Load Canary model
from nemo.collections.asr.models import EncDecMultiTaskModel
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')
# Prepare input - Example lines in transcribe_manifest.json
# Example to transcribe English audio
{
  "audio_filepath": "/path/to/audio.wav",  # path to the audio file
  "duration": 40.0,  # duration of the audio in seconds
  "taskname": "asr",  # use "asr" for transcription and "ast" for speech-to-text translation
  "source_lang": "en",  # choices=['en','de','es','fr']; set source_lang = target_lang for ASR
  "target_lang": "en",  # choices=['en','de','es','fr']; set source_lang='en', target_lang='de' for En -> De translation
  "pnc": "yes"  # whether to produce punctuation and capitalization, choices=['yes', 'no']
}
# Example to translate from English audio to German text
{
  "audio_filepath": "/path/to/audio.wav",
  "duration": 40.0,
  "taskname": "ast",
  "source_lang": "en",
  "target_lang": "de",
  "pnc": "yes"
}
# Finally, transcribe
transcript = canary_model.transcribe(paths2audio_files="<path to transcribe_manifest.json>", batch_size=4)
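The manifest above is in JSON Lines format: one JSON object per line, with no comments. The two example entries can be written programmatically with the standard `json` module; the audio path and duration below are placeholders for illustration.

```python
# Build a JSON-lines manifest for Canary (one JSON object per line).
# The audio path and duration are placeholders, not real files.
import json

entries = [
    {"audio_filepath": "/path/to/audio.wav", "duration": 40.0,
     "taskname": "asr", "source_lang": "en", "target_lang": "en", "pnc": "yes"},
    {"audio_filepath": "/path/to/audio.wav", "duration": 40.0,
     "taskname": "ast", "source_lang": "en", "target_lang": "de", "pnc": "yes"},
]

with open("transcribe_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```

The resulting `transcribe_manifest.json` can then be passed to `canary_model.transcribe(...)` as in the snippet above.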