
2024-02-08 · 5 minute read

NVIDIA NeMo Canary Model Pushes the Frontier of Speech Recognition and Translation

The NVIDIA NeMo team is thrilled to announce Canary, a multilingual model that sets a new standard in speech-to-text recognition and translation. Canary transcribes speech in English, Spanish, German, and French, and generates text with punctuation and capitalization. Canary also supports bi-directional translation between English and the three other supported languages. Canary ranks first on the HuggingFace Open ASR leaderboard with an average word error rate of 6.67%, outperforming all other open source models by a wide margin.

Canary can transcribe and translate English, German, Spanish and French.

Canary is trained on a combination of public and in-house data, using 85,000 hours of annotated speech to learn speech recognition. To teach Canary translation, we used NVIDIA NeMo machine translation models to generate translations of the original transcripts into all supported languages. Despite using an order of magnitude less data, Canary outperforms the similarly sized Whisper-large-v3 and SeamlessM4T-Medium-v1 on both transcription and translation tasks.


Figure 1. Speech recognition: average WER on MCV 16.1 test sets for English, Spanish, French, and German (Lower is better).
Figure 2. Speech Translation: (left) average BLEU scores on Fleurs and MExpresso test sets translating from English to Spanish, French, and German. (right) average BLEU scores on Fleurs and CoVoST test sets translating from Spanish, French, and German to English (Higher is better).

Canary is an encoder-decoder model built on several innovations from the NVIDIA NeMo team. The encoder is Fast-Conformer, an efficient Conformer architecture optimized for ~3x savings in compute and ~4x savings in memory. The encoder processes audio in the form of log-mel spectrogram features, and the decoder, a transformer decoder, generates output text tokens auto-regressively. The decoder is prompted with special tokens to control whether Canary performs transcription or translation. Canary also incorporates a concatenated tokenizer, which offers explicit control over the output token space.
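To illustrate the log-mel spectrogram features the encoder consumes, here is a minimal NumPy-only sketch. The parameter values (16 kHz sampling rate, 512-point FFT, 10 ms hop, 80 mel bins) are typical ASR defaults chosen for illustration, not necessarily Canary's exact configuration.

```python
# Minimal sketch of log-mel spectrogram feature extraction (NumPy only).
# Parameter values are common ASR defaults, assumed for illustration.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(audio, sr=16000, n_fft=512, hop=160, n_mels=80):
    # Frame the signal and apply a Hann window
    n_frames = 1 + (len(audio) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = audio[idx] * np.hanning(n_fft)
    # Power spectrum via real FFT
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Triangular mel filterbank spanning 0 Hz to Nyquist
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)
    # Log compression with a small floor for numerical stability
    return np.log(power @ fbank.T + 1e-10)

# One second of 16 kHz audio yields a (n_frames, n_mels) feature matrix
feats = log_mel_spectrogram(np.random.randn(16000).astype(np.float32))
print(feats.shape)  # (97, 80)
```

The resulting (time, mel-bin) matrix is the 2D representation the Fast-Conformer encoder downsamples and attends over.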

The model weights are distributed under a research-friendly non-commercial CC BY-NC 4.0 license, while the code used to train this model is available under the Apache 2.0 license from NVIDIA NeMo Toolkit.

Transcribing with Canary

To use Canary, the NVIDIA NeMo toolkit must be installed as a pip package, as shown below. Install Cython and PyTorch (2.0 or above) before attempting to install the NeMo toolkit.

pip install git+https://github.com/NVIDIA/NeMo.git#egg=nemo_toolkit[asr]

Once NeMo is installed, you can use Canary to transcribe or translate audio files as follows:

# Load Canary model 
from nemo.collections.asr.models import EncDecMultiTaskModel
canary_model = EncDecMultiTaskModel.from_pretrained('nvidia/canary-1b')

# Prepare input - Example lines in transcribe_manifest.json
    # Example entry to transcribe English audio
    "audio_filepath": "/path/to/audio.wav",  # path to the audio file
    "duration": 40.0,  # duration of the audio in seconds
    "taskname": "asr",  # use "asr" for transcription and "ast" for speech-to-text translation
    "source_lang": "en",  # set `source_lang`=`target_lang` for ASR, choices=['en','de','es','fr']; set `source_lang`='en' and `target_lang`='de' for En -> De translation
    "target_lang": "en",  # choices=['en','de','es','fr']
    "pnc": 'yes',  # whether to output punctuation and capitalization, choices=['yes', 'no']

    # Example to translate from English audio to German text
    "audio_filepath": "/path/to/audio.wav",  # path to the audio file
    "duration": 40.0,  
    "taskname": "ast",  
    "source_lang": "en",  
    "target_lang": "de", 
    "pnc": 'yes',

# Finally transcribe
transcript = canary_model.transcribe(paths2audio_files="<path to transcribe_manifest.json>", batch_size=4)
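The manifest is a JSON-lines file: one JSON object per audio file, with the fields shown above. As a sketch, such a manifest can be generated programmatically; the audio paths below are placeholders you would replace with your own files.

```python
# Hedged sketch: writing a JSON-lines manifest with one entry per audio
# file, using the field names from the examples above. Paths are
# placeholders for illustration.
import json

entries = [
    # Transcribe English audio (source and target language match)
    {"audio_filepath": "/path/to/audio1.wav", "duration": 40.0,
     "taskname": "asr", "source_lang": "en", "target_lang": "en",
     "pnc": "yes"},
    # Translate English audio to German text
    {"audio_filepath": "/path/to/audio2.wav", "duration": 40.0,
     "taskname": "ast", "source_lang": "en", "target_lang": "de",
     "pnc": "yes"},
]

with open("transcribe_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```

Each line must be a complete JSON object on its own (not a JSON array), which is why the entries are written one per line rather than with a single `json.dump` call.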

Additional Resources