MCV Portuguese

This config was originally designed for the Mozilla Common Voice (MCV) dataset 15.0 release, but it should work with any subsequent release as well.

It performs the following data processing steps:

  1. Extracts and converts all data to the NeMo format.

  2. Replaces certain unsupported characters, abbreviations and punctuation marks with equivalent supported versions.

  3. Drops any data that contains high/low character occurrence.

  4. Drops any data that contains symbols not in the supported alphabet.
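The filtering logic in steps 2 to 4 can be sketched roughly as follows. This is a minimal illustration and not the code the config actually runs: the replacement table, the alphabet, the thresholds, and the reading of "character occurrence" as characters per second of audio are all assumptions made for the example.

```python
# Illustrative only: the replacement table, alphabet and thresholds below are
# assumptions, not the values used by the actual config.
REPLACEMENTS = {"“": '"', "”": '"', "…": ".", "–": "-"}
ALPHABET = set("abcdefghijklmnopqrstuvwxyzáàâãçéêíóôõúü .,?-\"")
MIN_CHAR_RATE, MAX_CHAR_RATE = 1.0, 20.0  # characters per second (assumed)


def normalize(text):
    """Step 2: map unsupported characters to supported equivalents."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text


def char_rate_ok(entry):
    """Step 3: reject entries whose characters-per-second rate is out of range."""
    if entry["duration"] <= 0:
        return False
    rate = len(entry["text"]) / entry["duration"]
    return MIN_CHAR_RATE <= rate <= MAX_CHAR_RATE


def alphabet_ok(entry):
    """Step 4: reject entries containing symbols outside the alphabet."""
    return all(ch in ALPHABET for ch in entry["text"].lower())


def process(entries):
    """Normalize every entry, then keep only those passing both filters."""
    cleaned = [{**e, "text": normalize(e["text"])} for e in entries]
    return [e for e in cleaned if char_rate_ok(e) and alphabet_ok(e)]
```

Normalization runs before the filters so that an entry is not dropped for a symbol that could have been replaced with a supported one.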

Required arguments.

  • workspace_dir: specify the workspace folder where all audio files will be stored. You need to manually place the downloaded MCV Portuguese data inside the <workspace_dir>/raw_data/ subfolder.

  • data_split: should be “train”, “dev” or “test”.
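Before launching the config, it can help to check both arguments up front. The helper below is a hypothetical sketch (`check_arguments` is not part of the toolkit); it only verifies that the `raw_data` subfolder exists and that the split name is one of the three accepted values.

```python
from pathlib import Path

VALID_SPLITS = {"train", "dev", "test"}


def check_arguments(workspace_dir, data_split):
    """Hypothetical pre-flight check; returns the raw_data folder on success."""
    if data_split not in VALID_SPLITS:
        raise ValueError(
            f"data_split must be one of {sorted(VALID_SPLITS)}, got {data_split!r}"
        )
    raw_data = Path(workspace_dir) / "raw_data"
    if not raw_data.is_dir():
        raise FileNotFoundError(
            f"Place the downloaded MCV Portuguese data in {raw_data}"
        )
    return raw_data
```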

Output format.

This config dumps the final manifest at ${workspace_dir}/${data_split}_manifest.json. The output manifest contains the following fields:

  • audio_filepath (str): relative path to the audio file.

  • text (str): transcription, including punctuation “.,?” and capitalization.

  • duration (float): audio duration in seconds.
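NeMo manifests store one JSON object per line, so the output can be inspected with a few lines of Python. The sketch below assumes only the three fields listed above; the function names are illustrative.

```python
import json


def read_manifest(path):
    """Read a NeMo-style manifest: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def total_hours(entries):
    """Sum the 'duration' field (seconds) and convert to hours."""
    return sum(e["duration"] for e in entries) / 3600
```

For example, `total_hours(read_manifest("train_manifest.json"))` gives the amount of audio that survived filtering for the train split, assuming the manifest sits in the current directory.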

Config link: dataset_configs/portuguese/mcv/config.yaml