MLS Portuguese

This config performs the following data processing steps:

  1. Downloads and extracts all the Portuguese data from the MLS dataset.

  2. Converts all flac audio files to wav format.

  3. Replaces certain non-supported characters, abbreviations and punctuation marks with equivalent supported versions.

  4. Drops any data with an abnormally high or low character occurrence.

  5. Drops any data that contains symbols not in the supported alphabet.
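The two filtering steps (4 and 5) can be illustrated with a minimal sketch. It is not the config's actual implementation: the alphabet, the thresholds, and the helper names (`has_unsupported_symbols`, `char_rate`, `keep_entry`) are hypothetical, and "character occurrence" is assumed here to mean characters per second of audio.

```python
# Hypothetical alphabet and thresholds, for illustration only;
# the real values are defined in the config itself.
SUPPORTED_ALPHABET = set(" abcdefghijklmnopqrstuvwxyzáàâãçéêíóôõúü.,?")
MIN_CHARS_PER_SEC = 1.0
MAX_CHARS_PER_SEC = 25.0


def has_unsupported_symbols(text: str) -> bool:
    """Return True if any character falls outside the supported alphabet."""
    return any(ch not in SUPPORTED_ALPHABET for ch in text.lower())


def char_rate(entry: dict) -> float:
    """Characters per second of audio (assumed meaning of 'character occurrence')."""
    return len(entry["text"]) / entry["duration"]


def keep_entry(entry: dict) -> bool:
    """Apply both filters: supported alphabet and character-rate bounds."""
    if entry["duration"] <= 0:
        return False
    if has_unsupported_symbols(entry["text"]):
        return False
    return MIN_CHARS_PER_SEC <= char_rate(entry) <= MAX_CHARS_PER_SEC


# Made-up manifest entries used only to exercise the filters.
entries = [
    {"audio_filepath": "a.wav", "text": "Olá, tudo bem?", "duration": 2.0},
    {"audio_filepath": "b.wav", "text": "Preço: 5€", "duration": 1.5},  # unsupported symbols
    {"audio_filepath": "c.wav", "text": "Sim.", "duration": 30.0},  # too few characters per second
]
kept = [e for e in entries if keep_entry(e)]
```

In this sketch only the first entry survives: the second contains symbols outside the assumed alphabet, and the third has far too little text for its duration.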

Required arguments.

  • workspace_dir: specify the workspace folder where all audio files will be stored.

  • data_split: should be “train”, “dev” or “test”.

Output format.

This config dumps the final manifest at ${workspace_dir}/${data_split}_manifest.json. The output manifest contains the following fields:

  • audio_filepath (str): relative path to the audio files.

  • text (str): transcription, including punctuation “.,?” and capitalization.

  • duration (float): audio duration in seconds.
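A short sketch of how the output manifest might be consumed is shown below. It assumes the manifest stores one JSON object per line, which is not stated above; the helper names `read_manifest` and `total_hours` and the sample entries are hypothetical, and a temporary folder stands in for `workspace_dir`.

```python
import json
import tempfile
from pathlib import Path


def read_manifest(path):
    """Read a manifest, assuming one JSON object per non-empty line."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def total_hours(entries):
    """Sum the 'duration' field (seconds) and convert to hours."""
    return sum(e["duration"] for e in entries) / 3600.0


# Write a tiny made-up manifest so the example is self-contained.
sample = [
    {"audio_filepath": "audio/0001.wav", "text": "Bom dia.", "duration": 2.5},
    {"audio_filepath": "audio/0002.wav", "text": "Como vai?", "duration": 3.0},
]
with tempfile.TemporaryDirectory() as workspace_dir:
    data_split = "dev"
    manifest_path = Path(workspace_dir) / f"{data_split}_manifest.json"
    with open(manifest_path, "w", encoding="utf-8") as f:
        for entry in sample:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
    entries = read_manifest(manifest_path)

print(f"{len(entries)} utterances, {total_hours(entries):.6f} hours")
```

Because `audio_filepath` is relative, a consumer would typically need to resolve it against a base folder before opening the audio; which folder that is depends on the config and is not assumed here.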

Config link: dataset_configs/portuguese/mls/config.yaml