MTEDX Portuguese


The config performs the following data processing.

  1. Downloads and extracts the Portuguese data from https://www.openslr.org/100/.

  2. Converts all flac audio files to wav format.

  3. Splits the audio into segments according to the timestamps given in the vtt files.

  4. Replaces certain non-supported characters, abbreviations and punctuation marks with equivalent supported versions.

  5. Drops any data with an unusually high or low character occurrence.

  6. Drops any data that contains symbols not in the supported alphabet.
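
The two filtering steps (5 and 6) can be illustrated with a minimal sketch. It is not the config's actual implementation: the alphabet, the thresholds, and the reading of "character occurrence" as characters per second of audio are all assumptions made for this example, and `keep_entry` is a hypothetical helper.

```python
# Hypothetical alphabet and thresholds, chosen only for illustration;
# the real values live in the config and may differ.
SUPPORTED_ALPHABET = set("abcdefghijklmnopqrstuvwxyzáàâãçéêíóôõúü .,?")
MIN_CHARS_PER_SEC = 1.0
MAX_CHARS_PER_SEC = 25.0


def keep_entry(entry):
    """Return True if the entry passes both illustrative filters."""
    text = entry["text"]
    duration = entry["duration"]
    # Step 6 (assumed form): reject text with any symbol outside the alphabet.
    if any(ch not in SUPPORTED_ALPHABET for ch in text.lower()):
        return False
    # Guard against a zero or negative duration before dividing.
    if duration <= 0:
        return False
    # Step 5 (assumed form): reject an implausibly high or low character rate.
    char_rate = len(text) / duration
    return MIN_CHARS_PER_SEC <= char_rate <= MAX_CHARS_PER_SEC


entries = [
    {"text": "Olá, tudo bem?", "duration": 1.5},  # passes both filters
    {"text": "Preço: 10€", "duration": 2.0},      # ":" and "€" are unsupported
    {"text": "Sim.", "duration": 30.0},           # far too few characters per second
]
kept = [e for e in entries if keep_entry(e)]
```

With these made-up thresholds, only the first entry survives. Running the alphabet check before the rate check means entries with unsupported symbols are rejected without computing a rate at all.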

Required arguments.

  • workspace_dir: specify the workspace folder where all audio files will be stored.

  • raw_data_dir: specify the folder into which the data will be downloaded.

  • data_split: should be “train”, “valid” or “test”.

Output format.

This config dumps the final manifest at ${workspace_dir}/${data_split}_manifest.json. The output manifest contains the following fields:

  • audio_filepath (str): relative path to the audio file.

  • text (str): transcription, including punctuation “.,?” and capitalization.

  • duration (float): audio duration in seconds.
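
To show how the output might be consumed, here is a short sketch that locates the manifest and sums the audio duration. It assumes the manifest stores one JSON object per line, which is a common layout for this kind of manifest but is not stated above; `manifest_path`, `read_manifest`, and `total_hours` are hypothetical helper names.

```python
import json
from pathlib import Path


def manifest_path(workspace_dir, data_split):
    """Build ${workspace_dir}/${data_split}_manifest.json as a Path."""
    return Path(workspace_dir) / f"{data_split}_manifest.json"


def read_manifest(path):
    """Yield one dict per non-empty line (assumes a JSON-lines layout)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


def total_hours(path):
    """Sum the duration field (seconds) over all entries, in hours."""
    return sum(entry["duration"] for entry in read_manifest(path)) / 3600
```

For example, `total_hours(manifest_path("/data/mtedx_pt", "train"))` would report the size of the training split in hours, where `/data/mtedx_pt` is a placeholder for your own workspace_dir.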

Config link: dataset_configs/portuguese/mtedx/config.yaml