Coraa Portuguese
The config performs the following data processing.
Downloads and extracts all the data from https://huggingface.co/datasets/gabrielrstan/CORAA-v1.1/tree/main
Replaces certain unsupported characters, abbreviations and punctuation marks with equivalent supported versions.
Drops any data with an unusually high or low character occurrence.
Drops any data that contains symbols not in the supported alphabet.
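The replacement and filtering steps above can be sketched in plain Python. This is only an illustration, not the code the config runs: the alphabet, the replacement table, the rate thresholds and all function names are hypothetical, and "character occurrence" is assumed here to mean characters per second of audio.

```python
import re

# Hypothetical alphabet and replacement table, for illustration only;
# the actual config defines its own values.
LETTERS = "abcdefghijklmnopqrstuvwxyzáàâãçéêíóôõú"
SUPPORTED_ALPHABET = set(LETTERS + LETTERS.upper() + " .,?")
REPLACEMENTS = {"…": ".", ";": ",", "!": "."}


def normalize_text(text):
    """Replace unsupported symbols with supported equivalents and tidy spaces."""
    for source, target in REPLACEMENTS.items():
        text = text.replace(source, target)
    return re.sub(r"\s+", " ", text).strip()


def has_unsupported_symbols(text):
    """Return True if any character falls outside the supported alphabet."""
    return not set(text) <= SUPPORTED_ALPHABET


def char_rate(text, duration):
    """Characters per second of audio (assumed meaning of 'character occurrence')."""
    return len(text) / duration


def process(entries, min_rate=1.0, max_rate=30.0):
    """Normalize each entry, then drop those that fail either filter."""
    kept = []
    for entry in entries:
        entry = {**entry, "text": normalize_text(entry["text"])}
        if has_unsupported_symbols(entry["text"]):
            continue  # contains symbols not in the supported alphabet
        rate = char_rate(entry["text"], entry["duration"])
        if rate < min_rate or rate > max_rate:
            continue  # character rate is too low or too high
        kept.append(entry)
    return kept
```

Applying the replacements before the alphabet check matters: an entry containing ";" is kept after it becomes ",", whereas checking first would drop it.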
Required arguments.
workspace_dir: specify the workspace folder where all audio files will be stored.
data_split: should be “train”, “dev” or “test”.
Output format.
This config dumps the final manifest at ${workspace_dir}/${data_split}_manifest.json.
The output manifest contains the following fields:
audio_filepath (str): relative path to the audio file.
text (str): transcription, including punctuation “.,?” and capitalization.
duration (float): audio duration in seconds.
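As a rough sketch of how the output might be consumed, the snippet below builds the manifest path from the two required arguments and loads the entries. It assumes the manifest holds one JSON object per line, which is the usual layout for this kind of manifest but is not stated on this page; the helper names are hypothetical.

```python
import json
from pathlib import Path

VALID_SPLITS = {"train", "dev", "test"}


def manifest_path(workspace_dir, data_split):
    """Build ${workspace_dir}/${data_split}_manifest.json for a valid split."""
    if data_split not in VALID_SPLITS:
        raise ValueError(f"data_split must be one of {sorted(VALID_SPLITS)}")
    return Path(workspace_dir) / f"{data_split}_manifest.json"


def read_manifest(path):
    """Load entries, assuming one JSON object per line (an assumption)."""
    entries = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries


def total_hours(entries):
    """Sum the 'duration' field (seconds) and convert to hours."""
    return sum(entry["duration"] for entry in entries) / 3600.0
```

Because audio_filepath is relative, a consumer would typically need to resolve it against a base directory before opening the audio; which directory that is depends on the setup.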
Config link: dataset_configs/portuguese/coraa/config.yaml