Mozilla Common Voice Arabic (MCV)

This config is designed for the 17.0 release of the Mozilla Common Voice (MCV) dataset, but should also work for subsequent releases.

The config performs the following data processing.

Replaces certain non-supported characters, abbreviations and punctuation marks with equivalent supported versions.
Drops any utterances with an abnormally high or low word rate.
Drops any utterances that contain symbols not in the supported alphabet.
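
The two drop steps can be illustrated with a minimal sketch. It is not the config's actual implementation: the `ALLOWED` alphabet and the `keep_entry` helper are hypothetical, and word rate is assumed to mean words per second of audio. The 0.8 and 3.0 defaults mirror the documented min_wordrate and max_wordrate values.

```python
import re

# Hypothetical alphabet: basic Arabic letters plus space.
# The alphabet actually supported by the config may differ.
ALLOWED = re.compile(r"^[\u0621-\u063A\u0641-\u064A ]+$")


def word_rate(text: str, duration: float) -> float:
    """Number of whitespace-separated words per second of audio."""
    return len(text.split()) / duration


def keep_entry(entry: dict, min_wordrate: float = 0.8, max_wordrate: float = 3.0) -> bool:
    """Return True if the entry passes both the alphabet and word-rate checks."""
    text, duration = entry["text"], entry["duration"]
    # Reject empty or invalid audio and any text with out-of-alphabet symbols.
    if duration <= 0 or not ALLOWED.match(text):
        return False
    return min_wordrate <= word_rate(text, duration) <= max_wordrate
```

An entry with two Arabic words over one second of audio (2 words per second) would be kept, while the same text stretched over 20 seconds, or any text containing Latin characters, would be dropped.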

Required arguments.

raw_dataset_dir: path to the tarred dataset.
workspace_dir: specify the workspace folder where all audio files will be stored.
data_split: should be one of “train”, “test”, “dev”, “invalidated”, “other”, “reported”, “validated”.
remove_punctuation: specify whether to remove punctuation or not. Should be “True” or “False”. Defaults to False.
remove_diacritics: specify whether to remove diacritics or not. Should be “True” or “False”. Defaults to True.
remove_tatweel: specify whether to remove tatweel marks or not. Should be “True” or “False”. Defaults to True.
normalize_ligature: specify whether to normalize ligatures or not. Should be “True” or “False”. Defaults to True.
min_duration: minimum duration of a segment in seconds. Defaults to 0.1s.
max_duration: maximum duration of a segment in seconds. Defaults to 20s.
min_wordrate: minimum word rate (words per second). Defaults to 0.8.
max_wordrate: maximum word rate (words per second). Defaults to 3.
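
To make the text-normalization flags concrete, here is a rough sketch of what each one could do. It is an illustration only, not the processors the config actually runs: the `normalize` function, the punctuation set, and the use of Unicode NFKC on Arabic presentation forms to expand ligatures are all assumptions.

```python
import re
import unicodedata

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan through sukun
TATWEEL = "\u0640"  # elongation character
# Hypothetical punctuation set (Arabic comma, semicolon, question mark + ASCII marks).
PUNCTUATION = re.compile(r"[\u060C\u061B\u061F.,!?:;()-]")


def _expand_ligature(ch: str) -> str:
    """Expand Arabic presentation-form characters (e.g. lam-alef) via NFKC."""
    if "\uFB50" <= ch <= "\uFDFF" or "\uFE70" <= ch <= "\uFEFC":
        return unicodedata.normalize("NFKC", ch)
    return ch


def normalize(text: str, remove_punctuation: bool = False, remove_diacritics: bool = True,
              remove_tatweel: bool = True, normalize_ligature: bool = True) -> str:
    # Ligatures go first: some presentation forms expand into diacritics.
    if normalize_ligature:
        text = "".join(_expand_ligature(ch) for ch in text)
    if remove_diacritics:
        text = DIACRITICS.sub("", text)
    if remove_tatweel:
        text = text.replace(TATWEEL, "")
    if remove_punctuation:
        text = PUNCTUATION.sub("", text)
    # Collapse whitespace left behind by the removals.
    return " ".join(text.split())
```

The keyword defaults follow the documented ones, so calling `normalize(text)` strips diacritics and tatweel and expands ligatures while leaving punctuation in place.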

Output format.

This config dumps the final manifest at ${manifest_dir}/manifest.json and the wav files in ${manifest_dir}/audios. The output manifest contains the following fields:

audio_filepath (str): relative path to the audio file.
text (str): transcription.
duration (float): audio duration in seconds.
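
As a quick way to inspect the output, the sketch below loads a manifest and sums its audio duration. It assumes the manifest stores one JSON object per line, which is a common layout for such manifests but is not stated here; `read_manifest` and `total_hours` are hypothetical helper names.

```python
import json


def read_manifest(path: str) -> list:
    """Read a manifest, assuming one JSON object per non-empty line."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def total_hours(entries: list) -> float:
    """Sum the `duration` field (seconds) and convert to hours."""
    return sum(entry["duration"] for entry in entries) / 3600.0
```

Each loaded entry is a dict with the `audio_filepath`, `text`, and `duration` keys listed above.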

Config link: dataset_configs/arabic/mcv/config.yaml