FLEURS

This config creates a manifest for the Arabic subset of the FLEURS dataset.

The config performs the following data processing.

1. Replaces certain non-supported characters, abbreviations and punctuation marks with equivalent supported versions.
2. Drops any data with a word rate that is too high or too low.
3. Drops any data that contains symbols not in the supported alphabet.
4. Can be used to remove punctuation and diacritical marks.
5. Can be used to replace positional forms of Arabic letters with their general Unicode code points.
6. Can be used to normalize Arabic ligatures.
7. Can be used to remove the Quranic Tatweel mark.
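As a rough illustration of the optional normalization steps (diacritics, tatweel, positional forms and ligatures), here is a minimal Python sketch. It is not the config's actual implementation: the function name `normalize_arabic` and the diacritics range are assumptions chosen for the example, and the flag names only mirror the arguments documented on this page.

```python
import re
import unicodedata

# Assumed set of Arabic diacritics (tashkeel) for illustration only:
# U+064B..U+0652 plus the superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida)


def normalize_arabic(text, remove_diacritics=True, remove_tatweel=True, apply_nfkc=True):
    """Hypothetical helper showing the kind of cleanup the config describes."""
    if apply_nfkc:
        # NFKC maps positional (presentation) forms and ligatures
        # to their general letters, e.g. U+FEFB -> U+0644 U+0627.
        text = unicodedata.normalize("NFKC", text)
    if remove_diacritics:
        text = DIACRITICS.sub("", text)
    if remove_tatweel:
        text = text.replace(TATWEEL, "")
    return text


# Usage: a word with fatha marks and a tatweel is reduced to bare letters.
cleaned = normalize_arabic("\u0643\u064E\u0640\u062A\u064E\u0628\u064E")
```

Applying NFKC first means that positional forms are already mapped to general letters before the diacritics and tatweel are removed.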

Required arguments.

raw_dataset_dir: path to the tarred dataset.
workspace_dir: specify the workspace folder where all audio files will be stored.
data_split: should be one of “train”, “dev” or “test”.
remove_diacritics: specify whether to remove diacritical marks or not. Should be “True” or “False”. Defaults to True.
remove_tatweel: specify whether to remove tatweel marks or not. Should be “True” or “False”. Defaults to True.
normalize_ligature: specify whether to normalize ligatures or not. Should be “True” or “False”. Defaults to True.
apply_nfkc: specify whether to apply NFKC normalization to the text or not. See https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize for details. Defaults to True.
min_duration: minimum duration of a segment in seconds. Defaults to 0.1.
max_duration: maximum duration of a segment in seconds. Defaults to 20.
min_wordrate: minimum word rate. Defaults to 0.8.
max_wordrate: maximum word rate. Defaults to 3.
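The duration and word rate thresholds can be pictured with the following minimal Python sketch. It assumes that the word rate is the number of whitespace-separated words divided by the duration in seconds, and that the bounds are inclusive. Neither assumption is confirmed by this page, and `keep_entry` is a hypothetical helper, not part of the config.

```python
def keep_entry(entry, min_duration=0.1, max_duration=20.0,
               min_wordrate=0.8, max_wordrate=3.0):
    """Return True if a manifest entry passes the duration and word rate filters.

    Assumption: word rate = words per second of audio.
    """
    duration = entry["duration"]
    # Guard against zero or negative durations before dividing.
    if duration <= 0 or not (min_duration <= duration <= max_duration):
        return False
    wordrate = len(entry["text"].split()) / duration
    return min_wordrate <= wordrate <= max_wordrate


# Usage: four words over two seconds gives a word rate of 2.0, which is kept.
ok = keep_entry({"text": "one two three four", "duration": 2.0})
```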

Output format.

This config writes the final manifest to ${workspace_dir}/${data_split}/manifest.json and the wav files to ${workspace_dir}/${data_split}/audios. The output manifest contains the following fields:

audio_filepath (str): relative path to the audio files.
text (str): transcription.
duration (float): audio duration in seconds.
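A minimal sketch of loading the output manifest in Python is shown below. It assumes the manifest stores one JSON object per line, which this page does not state explicitly. `read_manifest` is a hypothetical helper name.

```python
import json


def read_manifest(manifest_path):
    """Read a manifest, assuming one JSON object per non-empty line."""
    entries = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def total_duration(entries):
    """Sum the duration field (in seconds) over all entries."""
    return sum(entry["duration"] for entry in entries)
```

Each returned entry is a dict with the audio_filepath, text and duration fields listed above.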

Config link: dataset_configs/arabic/fleurs/config.yaml