Tarteel AI’s Everyayah

The config performs the following data processing:

  1. Drops any entry whose text contains symbols outside the supported alphabet.

  2. Optionally removes punctuation and diacritical marks.

  3. Optionally replaces positional forms of Arabic letters with their general Unicode code points.

  4. Optionally normalizes Arabic ligatures.
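The text normalization steps above can be sketched in plain Python. This is a minimal illustration, not the config's actual implementation: the function names `normalize_arabic` and `uses_only_supported_symbols`, the diacritics range, and the order of the steps are assumptions made for this example. It relies only on the standard `re` and `unicodedata` modules, where NFKC normalization maps Arabic presentation forms and ligatures to their general letters.

```python
import re
import unicodedata

# Assumed character sets, chosen for illustration only.
# U+064B..U+0652 covers tanween, short vowels, shadda and sukun;
# U+0670 is the superscript alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
TATWEEL = "\u0640"  # elongation character


def normalize_arabic(text, remove_diacritics=True, remove_tatweel=True, apply_nfkc=True):
    """Hypothetical helper mirroring the normalization switches of the config."""
    if apply_nfkc:
        # NFKC replaces positional forms and ligatures with general letters,
        # e.g. U+FEFB (lam-alef ligature) becomes U+0644 U+0627.
        text = unicodedata.normalize("NFKC", text)
    if remove_diacritics:
        text = DIACRITICS.sub("", text)
    if remove_tatweel:
        text = text.replace(TATWEEL, "")
    return text


def uses_only_supported_symbols(text, alphabet):
    """Return True if every character of text is in the given alphabet (a set)."""
    return set(text) <= set(alphabet)
```

An entry whose normalized text fails `uses_only_supported_symbols` would be dropped in this sketch; the supported alphabet itself is defined by the config and is not reproduced here.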

Required arguments.

  • raw_dataset_dir: path to the tarred dataset.

  • workspace_dir: specify the workspace folder where all audio files will be stored.

  • data_split: the split to process. Should be one of “train”, “validation”, or “test”.

  • remove_punctuation: specify whether to remove punctuation or not. Should be “True” or “False”. Defaults to False.

  • remove_diacritics: specify whether to remove diacritical marks or not. Should be “True” or “False”. Defaults to True.

  • remove_tatweel: specify whether to remove tatweel marks or not. Should be “True” or “False”. Defaults to True.

  • normalize_ligature: specify whether to normalize ligature or not. Should be “True” or “False”. Defaults to True.

  • apply_nfkc: specify whether to apply NFKC normalization to the text. See https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize for details. Defaults to True.

  • min_duration: minimum duration of a segment in seconds. Defaults to 0.1s.

  • max_duration: maximum duration of a segment in seconds. Defaults to 20s.
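As a rough illustration of how min_duration and max_duration act on the data, the following sketch keeps only the entries whose duration falls within the two bounds. The function name `filter_by_duration` is hypothetical, and treating both bounds as inclusive is an assumption of this example rather than documented behavior.

```python
def filter_by_duration(entries, min_duration=0.1, max_duration=20.0):
    """Keep entries whose "duration" (in seconds) lies within the bounds.

    Both bounds are treated as inclusive here, which is an assumption.
    """
    return [
        entry for entry in entries
        if min_duration <= entry["duration"] <= max_duration
    ]


# Example with made-up entries: the first is too short, the last too long.
samples = [
    {"audio_filepath": "a.wav", "text": "", "duration": 0.05},
    {"audio_filepath": "b.wav", "text": "", "duration": 4.2},
    {"audio_filepath": "c.wav", "text": "", "duration": 31.0},
]
kept = filter_by_duration(samples)
```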

Output format.

This config writes the final manifest to ${workspace_dir}/${data_split}/manifest.json and the wav files to ${workspace_dir}/${data_split}/audios. The output manifest contains the following fields:

  • audio_filepath (str): relative path to the audio file.

  • text (str): transcription.

  • duration (float): audio duration in seconds.
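A manifest with these fields could be read as follows. This sketch assumes the manifest stores one JSON object per line, which is not stated above and should be checked against the actual output; `read_manifest` and `total_hours` are hypothetical helper names.

```python
import json


def read_manifest(manifest_path):
    """Read a manifest, assuming one JSON object per non-empty line."""
    entries = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def total_hours(entries):
    """Sum the "duration" field (seconds) of all entries and convert to hours."""
    return sum(entry["duration"] for entry in entries) / 3600.0
```

For instance, `total_hours(read_manifest(path))` gives a quick check of how much audio remained in a split after filtering.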

Config link: dataset_configs/arabic/everyayah/config.yaml