MCV Georgian
This config was originally designed for the Mozilla Common Voice (MCV) dataset 17.0 release, but it should also work for any subsequent release.
Note: Georgian is written in a unicameral script, meaning it has no uppercase and lowercase letters. All letters always use the same case.
During preprocessing, we use not only the validated data (train/dev/test) from MCV but also the unvalidated data (other.tsv), which requires more processing to produce clean results. The same data processing can be applied to the FLEURS Georgian data.
This config performs the following data processing:
1. Extracts and converts all data to the NeMo format.
2. Replaces certain non-supported characters and punctuation marks with equivalent supported versions.
3. Drops any entries from the current manifest whose text is already present in another manifest.
4. Drops any data that does not contain any Georgian letters.
5. Drops any data that contains symbols not in the supported alphabet.
6. Drops any data with an abnormally high or low character occurrence.
7. Drops any data with an abnormally high or low word occurrence.
8. Drops any data that has a duration of more than 18 seconds.
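The filtering steps above can be illustrated with a minimal Python sketch. It is not the code the config runs. The supported alphabet (Georgian letters, space, and ".,?"), the reading of "character/word occurrence" as characters or words per second, the rate bounds, and the names `keep_entry` and `within_rate` are all assumptions made for illustration. Only the 18-second limit comes from the description.

```python
# Modern Georgian (Mkhedruli) letters occupy U+10D0..U+10F0 in Unicode.
GEORGIAN_LETTERS = {chr(code) for code in range(0x10D0, 0x10F1)}
# Assumed supported alphabet: Georgian letters, space, and the punctuation ".,?".
SUPPORTED_CHARS = GEORGIAN_LETTERS | set(" .,?")
MAX_DURATION_SEC = 18.0  # taken from the description above


def within_rate(count, duration, low, high):
    """Return True if `count` per second lies inside [low, high]."""
    return low <= count / duration <= high


def keep_entry(entry, char_rate=(1.0, 25.0), word_rate=(0.1, 5.0)):
    """Decide whether a manifest entry survives the filters.

    `char_rate` and `word_rate` are hypothetical (low, high) bounds per second.
    """
    text, duration = entry["text"], entry["duration"]
    if duration <= 0 or duration > MAX_DURATION_SEC:
        return False
    # Drop entries without any Georgian letter.
    if not any(ch in GEORGIAN_LETTERS for ch in text):
        return False
    # Drop entries with symbols outside the assumed supported alphabet.
    if any(ch not in SUPPORTED_CHARS for ch in text):
        return False
    if not within_rate(len(text), duration, *char_rate):
        return False
    return within_rate(len(text.split()), duration, *word_rate)
```

For example, `keep_entry({"text": "გამარჯობა.", "duration": 1.5})` passes every check, while an entry containing Latin letters or lasting 19 seconds is rejected.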
Required arguments.
workspace_dir: specify the workspace folder where all audio files will be stored. You need to manually place the downloaded MCV Georgian data inside the <workspace dir>/raw_data/ subfolder.
data_split: should be “train”, “dev”, “test” or “other”.
Note: due to the text deduplication in step 3, we recommend processing the data in the following order: test, dev, train, other.
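To see why the order matters, here is a small Python sketch of cross-manifest text deduplication. It assumes that each split is compared against the splits processed before it, so that the evaluation splits keep their data and the overlap is removed from later splits. The function names and the in-memory representation are hypothetical and are not part of the config.

```python
PROCESSING_ORDER = ["test", "dev", "train", "other"]  # recommended order


def drop_seen_texts(entries, seen_texts):
    """Keep only entries whose text does not appear in an earlier manifest."""
    return [entry for entry in entries if entry["text"] not in seen_texts]


def deduplicate_splits(splits):
    """Deduplicate splits in the recommended order.

    `splits` maps a split name to a list of manifest entries (dicts with "text").
    """
    seen_texts = set()
    result = {}
    for name in PROCESSING_ORDER:
        kept = drop_seen_texts(splits.get(name, []), seen_texts)
        result[name] = kept
        # Texts of this split now block duplicates in the splits that follow.
        seen_texts.update(entry["text"] for entry in kept)
    return result
```

With this order, a sentence that occurs in both test and train stays in test and is removed from train, which keeps evaluation text out of the training data.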
Output format.
This config dumps the final manifest at ${workspace_dir}/${data_split}_manifest.json.
The output manifest contains the following fields:
audio_filepath (str): relative path to the audio files.
text (str): transcription, including punctuation “.,?”. Capitalization does not apply, since Georgian has no letter case.
duration (float): audio duration in seconds.
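A manifest in the NeMo format is a text file with one JSON object per line. The following Python sketch reads such a file and checks the three fields listed above. The helper names `read_manifest` and `is_valid_entry` are hypothetical, and accepting integer durations is an assumption made for leniency.

```python
import json


def read_manifest(path):
    """Read a JSON-lines manifest into a list of dicts, skipping blank lines."""
    entries = []
    with open(path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def is_valid_entry(entry):
    """Check that an entry has the documented fields with the expected types."""
    return (
        isinstance(entry.get("audio_filepath"), str)
        and isinstance(entry.get("text"), str)
        # bool is a subclass of int in Python, so exclude it explicitly.
        and isinstance(entry.get("duration"), (int, float))
        and not isinstance(entry.get("duration"), bool)
    )
```

A typical use would be `entries = read_manifest("test_manifest.json")` followed by `all(is_valid_entry(e) for e in entries)`, where the file name follows the `${data_split}_manifest.json` pattern.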
Config link: dataset_configs/georgian/mcv/config.yaml