Kazakh Speech Corpus 2

This config is designed for the Kazakh Speech Corpus 2 dataset. The dataset is available by request; the request form can be found on the website above.

It performs the following data processing steps:

Extracts and converts all data to the specified manifest format.
Gets audio durations and keeps only instances with a duration greater than 0.
Replaces certain punctuation marks and characters.
Converts visually identical Cyrillic letters to their Latin equivalents.
Drops any data that contains symbols not in the supported alphabet.
If required, removes punctuation marks and lowercases utterances.
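
The filtering and normalization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the config's actual processors: the function names, the tiny HOMOGLYPHS table and the alphabet passed in by the caller are all hypothetical, since the real mappings and supported alphabet are defined in the config itself.

```python
# Hypothetical, deliberately tiny look-alike table (Cyrillic -> Latin).
# The real mapping used by the config is not reproduced here.
HOMOGLYPHS = {"\u0430": "a", "\u0435": "e", "\u043e": "o"}

# Punctuation marks kept in the transcriptions, per the output format.
PUNCTUATION = ".,?"


def map_homoglyphs(text):
    """Replace look-alike Cyrillic letters with their Latin counterparts."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def process_entries(entries, alphabet, remove_pc=False):
    """Apply the duration, alphabet and optional P&C steps to manifest entries."""
    kept = []
    for entry in entries:
        # Keep only instances with a duration greater than 0.
        if entry["duration"] <= 0:
            continue
        text = map_homoglyphs(entry["text"])
        # Drop any utterance containing a symbol outside the alphabet.
        if any(ch not in alphabet for ch in text):
            continue
        # Optionally strip punctuation and lowercase the utterance.
        if remove_pc:
            text = "".join(ch for ch in text if ch not in PUNCTUATION).lower()
        kept.append({**entry, "text": text})
    return kept
```

The alphabet check runs before punctuation removal here, mirroring the order of the steps listed above; the alphabet passed in therefore needs to contain the punctuation marks and uppercase letters as well.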

Required arguments.

workspace_dir: specify the workspace folder where all audio files will be stored. You need to manually place the downloaded .tar files inside the <workspace dir> folder.
data_split: should be “train”, “dev” or “test”.

Note that you can customize any part of this config either directly or from the command line. Here are some common customizations to consider:

remove_pc: set to True if punctuation and capitalization (P&C) are not needed. Defaults to False.
remove_hyphen: set to True if hyphens are not needed. Defaults to False.
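
As a rough sketch of how the required arguments and optional customizations might be combined, the helper below assembles key=value override strings. It assumes the command line accepts Hydra-style key=value overrides, which this page does not state explicitly; build_overrides and the example path /data/ksc2 are hypothetical names used only for illustration.

```python
def build_overrides(workspace_dir, data_split, **extra):
    """Return key=value override strings (assumed Hydra-style syntax)."""
    # data_split should be "train", "dev" or "test".
    if data_split not in {"train", "dev", "test"}:
        raise ValueError(f"unexpected data_split: {data_split!r}")
    overrides = [f"workspace_dir={workspace_dir}", f"data_split={data_split}"]
    # Optional customizations such as remove_pc or remove_hyphen.
    overrides += [f"{key}={value}" for key, value in extra.items()]
    return overrides


args = build_overrides("/data/ksc2", "train", remove_pc=True)
print(" ".join(args))
# prints: workspace_dir=/data/ksc2 data_split=train remove_pc=True
```

The resulting strings would be appended to whatever command runs the config; validating data_split up front catches a mistyped split name before any processing starts.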

Output format.

This config dumps the final manifest at ${workspace_dir}/${data_split}_manifest.json. The output manifest contains the following fields:

audio_filepath (str): relative path to the audio file.
text (str): transcription, including punctuation “.,?” and capitalization.
duration (float): audio duration in seconds.
source (str): source of the utterance.
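
To inspect the output, a manifest with these fields can be loaded as follows. This sketch assumes the manifest stores one JSON object per line, which is a common manifest layout but is not confirmed on this page; read_manifest and total_hours are hypothetical helper names.

```python
import json


def read_manifest(path):
    """Read a manifest, assuming one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def total_hours(entries):
    """Sum the duration field (seconds) and convert to hours."""
    return sum(entry["duration"] for entry in entries) / 3600.0
```

For example, total_hours(read_manifest(path)) gives a quick sanity check of how much audio survived the filtering steps for a given split.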

Config link: dataset_configs/kazakh/ksc2/config.yaml