VoxPopuli Italian
This config can be used to prepare VoxPopuli Italian data in the NeMo format with restored punctuation and capitalization (P&C).
It performs the following data processing steps:
Installs requirements and runs the scripts from facebookresearch/voxpopuli to get initial data.
Converts all audio files to WAV format and generates an initial NeMo manifest file.
Original VoxPopuli data has P&C, but in a non-normalized format. We match the normalized and non-normalized versions to restore P&C in the normalized form.
Replaces certain non-supported characters and punctuation marks with equivalent supported versions.
Drops any data that contains symbols not in the supported alphabet.
For the training subset, the following additional filtering is performed:
Runs ASR inference with an older model and drops all utterances that contain more than 5 consecutive word insertions or deletions. A threshold of 5 was found to work well for filtering out incorrect transcriptions.
Drops all utterances with duration less than 1.5 seconds, as they are often incorrectly transcribed.
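The text cleanup and training-set filters described above can be sketched in plain Python. This is only an illustration, not the processors the config actually runs: the replacement table, the alphabet, and all function names are hypothetical, and the word alignment uses `difflib.SequenceMatcher` as a stand-in for a proper edit-distance alignment.

```python
import re
from difflib import SequenceMatcher

# Hypothetical values for illustration; the real config defines its own.
REPLACEMENTS = {"’": "'", "…": "."}
LETTERS = "abcdefghijklmnopqrstuvwxyzàèéìíîòóùú"
ALLOWED_CHARS = set(LETTERS + LETTERS.upper() + " '.,?")

# Thresholds taken from the description above.
MAX_CONSECUTIVE_EDITS = 5
MIN_DURATION_SEC = 1.5


def normalize_text(text):
    """Replace unsupported characters with supported equivalents."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text


def has_only_supported_chars(text):
    """Return True if every character belongs to the supported alphabet."""
    return set(text) <= ALLOWED_CHARS


def _words(text):
    # Compare lower-cased words without punctuation (an assumption).
    return re.sub(r"[.,?]", "", text.lower()).split()


def longest_insertion_or_deletion_run(reference, hypothesis):
    """Length of the longest run of consecutive inserted or deleted words.

    Spans that difflib reports as "replace" are treated as substitutions
    and ignored here.
    """
    matcher = SequenceMatcher(a=_words(reference), b=_words(hypothesis), autojunk=False)
    longest = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            longest = max(longest, i2 - i1)
        elif tag == "insert":
            longest = max(longest, j2 - j1)
    return longest


def keep_training_entry(entry, asr_hypothesis):
    """Apply the duration filter and the consecutive-edit filter."""
    if entry["duration"] < MIN_DURATION_SEC:
        return False
    run = longest_insertion_or_deletion_run(entry["text"], asr_hypothesis)
    return run <= MAX_CONSECUTIVE_EDITS
```

Here `entry` is one manifest record (a dict with `text` and `duration`), and `asr_hypothesis` is the transcript predicted by the ASR model for the same audio.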
Required arguments.
workspace_dir: specify the workspace folder where all audio files will be stored.
data_split: can be “train”, “dev” or “test”.
Note that you can customize any part of this config either directly or from the command line. Here are some common customizations to consider:
remove_pc: set to True if P&C is not needed. Defaults to False.
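As a rough illustration of passing these arguments from the command line, the sketch below assembles an invocation as a list of strings. It assumes the tool is launched through a `main.py` script that accepts Hydra-style `--config-path`, `--config-name`, and `key=value` overrides; that entry point and the helper name `build_sdp_command` are assumptions, so check the tool's own usage instructions before relying on them.

```python
import shlex

VALID_SPLITS = {"train", "dev", "test"}


def build_sdp_command(workspace_dir, data_split, remove_pc=False, main_script="main.py"):
    """Assemble a command line with the required arguments as overrides.

    The main.py entry point and the Hydra-style flags are assumptions,
    not something this page confirms.
    """
    if data_split not in VALID_SPLITS:
        raise ValueError(f"data_split must be one of {sorted(VALID_SPLITS)}")
    return [
        "python",
        main_script,
        "--config-path=dataset_configs/italian/voxpopuli/",
        "--config-name=config.yaml",
        f"workspace_dir={workspace_dir}",
        f"data_split={data_split}",
        f"remove_pc={remove_pc}",
    ]


if __name__ == "__main__":
    # Print a shell-quoted version of the command for inspection.
    print(shlex.join(build_sdp_command("/data/voxpopuli_it", "train")))
```

The resulting list can be passed to `subprocess.run` as-is; `/data/voxpopuli_it` is a placeholder workspace path.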
Output format.
This config dumps the final manifest at ${workspace_dir}/${data_split}_manifest.json.
The output manifest contains the following fields:
audio_filepath (str): relative path to the audio file.
text (str): transcription, including punctuation “.,?” and capitalization.
duration (float): audio duration in seconds.
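A minimal sketch of reading the output manifest and checking the fields listed above. It relies on the usual NeMo manifest layout of one JSON object per line; the helper names `manifest_path` and `read_manifest` are hypothetical.

```python
import json
from pathlib import Path

# Field names and types as listed above; integer durations are also accepted.
EXPECTED_FIELDS = {
    "audio_filepath": str,
    "text": str,
    "duration": (int, float),
}


def manifest_path(workspace_dir, data_split):
    """Build the output location ${workspace_dir}/${data_split}_manifest.json."""
    return Path(workspace_dir) / f"{data_split}_manifest.json"


def read_manifest(path):
    """Read a manifest with one JSON object per line and validate its fields."""
    entries = []
    with open(path, encoding="utf-8") as manifest_file:
        for line_number, line in enumerate(manifest_file, start=1):
            if not line.strip():
                continue  # skip blank lines
            entry = json.loads(line)
            for field, field_type in EXPECTED_FIELDS.items():
                if not isinstance(entry.get(field), field_type):
                    raise ValueError(f"line {line_number}: bad or missing field '{field}'")
            entries.append(entry)
    return entries
```

For example, `sum(e["duration"] for e in read_manifest(manifest_path(workspace_dir, "train"))) / 3600` gives the total number of hours in the training manifest.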
Config link: dataset_configs/italian/voxpopuli/config.yaml