VoxPopuli Italian

This config can be used to prepare VoxPopuli Italian data in the NeMo format with restored punctuation and capitalization (P&C).

It performs the following data processing steps:

  1. Installs requirements and runs the scripts from facebookresearch/voxpopuli to get initial data.

  2. Converts all audio files into the wav format and generates an initial NeMo manifest file.

  3. Restores P&C in the normalized text. The original VoxPopuli data has P&C, but only in a non-normalized format, so the normalized and non-normalized versions are matched to carry P&C over to the normalized form.

  4. Replaces certain non-supported characters and punctuation marks with equivalent supported versions.

  5. Drops any data that contains symbols not in the supported alphabet.

  6. For the training subset, the following additional filtering is performed:

    1. Runs ASR inference with an older model and drops all utterances which contain more than 5 consecutive word insertions or deletions. 5 was found to be a good threshold to filter out incorrect transcriptions.

    2. Drops all utterances with duration less than 1.5 seconds, as they are often incorrectly transcribed.
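The two training-subset filters above can be illustrated with a minimal sketch. This is not the config's actual implementation: it approximates the word-level alignment with Python's standard `difflib.SequenceMatcher`, and the helper names (`longest_edit_run`, `keep_training_utterance`) as well as the `pred_text` field holding the ASR output are assumptions made for this example. Only the thresholds (5 consecutive insertions or deletions, 1.5 seconds) come from the description.

```python
from difflib import SequenceMatcher

# Thresholds taken from the description above.
MAX_CONSECUTIVE_EDITS = 5
MIN_DURATION_SEC = 1.5


def longest_edit_run(reference: str, hypothesis: str) -> int:
    """Return the longest run of consecutive word insertions or deletions.

    Simplification: difflib is used as a stand-in for a proper ASR
    alignment, and "replace" spans (substitutions) are ignored.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    matcher = SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    longest = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # Words present in the reference but missing from the hypothesis.
            longest = max(longest, i2 - i1)
        elif tag == "insert":
            # Words present in the hypothesis but missing from the reference.
            longest = max(longest, j2 - j1)
    return longest


def keep_training_utterance(entry: dict) -> bool:
    """Apply both filters to one manifest entry.

    Assumes `text` is the transcript, `pred_text` (hypothetical field name)
    is the ASR prediction, and `duration` is in seconds.
    """
    if entry["duration"] < MIN_DURATION_SEC:
        return False
    run = longest_edit_run(entry["text"], entry["pred_text"])
    return run <= MAX_CONSECUTIVE_EDITS
```

An utterance is kept only if it passes both checks, so a long recording whose transcript is missing a block of more than five words relative to the ASR output is dropped just like a very short one.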

Required arguments.

  • workspace_dir: specify the workspace folder where all audio files will be stored.

  • data_split: can be “train”, “dev” or “test”.

Note that you can customize any part of this config, either directly or from the command line. Here are some common customizations to consider:

  • remove_pc: set to True if P&C is not needed. Defaults to False.

Output format.

This config writes the final manifest to ${workspace_dir}/${data_split}_manifest.json. The output manifest contains the following fields:

  • audio_filepath (str): relative path to the audio file.

  • text (str): transcription, including punctuation “.,?” and capitalization.

  • duration (float): audio duration in seconds.
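As an illustration of how the output might be consumed, here is a minimal sketch that loads the manifest and sums the audio duration. It assumes the usual NeMo manifest layout of one JSON object per line; the helper names `read_manifest` and `total_hours` are made up for this example and are not part of the config.

```python
import json
from pathlib import Path


def read_manifest(path):
    """Load a manifest with one JSON object per line into a list of dicts."""
    entries = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries


def total_hours(entries):
    """Sum the `duration` field (seconds) and convert to hours."""
    return sum(entry["duration"] for entry in entries) / 3600.0
```

Each returned entry is a dict with the `audio_filepath`, `text`, and `duration` keys described above.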

Config link: dataset_configs/italian/voxpopuli/config.yaml