MLS Italian (no P&C)


This config prepares the Multilingual LibriSpeech (MLS) Italian data in the NeMo format, without punctuation or capitalization (P&C).

It performs the following data processing steps:

  1. Downloads and extracts the data from the original website.

  2. Converts all audio files to the WAV format and generates an initial NeMo manifest file.

  3. Lower-cases text and removes all punctuation markers.

  4. Drops any data that contains symbols not in the supported alphabet.

  5. For the training subset, the following additional filtering is performed:

    1. Runs ASR inference with an older model and drops all utterances that contain more than 5 consecutive word insertions or deletions. A threshold of 5 was found to work well for filtering out incorrect transcriptions.
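
Steps 3, 4 and 5.1 can be illustrated with a minimal Python sketch. This is not the code the config runs: the alphabet below is an assumed set of lower-case Italian letters, all function names are hypothetical, and the word alignment uses Python's difflib instead of whatever alignment the actual pipeline uses. Substituted words are ignored when counting runs.

```python
import difflib
import string

# ASSUMPTION: lower-case Latin letters, common Italian accented vowels, and space.
# The alphabet actually supported by the config may differ.
ALPHABET = set("abcdefghijklmnopqrstuvwxyzàèéìíòóùú ")


def normalize(text: str) -> str:
    """Lower-case the text, delete ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def in_alphabet(text: str) -> bool:
    """Return True if every character of the text is in the assumed alphabet."""
    return all(ch in ALPHABET for ch in text)


def max_consecutive_ins_del(reference: str, hypothesis: str) -> int:
    """Longest run of consecutively inserted or deleted words.

    The reference is the dataset transcription and the hypothesis is the ASR
    output. Words are aligned with difflib, which is an approximation of an
    edit-distance alignment.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    longest = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":    # words in the reference that the ASR output lacks
            longest = max(longest, i2 - i1)
        elif tag == "insert":  # words in the ASR output missing from the reference
            longest = max(longest, j2 - j1)
        # "replace" blocks are substitutions and are ignored in this sketch
    return longest


def keep_utterance(reference: str, hypothesis: str, threshold: int = 5) -> bool:
    """Keep the utterance unless it has more than `threshold` consecutive errors."""
    return max_consecutive_ins_del(reference, hypothesis) <= threshold


if __name__ == "__main__":
    print(normalize("Ciao, mondo!"))                    # ciao mondo
    print(in_alphabet("ciao 123"))                      # False
    print(keep_utterance("a b c d e f g h", "a h"))     # False: 6 deleted words in a row
```

Note that deleting every ASCII punctuation character also removes apostrophes, which may not match how the real config treats elided forms such as "l'amico".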

Required arguments.

  • workspace_dir: specify the workspace folder where all audio files will be stored.

  • data_split: can be “train”, “dev” or “test”.

Note that you can customize any part of this config, either by editing it directly or by overriding values from the command line.
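
As an illustration of passing the required arguments, the sketch below assembles a launch command. It assumes the tool is started through a Hydra-style entry point named main.py that accepts `--config-path` and `--config-name` plus `key=value` overrides; the script name, its location, and the helper `build_sdp_command` are assumptions, so check the repository for the exact invocation.

```python
import shlex
import sys

VALID_SPLITS = {"train", "dev", "test"}


def build_sdp_command(workspace_dir, data_split, main_script="main.py", overrides=()):
    """Assemble a command line for running this config (hypothetical helper).

    `main_script` is an ASSUMED entry-point name. `overrides` are extra
    `key=value` strings that customize other parts of the config.
    """
    if data_split not in VALID_SPLITS:
        raise ValueError(f"data_split must be one of {sorted(VALID_SPLITS)}")
    return [
        sys.executable,
        main_script,
        "--config-path=dataset_configs/italian/mls",
        "--config-name=config_nopc.yaml",
        f"workspace_dir={workspace_dir}",
        f"data_split={data_split}",
        *overrides,
    ]


if __name__ == "__main__":
    cmd = build_sdp_command("/data/mls_italian", "train")
    # Print the command instead of running it; pass `cmd` to subprocess.run
    # once the entry point has been confirmed.
    print(shlex.join(cmd))
```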

Output format.

This config dumps the final manifest at ${workspace_dir}/${data_split}_manifest.json. The output manifest contains the following fields:

  • audio_filepath (str): relative path to the audio files.

  • text (str): transcription, lower-cased and with all punctuation removed.

  • duration (float): audio duration in seconds.
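
NeMo manifests store one JSON object per line, so the output can be inspected with a few lines of Python. The sketch below is a hedged example: the helper names and the path in the usage section are hypothetical, and only the three fields listed above are assumed to be present.

```python
import json
from typing import Dict, Iterable, List


def parse_manifest(lines: Iterable[str]) -> List[Dict]:
    """Parse manifest lines (one JSON object per line), skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def total_duration_seconds(entries: List[Dict]) -> float:
    """Sum the `duration` field over all manifest entries."""
    return sum(entry["duration"] for entry in entries)


if __name__ == "__main__":
    # Hypothetical path: substitute your own workspace_dir and data_split.
    manifest_path = "/data/mls_italian/train_manifest.json"
    with open(manifest_path, encoding="utf-8") as f:
        entries = parse_manifest(f)
    print(f"{len(entries)} utterances, {total_duration_seconds(entries) / 3600:.2f} hours")
    print(entries[0]["audio_filepath"], entries[0]["text"])
```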

Config link: dataset_configs/italian/mls/config_nopc.yaml