MLS Italian (no P&C)

This config can be used to prepare Multilingual LibriSpeech Italian data in the NeMo format without punctuation or capitalization (P&C).

It performs the following data processing.

Downloads and extracts the data from the original website.
Converts all audio files to the WAV format and generates an initial NeMo manifest file.
Lower-cases text and removes all punctuation markers.
Drops any data that contains symbols not in the supported alphabet.
For the training subset, the following additional filtering is performed:
1. Runs ASR inference with an older model and drops all utterances that contain more than 5 consecutive word insertions or deletions. A threshold of 5 was found to work well for filtering out incorrect transcriptions.
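
The text clean-up and filtering steps above can be illustrated with a short, self-contained Python sketch. It is not the code this config runs. The alphabet, the punctuation set, and all function names are assumptions made for illustration, and the word alignment uses the standard-library `difflib` as a rough stand-in for the edit-distance alignment a real ASR-based filter would use.

```python
import difflib
import string

# Assumed alphabet, for illustration only; the real config defines its own.
ALPHABET = set("abcdefghijklmnopqrstuvwxyzàèéìíîòóùú ")

# ASCII punctuation plus a few typographic marks (an illustrative choice).
PUNCTUATION = string.punctuation + "«»…"


def normalize(text: str) -> str:
    """Lower-case the text and replace punctuation with spaces."""
    table = str.maketrans({ch: " " for ch in PUNCTUATION})
    # split/join collapses the repeated spaces left behind by removed marks.
    return " ".join(text.lower().translate(table).split())


def in_alphabet(text: str) -> bool:
    """Return True if every character belongs to the assumed alphabet."""
    return all(ch in ALPHABET for ch in text)


def max_consecutive_ins_del(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return the longest runs of inserted and deleted words.

    `reference` is the dataset transcript and `hypothesis` is the ASR output.
    difflib only approximates a true edit-distance alignment.
    """
    ref_words, hyp_words = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)
    max_ins = max_del = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            max_ins = max(max_ins, j2 - j1)
        elif tag == "delete":
            max_del = max(max_del, i2 - i1)
    return max_ins, max_del


def keep_utterance(reference: str, hypothesis: str, threshold: int = 5) -> bool:
    """Keep an utterance unless a run of insertions or deletions exceeds the threshold."""
    max_ins, max_del = max_consecutive_ins_del(reference, hypothesis)
    return max_ins <= threshold and max_del <= threshold
```

In this sketch an utterance would be dropped if `in_alphabet(normalize(text))` is false or, for the training subset only, if `keep_utterance(text, asr_output)` is false.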

Required arguments.

workspace_dir: specify the workspace folder where all audio files will be stored.
data_split: can be “train”, “dev” or “test”.

Note that you can customize any part of this config either directly or from the command line.
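
As a rough illustration of how the two required arguments fit together, the sketch below validates `data_split` and builds the manifest location following the `${workspace_dir}/${data_split}_manifest.json` pattern used by this config. The helper name `manifest_path` and the example path are hypothetical and are not part of the tool itself.

```python
from pathlib import Path

# The three splits this config accepts.
VALID_SPLITS = ("train", "dev", "test")


def manifest_path(workspace_dir: str, data_split: str) -> Path:
    """Return the expected location of the final manifest (hypothetical helper)."""
    if data_split not in VALID_SPLITS:
        raise ValueError(
            f"data_split must be one of {VALID_SPLITS}, got {data_split!r}"
        )
    return Path(workspace_dir) / f"{data_split}_manifest.json"


if __name__ == "__main__":
    # Example workspace path; replace with your own folder.
    print(manifest_path("/data/mls_italian", "dev"))
```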

Output format.

This config dumps the final manifest at ${workspace_dir}/${data_split}_manifest.json. The output manifest contains the following fields: