MLS Italian (with P&C)

This config is the same as MLS Italian (no P&C), except that it adds punctuation and capitalization (P&C). The MLS dataset does not contain any P&C, so it must be restored from the original source or generated synthetically. The config matches the transcripts against the original source books and restores P&C for as much data as possible. When restoration fails for an utterance, three options are supported (controlled with the pc_mode parameter):

  • full_drop: drop the utterances for which P&C was not restored. This is currently the recommended (and default) option, as it results in the best overall performance.

  • synthetic: generate synthetic P&C with a pretrained NeMo model. Specify which model to use with the pc_model_path parameter. The recommended model for this config is punctuationcapitalization_it_it_bert_base (subject to the Riva license listed on the website).

  • ignore: keep the original audio and text even though P&C is missing. Note that this results in a mismatch, because some data will contain P&C and some will not. Even so, it was found experimentally to yield good performance.
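To make the three fallback behaviors concrete, here is a minimal Python sketch of how each mode might treat an utterance whose P&C could not be restored. It is an illustration only, not the config's actual implementation: the helper name apply_pc_mode, the restored_text key, and the synth_fn callable (a stand-in for the pretrained P&C model) are all hypothetical, and the mapping of each branch to a text_origin value is an assumption based on the output fields described on this page.

```python
from typing import Callable, Dict, Iterable, List, Optional

VALID_PC_MODES = ("full_drop", "synthetic", "ignore")


def apply_pc_mode(
    entries: Iterable[Dict],
    pc_mode: str = "full_drop",
    synth_fn: Optional[Callable[[str], str]] = None,
) -> List[Dict]:
    """Hypothetical helper illustrating the three pc_mode fallbacks.

    Each entry has a "text" key and, if restoration from the source
    book succeeded, a "restored_text" key (an assumed name).
    """
    if pc_mode not in VALID_PC_MODES:
        raise ValueError(f"pc_mode must be one of {VALID_PC_MODES}, got {pc_mode!r}")
    if pc_mode == "synthetic" and synth_fn is None:
        raise ValueError("pc_mode='synthetic' needs a P&C model (synth_fn here)")

    output = []
    for entry in entries:
        # Carry over all other fields unchanged.
        base = {k: v for k, v in entry.items() if k != "restored_text"}
        restored = entry.get("restored_text")
        if restored is not None:
            # Restoration succeeded: the mode does not matter.
            output.append({**base, "text": restored, "text_origin": "original"})
        elif pc_mode == "full_drop":
            continue  # drop the utterance entirely
        elif pc_mode == "synthetic":
            output.append({**base, "text": synth_fn(entry["text"]), "text_origin": "synthetic"})
        else:  # "ignore": keep the text without P&C
            output.append({**base, "text": entry["text"], "text_origin": "no_pc"})
    return output


entries = [
    {"text": "ciao come stai", "restored_text": "Ciao, come stai?"},
    {"text": "buongiorno a tutti"},  # restoration failed for this one
]


def toy_pc(text: str) -> str:
    """Toy stand-in for a real P&C model."""
    return text.capitalize() + "."


dropped = apply_pc_mode(entries, "full_drop")
synthetic = apply_pc_mode(entries, "synthetic", toy_pc)
ignored = apply_pc_mode(entries, "ignore")
```

With full_drop only the first utterance survives, while synthetic and ignore keep both and differ only in how the second one's text is filled in.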

Note

We ran experiments with all three options for the pc_mode parameter and found that full_drop results in the best performance. However, the gap between the three is small, so if you change the data pre-processing or adapt this config to other languages, we recommend trying all three options and comparing which one works best for your use case.

In addition to the arguments of the MLS Italian (no P&C) config, the following new arguments are supported:

  • pc_mode (str): can be “full_drop”, “synthetic” or “ignore”. See above for the description of each option. Defaults to “full_drop”.

  • pc_model_path (str): path to the P&C NeMo model to use for synthetic P&C generation. Only applicable if pc_mode=synthetic.
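The relationship between the two arguments can be summarized as a small validation sketch in Python. The function name resolve_pc_args and the example path are hypothetical, and the config itself may check these arguments differently; the sketch only encodes the rules stated above (three allowed modes, a default of full_drop, and pc_model_path being relevant only for synthetic).

```python
from typing import Dict, Optional

ALLOWED_PC_MODES = {"full_drop", "synthetic", "ignore"}


def resolve_pc_args(
    pc_mode: str = "full_drop",
    pc_model_path: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Hypothetical check of the pc_mode / pc_model_path combination."""
    if pc_mode not in ALLOWED_PC_MODES:
        raise ValueError(f"Unknown pc_mode: {pc_mode!r}")
    if pc_mode == "synthetic":
        if not pc_model_path:
            raise ValueError("pc_model_path is required when pc_mode=synthetic")
        return {"pc_mode": pc_mode, "pc_model_path": pc_model_path}
    # For the other modes the model path is not applicable, so it is ignored here.
    return {"pc_mode": pc_mode, "pc_model_path": None}


default_args = resolve_pc_args()
# The path below is a placeholder, not a real location.
synthetic_args = resolve_pc_args("synthetic", "/path/to/pc_model.nemo")
```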

The output manifest is the same, except that the text field contains P&C and there is a new text_origin field that records the source of the P&C for a given utterance (can be “original”, “synthetic” or “no_pc”).
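The text_origin field makes it easy to check how much of the data received restored, synthetic, or no P&C. The Python sketch below assumes the manifest is stored as JSON lines (one JSON object per line); the sample entries, file names, and durations are made up for illustration, and only the text and text_origin fields are taken from the description above.

```python
import json
from collections import Counter
from typing import Iterable

# Made-up manifest lines; in practice these would be read from the manifest file.
manifest_lines = [
    '{"audio_filepath": "a.wav", "duration": 3.2, "text": "Ciao, come stai?", "text_origin": "original"}',
    '{"audio_filepath": "b.wav", "duration": 2.1, "text": "Buongiorno a tutti.", "text_origin": "synthetic"}',
    '{"audio_filepath": "c.wav", "duration": 4.0, "text": "Va bene, grazie.", "text_origin": "original"}',
]


def count_text_origins(lines: Iterable[str]) -> Counter:
    """Count manifest entries per text_origin value, skipping blank lines."""
    return Counter(
        json.loads(line)["text_origin"] for line in lines if line.strip()
    )


origin_counts = count_text_origins(manifest_lines)

# Keep only utterances whose P&C came from the original source books.
original_only = [
    entry
    for entry in map(json.loads, manifest_lines)
    if entry["text_origin"] == "original"
]
```

The same filter could be used, for example, to build an evaluation subset that contains only P&C restored from the source books.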

Config link: dataset_configs/italian/mls/config.yaml