MLS Italian (with P&C)
This config is the same as MLS Italian (no P&C), except that it adds punctuation and capitalization (P&C). The MLS dataset does not contain any P&C, so it has to be either restored from the original source or generated synthetically. We match the utterances against the original source books and restore P&C for as much data as possible. When this restoration fails for an utterance, we support 3 options (controlled with the pc_mode parameter):
full_drop: we can fully drop the utterances for which P&C was not restored. This is currently the recommended (and default) option as it results in the best overall performance.
synthetic: we can generate synthetic P&C by using a pretrained NeMo model. Make sure to specify which model to use with the pc_model_path parameter. The recommended model for this config is punctuationcapitalization_it_it_bert_base (subject to the Riva license listed on the website).
ignore: we can just ignore the missing P&C and still use the original audio and text. Note that this results in a mismatch, because some data will contain P&C and some will not. Still, it was found experimentally to yield good performance.
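The three modes above can be sketched as a small decision function. This is only an illustration of the described behavior, not the processor code used by the config: the function name resolve_utterance, its arguments, and the toy synthetic_pc callable (a stand-in for a pretrained P&C model) are all hypothetical.

```python
from typing import Callable, Optional

VALID_PC_MODES = ("full_drop", "synthetic", "ignore")


def resolve_utterance(
    plain_text: str,
    restored_text: Optional[str],
    pc_mode: str = "full_drop",
    synthetic_pc: Optional[Callable[[str], str]] = None,
) -> Optional[dict]:
    """Return a manifest-style entry, or None if the utterance is dropped.

    restored_text is the text with P&C recovered from the source book,
    or None when the restoration failed (hypothetical convention).
    """
    if pc_mode not in VALID_PC_MODES:
        raise ValueError(f"pc_mode must be one of {VALID_PC_MODES}, got {pc_mode!r}")

    # Restoration succeeded: all three modes keep the restored text.
    if restored_text is not None:
        return {"text": restored_text, "text_origin": "original"}

    # Restoration failed: the behavior depends on pc_mode.
    if pc_mode == "full_drop":
        return None  # the utterance is removed from the dataset
    if pc_mode == "synthetic":
        if synthetic_pc is None:
            raise ValueError("pc_mode='synthetic' requires a P&C model (pc_model_path)")
        return {"text": synthetic_pc(plain_text), "text_origin": "synthetic"}
    # pc_mode == "ignore": keep the text without P&C
    return {"text": plain_text, "text_origin": "no_pc"}


# Toy stand-in for a P&C model, for illustration only.
toy_model = lambda s: s.capitalize() + "."

print(resolve_utterance("ciao mondo", "Ciao, mondo!"))
print(resolve_utterance("ciao mondo", None, "full_drop"))
print(resolve_utterance("ciao mondo", None, "synthetic", toy_model))
print(resolve_utterance("ciao mondo", None, "ignore"))
```

Note that the mode only matters for utterances whose P&C could not be restored from the source books; successfully matched utterances are treated the same way in all three modes.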
Note
We ran experiments with all 3 options for the pc_mode parameter and found that full_drop results in the best performance. However, the gap between the 3 options is small, so if you change the data pre-processing or adapt this config for other languages, we recommend trying all 3 options and comparing which one works best for your use case.
In addition to the arguments of the MLS Italian (no P&C) config, the following new arguments are supported:
pc_mode (str): can be “full_drop”, “synthetic” or “ignore”. See above for the description of each option. Defaults to “full_drop”.
pc_model_path (str): path to the P&C NeMo model to use for synthetic P&C generation. Only applicable if pc_mode=synthetic.
The output manifest is the same, except that the text field contains P&C and there is a new text_origin field containing the source of the P&C for a given utterance (can be “original”, “synthetic” or “no_pc”).
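To check how much of the data received each kind of P&C, the text_origin field can be tallied. The sketch below assumes the output manifest is stored as one JSON object per line, which is not stated in this section; the function name count_text_origins and the sample entries are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def count_text_origins(manifest_lines: Iterable[str]) -> Counter:
    """Count manifest entries per text_origin value (assumes JSON lines)."""
    counts = Counter()
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        # Entries without the field are grouped under "missing".
        counts[entry.get("text_origin", "missing")] += 1
    return counts


# Made-up sample entries; a real manifest would also carry the other output fields.
sample = [
    '{"text": "Ciao, mondo!", "text_origin": "original"}',
    '{"text": "Buongiorno a tutti.", "text_origin": "original"}',
    '{"text": "Come stai.", "text_origin": "synthetic"}',
    '{"text": "grazie mille", "text_origin": "no_pc"}',
]

for origin, n in sorted(count_text_origins(sample).items()):
    print(f"{origin}: {n}")
```

For a real manifest, an open file object can be passed directly, since iterating over it yields lines. With pc_mode=full_drop, every remaining entry should be “original”.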
Config link: dataset_configs/italian/mls/config.yaml