API#
Available processors#
Here is the full list of all available processors and their supported arguments.
Note
All SDP processors optionally accept input_manifest_file and output_manifest_file keys. See Special fields section for more details.
Dataset-specific processors#
MCV#
- sdp.processors.CreateInitialManifestMCV[source]#
Processor to create initial manifest for the Mozilla Common Voice (MCV) dataset.
Dataset link: https://commonvoice.mozilla.org/
Extracts raw MCV data for the specified language and creates an initial manifest using the transcripts provided in the raw data.
- Parameters:
raw_data_dir (str) – the path to the directory containing the raw data archive file. Needs to be manually downloaded from https://commonvoice.mozilla.org/.
extract_archive_dir (str) – directory where the extracted data will be saved.
resampled_audio_dir (str) – directory where the resampled audio will be saved.
data_split (str) – “train”, “dev” or “test”.
language_id (str) – the ID of the language of the data. E.g., “en”, “es”, “it”, etc.
already_extracted (bool) – if True, we will not try to extract the raw data. Defaults to False.
target_samplerate (int) – sample rate (Hz) to use for resampling. Defaults to 16000.
target_nchannels (int) – number of channels to create during resampling process. Defaults to 1.
- Returns:
This processor generates an initial manifest file with the following fields:
{ "audio_filepath": <path to the audio file>, "duration": <duration of the audio in seconds>, "text": <transcription (with capitalization and punctuation)>, }
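The fields listed above can be inspected with a few lines of standard-library Python. The following is a minimal sketch, assuming the manifest is stored in the NeMo-style layout of one JSON object per line; the file name, the sample entries, and the helper names `read_manifest` and `total_duration_hours` are hypothetical and not part of SDP.

```python
import json
import tempfile
from pathlib import Path


def read_manifest(path):
    """Read a JSON-lines manifest into a list of dicts (one dict per utterance)."""
    entries = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                entries.append(json.loads(line))
    return entries


def total_duration_hours(entries):
    """Sum the "duration" field (seconds) and convert the total to hours."""
    return sum(e["duration"] for e in entries) / 3600.0


# Hypothetical two-entry manifest, written to a temporary file for illustration.
sample = [
    {"audio_filepath": "clips/a.wav", "duration": 1800.0, "text": "Hello, world."},
    {"audio_filepath": "clips/b.wav", "duration": 1800.0, "text": "Good morning!"},
]
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "manifest.json"
    manifest_path.write_text("\n".join(json.dumps(e) for e in sample), encoding="utf-8")
    entries = read_manifest(manifest_path)

print(len(entries), total_duration_hours(entries))
```

A check like this is a quick way to confirm how much audio an initial manifest covers before adding further processors.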
MLS#
- sdp.processors.CreateInitialManifestMLS[source]
Processor to create initial manifest for the Multilingual LibriSpeech (MLS) dataset.
Dataset link: https://www.openslr.org/94/
Downloads and unzips raw MLS data for the specified language, and creates an initial manifest using the transcripts provided in the raw data.
- Parameters:
raw_data_dir (str) – the directory where the downloaded data will be/is saved. This is also where the extracted and processed data will be.
language (str) – the language of the data you wish to be downloaded. This will be used to format the URL from which we attempt to download the data. E.g., “english”, “italian”, “spanish”, etc.
data_split (str) – “train”, “dev” or “test”.
resampled_audio_dir (str or None) – if specified, the directory where the resampled wav files will be stored. If not specified, the audio will not be resampled and the parameters target_samplerate and target_nchannels will be ignored.
target_samplerate (int) – sample rate (Hz) to use for resampling. This parameter will be ignored if resampled_audio_dir is None. Defaults to 16000.
target_nchannels (int) – number of channels to create during resampling process. This parameter will be ignored if resampled_audio_dir is None. Defaults to 1.
use_opus_archive (bool) – if True, will use the version of the archive file which contains audio files saved in the OPUS format, instead of FLAC. The OPUS files take up less memory than the FLAC files, at the cost of the OPUS files being lower quality than the FLAC files. If True, the parameter resampled_audio_dir must be None, as resampling OPUS audio files is currently not supported. Defaults to False.
- Returns:
This processor generates an initial manifest file with the following fields:
{ "audio_filepath": <path to the audio file>, "duration": <duration of the audio in seconds>, "text": <transcription>, }
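The interaction between resampled_audio_dir, use_opus_archive and the two resampling parameters can be summarized in code. This is only a sketch of the rules as described in the parameter list, not the processor's actual implementation; the function name `plan_mls_audio` and the returned dictionary layout are hypothetical.

```python
def plan_mls_audio(
    resampled_audio_dir=None,
    use_opus_archive=False,
    target_samplerate=16000,
    target_nchannels=1,
):
    """Return a small description of what would happen to the audio.

    Mirrors the documented rules: OPUS archives cannot be resampled, and the
    resampling parameters are ignored when no output directory is given.
    """
    if use_opus_archive and resampled_audio_dir is not None:
        raise ValueError("resampled_audio_dir must be None when use_opus_archive is True")
    if resampled_audio_dir is None:
        # target_samplerate and target_nchannels are ignored in this case.
        return {"resample": False}
    return {
        "resample": True,
        "output_dir": resampled_audio_dir,
        "samplerate": target_samplerate,
        "nchannels": target_nchannels,
    }


print(plan_mls_audio())
print(plan_mls_audio(resampled_audio_dir="resampled"))
```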
- sdp.processors.RestorePCForMLS[source]#
Recovers original text from the MLS Librivox texts.
This processor can be used to restore punctuation and capitalization for the MLS data. Uses the original data in https://dl.fbaipublicfiles.com/mls/lv_text.tar.gz. Saves recovered text in restored_text_field field. If text was not recovered, restored_text_field will be equal to n/a.
- Parameters:
language_long (str) – the full name of the language, used for choosing the folder of the contents of “https://dl.fbaipublicfiles.com/mls/lv_text.tar.gz”. E.g., “english”, “spanish”, “italian”, etc.
language_short (str or None) – the short name of the language, used for specifying the normalizer we want to use. E.g., “en”, “es”, “it”, etc. If set to None, we will not try to normalize the provided Librivox text.
lv_text_dir (str) – the directory where the contents of https://dl.fbaipublicfiles.com/mls/lv_text.tar.gz will be saved.
submanifests_dir (str) – the directory where submanifests (one for each combo of speaker + book) will be stored.
restored_submanifests_dir (str) – the directory where restored submanifests (one for each combo of speaker + book) will be stored.
restored_text_field (str) – the field where the recovered text will be stored.
n_jobs (int) – number of jobs to use for parallel processing. Defaults to -1.
show_conversion_breakdown (bool) – whether to show how much of each submanifest was restored. Defaults to True.
- Returns:
All the same data as in the input manifest with an additional key:
<restored_text_field>: <restored text or n/a if match was not found>
VoxPopuli#
- sdp.processors.CreateInitialManifestVoxpopuli[source]#
Processor to create initial manifest for the VoxPopuli dataset.
Dataset link: facebookresearch/voxpopuli
Downloads and unzips raw VoxPopuli data for the specified language, and creates an initial manifest using the transcripts provided in the raw data.
Note
This processor will install a couple of Python packages, including PyTorch, so it might be a good idea to run it in an isolated Python environment.
- Parameters:
raw_data_dir (str) – the directory where the downloaded data will be/is saved.
language_id (str) – the language of the data you wish to be downloaded. E.g., “en”, “es”, “it”, etc.
data_split (str) – “train”, “dev” or “test”.
resampled_audio_dir (str) – the directory where the resampled wav files will be stored.
target_samplerate (int) – sample rate (Hz) to use for resampling. Defaults to 16000.
target_nchannels (int) – number of channels to create during resampling process. Defaults to 1.
- Returns:
This processor generates an initial manifest file with the following fields:
{ "audio_filepath": <path to the audio file>, "duration": <duration of the audio in seconds>, "text": <transcription (with provided normalization)>, "raw_text": <original transcription (without normalization)>, "speaker_id": <speaker id>, "gender": <speaker gender>, "age": <speaker age>, "is_gold_transcript": <whether the transcript has been verified>, "accent": <speaker accent, if known>, }
- sdp.processors.NormalizeFromNonPCTextVoxpopuli[source]#
Tries to restore punctuation and capitalization from the un-normalized text version.
VoxPopuli contains two versions of the transcription - original (non-normalized, but with punctuation and capitalization) and normalized (without punctuation or capitalization), but with digits and other forms normalized. This processor can be used to map the normalized and non-normalized versions and produce a normalized version with restored punctuation and capitalization.
Note
The current mapping logic is highly heuristic and might not work for all languages. The processor will return n/a for any text it was not able to restore, so make sure you check how much data was removed and consider updating the heuristics to retain more data.
- Parameters:
restored_text_field (str) – the field where the recovered text (or n/a) will be stored. Defaults to “text”.
raw_text_key (str) – which field contains the original text without normalization. Defaults to “raw_text”.
norm_text_key (str) – which field contains the normalized text. Defaults to “provided_norm_text”.
- Returns:
All the same data as in the input manifest with an additional key:
<restored_text_field>: <restored text or n/a if mapping failed>
CORAAL#
- sdp.processors.CreateInitialManifestCORAAL[source]#
Processor to create initial manifest for the Corpus of Regional African American Language (CORAAL) dataset.
Dataset link: https://oraal.github.io/coraal
Will download all files, extract tars and split wav files based on the provided durations in the transcripts.
- Parameters:
raw_data_dir (str) – where to put raw downloaded data.
resampled_audio_dir (str) – where to put re-sampled and trimmed wav files.
target_samplerate (int) – sample rate to resample to. Defaults to 16000.
target_nchannels (int) – target number of channels. Defaults to 1.
drop_pauses (bool) – if True, will drop all transcriptions that contain only silence (indicated by (pause X) in the transcript). Defaults to True.
group_duration_threshold (float) – can be used to group consecutive utterances from the same speaker to a longer duration. Set to 0 to disable this grouping (but note that many utterances are transcribed with only a few seconds, so grouping is generally advised). Defaults to 20.
- Returns:
This processor generates an initial manifest file with the following fields:
{ "audio_filepath": <path to the audio file>, "duration": <duration of the audio in seconds>, "text": <transcription>, "original_file": <name of the original file in the dataset this audio came from>, "speaker": <speaker id>, "is_interviewee": <whether this is an interviewee (accented speech)>, "gender": <speaker gender>, "age": <speaker age>, "education": <speaker education>, "occupation": <speaker occupation>, }
- sdp.processors.TrainDevTestSplitCORAAL[source]#
Custom train-dev-test split for CORAAL dataset.
Split is done speaker-wise, so the same speakers don’t appear in different splits.
- Parameters:
data_split (str) – train, dev or test.
- Returns:
All the same fields as in the input manifest, but only a subset of the data is retained.
Librispeech#
- sdp.processors.CreateInitialManifestLibrispeech[source]#
Processor to create initial manifest for the Librispeech dataset.
Dataset links: https://openslr.org/12 and https://openslr.org/31
Will download all files, extract tars, and create a manifest file with the “audio_filepath” and “text” fields.
- Parameters:
split (str) –
Which datasets or their combinations should be processed. Options are:
"dev-clean"
"dev-other"
"test-clean"
"test-other"
"train-clean-100"
"train-clean-360"
"train-other-500"
"dev-clean-2"
"train-clean-5"
raw_data_dir (str) – Path to the folder where the data archive should be downloaded and extracted.
- Returns:
This processor generates an initial manifest file with the following fields:
{ "audio_filepath": <path to the audio file>, "text": <transcription>, }
SLR83#
- sdp.processors.CreateInitialManifestSLR83[source]#
Processor to create initial manifest for the SLR83 dataset.
This is a dataset introduced in Open-source Multi-speaker Corpora of the English Accents in the British Isles.
- Parameters:
raw_data_dir (str) – where to put raw downloaded data.
dialect (str) –
should be one of the following:
irish_english_male
midlands_english_female
midlands_english_male
northern_english_female
northern_english_male
scottish_english_female
scottish_english_male
southern_english_female
southern_english_male
welsh_english_female
welsh_english_male
- Returns:
This processor generates an initial manifest file with the following fields:
{ "audio_filepath": <path to the audio file>, "duration": <duration of the audio in seconds>, "text": <transcription>, }
- sdp.processors.CustomDataSplitSLR83[source]#
Splits SLR83 data into train, dev or test subset.
The original paper does not provide train/dev/test splits, so we include a custom processing that can be used as a standardized split to compare results. For more details on this data split see Damage Control During Domain Adaptation for Transducer Based Automatic Speech Recognition.
Note
All data dropping has to be done before the split. We will check the total number of files to be what is expected in the reference split. But if you add any custom pre-processing that changes duration or number of files, your splits will likely be different.
- Parameters:
dialect (str) – same as in the sdp.processors.CreateInitialManifestSLR83.
data_split (str) – “train”, “dev” or “test”.
- Returns:
All the same fields as in the input manifest, but only a subset of the data is retained.
MTEDx#
- sdp.processors.CreateInitialManifestMTEDX[source]#
Processor to create initial manifest for the Multilingual TEDx (MTedX) dataset.
Dataset link: https://www.openslr.org/100/
Downloads dataset for the specified language and creates initial manifest with the provided audio and vtt files.
- Parameters:
raw_data_dir (str) – the directory where the downloaded data will be/is saved. This is also where the extracted and processed data will be.
data_split (str) – “train”, “dev” or “test”.
language_id (str) – the ID of the language of the data. E.g., “en”, “es”, “it”, etc.
target_samplerate (int) – sample rate (Hz) to use for resampling.
already_extracted (bool) – if True, we will not try to extract the raw data. Defaults to False.
- Returns:
This processor generates an initial manifest file with the following fields:
{ "audio_filepath": <path to the audio file>, "vtt_filepath": <path to the corresponding vtt file>, "duration": <duration of the audio in seconds>, }
CORAA#
- sdp.processors.CreateInitialManifestCORAA[source]#
Processor to create initial manifest file for the CORAA ASR dataset.
Dataset link: nilc-nlp/CORAA
- Parameters:
raw_data_dir (str) – the path to the directory in which all the data will be downloaded.
extract_archive_dir (str) – directory where the extracted data will be saved.
data_split (str) – “train”, “dev” or “test”.
resampled_audio_dir (str) – the directory where the resampled wav files will be stored.
already_extracted (bool) – if True, we will not try to extract the raw data. Defaults to False.
already_downloaded (bool) – if True, we will not try to download files.
target_samplerate (int) – sample rate (Hz) to use for resampling. Defaults to 16000.
target_nchannels (int) – number of channels to create during resampling process. Defaults to 1.
exclude_dataset (list) – list of the dataset names that will be excluded when creating initial manifest. Options: ‘SP2010’, ‘C-ORAL-BRASIL I’, ‘NURC-Recife’, ‘TEDx Talks’, ‘ALIP’.
FLEURS#
- sdp.processors.CreateInitialManifestFleurs[source]#
Processor to create initial manifest for the FLEURS dataset.
Dataset link: https://huggingface.co/datasets/google/fleurs
Will download all files, extract them, and create a manifest file with the “audio_filepath” and “text” fields.
- Parameters:
lang (str) –
Language to be processed, identified by a combination of ISO 639-1 and ISO 3166-1 alpha-2 codes. Examples are:
"hy_am" for Armenian
"ko_kr" for Korean
split (str) –
Which dataset splits to process. Options are:
"test"
"train"
"dev"
raw_data_dir (str) – Path to the folder where the data archive should be downloaded and extracted.
- Returns:
This processor generates an initial manifest file with the following fields:
{ "audio_filepath": <path to the audio file>, "text": <transcription>, }
UzbekVoice#
- sdp.processors.CreateInitialManifestUzbekvoice[source]#
Processor to create initial manifest for the Uzbekvoice dataset.
Will download all files, extract them, and create a manifest file with the “audio_filepath”, “text” and “duration” fields.
- Parameters:
raw_data_dir (str) – Path to the folder where the data archive should be downloaded and extracted.
- Returns:
This processor generates an initial manifest file with the following fields:
{ "audio_filepath": <path to the audio file>, "text": <transcription>, }
Lhotse processors#
The following processors leverage Lhotse, a speech data handling library that contains data preparation recipes for 80+ publicly available datasets. Lhotse has its own data manifest format that can be largely mapped into NeMo’s format.
- sdp.processors.LhotseImport[source]#
Processor to create an initial manifest imported from a Lhotse CutSet. The input_manifest_file is expected to point to a Lhotse CutSet manifest, which usually has cuts in its name and a .jsonl or .jsonl.gz extension.
Lhotse is a library for speech data processing and loading. It can be installed using pip install lhotse.
Caution
Currently we only support the importing of cut sets that represent single-channel, single-audio-file-per-utterance datasets.
- Returns:
This processor generates an initial manifest file with the following fields:
{ "audio_filepath": <path to the audio file>, "duration": <duration of the audio in seconds>, "text": <transcription (with capitalization and punctuation)>, }
Data enrichment#
The following processors can be used to add additional attributes to the data by running different NeMo models (e.g., ASR predictions). These attributes are typically used in the downstream processing for additional enhancement or filtering.
- sdp.processors.ASRInference[source]#
This processor performs ASR inference on each utterance of the input manifest.
ASR predictions will be saved in the pred_text key.
- Parameters:
pretrained_model (str) – the name of the pretrained NeMo ASR model which will be used to do inference.
batch_size (int) – the batch size to use for ASR inference. Defaults to 32.
- Returns:
The same data as in the input manifest with an additional field pred_text containing ASR model’s predictions.
- sdp.processors.PCInference[source]#
Adds predictions of a text-based punctuation and capitalization (P&C) model.
Operates on the text in the input_text_field, and saves predictions in the output_text_field.
- Parameters:
input_text_field (str) – the text field that will be the input to the P&C model.
output_text_field (str) – the text field where the output of the PC model will be saved.
batch_size (int) – the batch size used by the P&C model.
device (str) – the device used by the P&C model. Can be skipped to auto-select.
pretrained_name (str) – the pretrained_name of the P&C model.
model_path (str) – the model path to the P&C model.
Note
Either pretrained_name or model_path has to be specified.
- Returns:
The same data as in the input manifest with an additional field <output_text_field> containing P&C model’s predictions.
- sdp.processors.ASRTransformers[source]#
Processor to transcribe using ASR Transformers model from HuggingFace.
- Parameters:
pretrained_model (str) – name of pretrained model on HuggingFace.
output_text_key (str) – Key to save transcription result.
input_audio_key (str) – Key to read audio file. Defaults to “audio_filepath”.
input_duration_key (str) – Audio duration key. Defaults to “duration”.
device (str) – Inference device.
batch_size (int) – Inference batch size. Defaults to 1.
chunk_length_s (int) – Length of the chunks (in seconds) into which the input audio should be divided. Note: Some models perform the chunking on their own (for instance, Whisper chunks into 30s segments also by maintaining the context of the previous chunks).
torch_dtype (str) – Tensor data type. Defaults to “float32”.
max_new_tokens (Optional[int]) – The maximum number of new tokens to generate. If not specified, there is no hard limit on the number of tokens generated, other than model-specific constraints.
Text-only processors#
Note
All processors in this section accept an additional parameter text_key (defaults to “text”) to control which field is used for modifications/filtering.
- sdp.processors.ReadTxtLines[source]#
The text file specified in source_filepath will be read, and each line in it will be added as a line in the output manifest, saved in the field text_key.
- Parameters:
input_file_key (str) – The key in the manifest containing the input txt file path.
text_key (str) – The key to store the read text lines in the manifest.
**kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
Data modifications#
- sdp.processors.SubRegex[source]
Converts a regex match to a string, as defined by key-value pairs in regex_to_sub.
Before applying regex changes, we will add a space character to the beginning and end of the text and pred_text keys for each data entry. After the regex changes, the extra spaces are removed. This includes the spaces in the beginning and end of the text, as well as any double spaces "  ".
- Parameters:
regex_params_list (list[dict]) – list of dicts. Each dict must contain a pattern and a repl key, and optionally a count key (by default, count will be 0). This processor will go through the list in order, and apply a re.sub operation on the input text in data_entry[self.text_key], feeding in the specified pattern, repl and count parameters to re.sub.
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
- Returns:
The same data as in the input manifest with <text_key> field changed.
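The padding, substitution and clean-up steps described for SubRegex can be sketched with the standard `re` module. This is an illustration of the documented behavior under stated assumptions, not the processor's source code; the helper name `apply_regex_subs` and the example rules are hypothetical.

```python
import re


def apply_regex_subs(text, regex_params_list):
    """Pad with spaces, apply each re.sub in order, then clean up whitespace."""
    text = " " + text + " "
    for params in regex_params_list:
        text = re.sub(
            pattern=params["pattern"],
            repl=params["repl"],
            string=text,
            count=params.get("count", 0),  # 0 means "replace all matches"
        )
    # Collapse repeated spaces and strip the padding added above.
    return re.sub(r" +", " ", text).strip()


# Hypothetical rules: expand an abbreviation, then remove square brackets.
rules = [
    {"pattern": r" mr\. ", "repl": " mister "},
    {"pattern": r"[\[\]]", "repl": ""},
]
result = apply_regex_subs("mr. smith [laughs] left", rules)
print(result)
```

The leading and trailing spaces are what let a pattern such as `" mr\. "` match a word at the very start or end of an utterance.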
- sdp.processors.SubMakeLowercase[source]#
Processor to convert text to lowercase.
- Parameters:
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
- Returns:
The same data as in the input manifest with <text_key> field changed.
- sdp.processors.MakeLettersUppercaseAfterPeriod[source]#
Can be used to replace characters with upper-case version after punctuation.
- Parameters:
punctuation (str) – string with all punctuation characters to consider. Defaults to “.!?”.
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
- Returns:
The same data as in the input manifest with <text_key> field changed.
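One way to picture this transformation is a regex that finds a punctuation mark, the whitespace after it, and the next word character. The sketch below assumes that whitespace separates the punctuation from the following letter; the actual processor may handle other cases differently, and the function name `uppercase_after_punctuation` is hypothetical.

```python
import re


def uppercase_after_punctuation(text, punctuation=".!?"):
    """Upper-case the first word character that follows punctuation + whitespace."""
    pattern = re.compile(r"([" + re.escape(punctuation) + r"]\s+)(\w)")
    return pattern.sub(lambda m: m.group(1) + m.group(2).upper(), text)


result = uppercase_after_punctuation("hello. how are you? fine!")
print(result)
```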
- sdp.processors.SplitLineBySentence[source]#
Processor for splitting lines of text into sentences based on a specified pattern. One line containing N sentences will be transformed into N lines containing one sentence.
- Parameters:
text_key (str) – The field containing the text lines in the dataset.
end_pattern (str) – The regular expression pattern to identify sentence boundaries.
**kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
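The one-line-to-N-lines transformation can be sketched as follows. This is a simplified illustration, assuming end_pattern matches the character that closes a sentence and that every other field is copied unchanged into each output entry; the function name `split_by_sentence` and the default pattern are hypothetical.

```python
import re


def split_by_sentence(entry, text_key="text", end_pattern=r"[.!?]"):
    """Return one manifest entry per sentence found in entry[text_key]."""
    text = entry[text_key]
    sentences = []
    start = 0
    for match in re.finditer(end_pattern, text):
        sentence = text[start:match.end()].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    tail = text[start:].strip()  # text after the last sentence boundary
    if tail:
        sentences.append(tail)
    # Copy all other fields and overwrite only the text field.
    return [{**entry, text_key: s} for s in sentences]


entry = {"source": "book.txt", "text": "First one. Second one! Third"}
for out in split_by_sentence(entry):
    print(out)
```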
- sdp.processors.CountNumWords[source]#
Processor for counting the number of words in the text_key field saving the number in num_words_key.
- Parameters:
text_key (str) – The field containing the input text in the dataset.
num_words_key (str) – The field to store the number of words in the dataset.
alphabet (str) – Characters to be used to count words. Any other characters are substituted by whitespace and not taken into account.
**kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
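The role of the alphabet parameter is easiest to see in a small sketch: characters outside the alphabet become whitespace before the text is split into words. This assumes no case folding is applied, which may differ from the real processor; the function name `count_num_words` is used here only for illustration.

```python
import re


def count_num_words(entry, text_key="text", num_words_key="num_words", alphabet=None):
    """Store the number of whitespace-separated words in entry[num_words_key]."""
    text = entry[text_key]
    if alphabet:
        # Replace every character that is neither in the alphabet nor whitespace.
        text = re.sub("[^" + re.escape(alphabet) + r"\s]", " ", text)
    return {**entry, num_words_key: len(text.split())}


lowercase = "abcdefghijklmnopqrstuvwxyz"
counted = count_num_words({"text": "hello, world 42 times"}, alphabet=lowercase)
print(counted)
```

With the lowercase alphabet, the token "42" is turned into spaces and is not counted as a word.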
- sdp.processors.NormalizeText[source]#
This processor applies text normalization (TN) to the text. I.e. converts text from written form into its verbalized form. E.g., “$123” is converted to “one hundred and twenty-three dollars.”
- Parameters:
input_text_key (str) – the text field that will be the input to the Normalizer. Defaults to: text.
input_language (str) – language specifying the text normalization rules in ISO 639 Set 1 format. E.g., “en”, “es”, “it”, etc. Defaults to: English.
input_case (str) – input text capitalization, set to cased if text contains capital letters. This flag affects normalization rules applied to the text. Note, lower_cased won’t lower case input. Defaults to: cased.
output_text_key (str) – the text field that will be the output from the Normalizer. Defaults to: text.
- Returns:
This processor normalizes the text in the input_text_key field and saves the normalized text in output_text_key field.
- Raises:
NotImplementedError – when TN is not implemented for the requested language.
- sdp.processors.InverseNormalizeText[source]#
This processor applies inverse text normalization (ITN) to the text. I.e. transforms spoken forms of numbers, dates, etc into their written equivalents. E.g., “one hundred and twenty-three dollars.” is converted to “$123”.
- Parameters:
input_text_key (str) – the text field that will be the input to the InverseNormalizer. Defaults to: text.
input_language (str) – language specifying the text normalization rules in ISO 639 Set 1 format. E.g., “en”, “es”, “it”, etc. Defaults to: English.
input_case (str) – input text capitalization, set to cased if text contains capital letters. This flag affects normalization rules applied to the text. Note, lower_cased won’t lower case input. Defaults to: cased.
output_text_key (str) – the text field that will be the output from the InverseNormalizer. Defaults to: text.
- Returns:
This processor inverse normalizes the text in the input_text_key field and saves the inverse normalized text in output_text_key field.
- Raises:
NotImplementedError – when ITN is not implemented for the requested language.
Data filtering#
- sdp.processors.DropIfRegexMatch[source]#
Drops utterances if text matches a regex pattern.
Before applying regex checks, we will add a space character to the beginning and end of the text and pred_text keys for each data entry. After the regex checks, assuming the utterance isn’t dropped, the extra spaces are removed. This includes the spaces in the beginning and end of the text, as well as any double spaces "  ".
- Parameters:
regex_patterns (list[str]) – a list of strings. The list will be traversed in order. If data_entry.data[self.text_key] matches the regex, the entry will be dropped.
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
- Returns:
The same data as in the input manifest with some entries dropped.
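The filtering rule can be sketched in a few lines. The example assumes that "matches" means `re.search` finds the pattern anywhere in the space-padded text; the function name `drop_if_regex_match` and the sample patterns are hypothetical.

```python
import re


def drop_if_regex_match(entries, regex_patterns, text_key="text"):
    """Keep only the entries whose padded text matches none of the patterns."""
    kept = []
    for entry in entries:
        padded = " " + entry[text_key] + " "  # lets " uh " match at the edges
        if any(re.search(pattern, padded) for pattern in regex_patterns):
            continue
        kept.append(entry)
    return kept


entries = [
    {"text": "call 911 now"},
    {"text": "uh"},
    {"text": "hello there"},
]
kept = drop_if_regex_match(entries, [r"\d", r" uh "])
print(kept)
```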
- sdp.processors.DropIfNoneOfRegexMatch[source]#
Drops utterances if data[self.text_key] does not match any of regex_patterns.
Before applying regex checks, we will add a space character to the beginning and end of the text and pred_text keys for each data entry. After the regex checks, assuming the utterance isn’t dropped, the extra spaces are removed. This includes the spaces in the beginning and end of the text, as well as any double spaces "  ".
- Parameters:
regex_patterns (list[str]) – If data_entry[self.text_key] does not match any of the regex patterns in the list, that utterance will be dropped.
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
- Returns:
The same data as in the input manifest with some entries dropped.
- sdp.processors.DropNonAlphabet[source]#
Drops utterances if they contain characters that are not in the alphabet.
- Parameters:
alphabet (str) – a string containing all of the characters in our alphabet. If an utterance contains at least one character that is not in the alphabet, then that utterance will be dropped.
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
Note
Don’t forget to include spaces in your alphabet, unless you want to make sure none of the utterances contain spaces.
- Returns:
The same data as in the input manifest with some entries dropped.
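The alphabet check, including the note about spaces, can be illustrated with a set comparison. This is a minimal sketch of the documented rule rather than the processor's code; the function name `drop_non_alphabet` is hypothetical.

```python
def drop_non_alphabet(entries, alphabet, text_key="text"):
    """Keep entries whose text uses only characters from the alphabet."""
    allowed = set(alphabet)
    return [e for e in entries if set(e[text_key]) <= allowed]


entries = [
    {"text": "hello world"},
    {"text": "héllo"},
    {"text": "it's ok"},
]
with_space = drop_non_alphabet(entries, " abcdefghijklmnopqrstuvwxyz'")
without_space = drop_non_alphabet(entries, "abcdefghijklmnopqrstuvwxyz'")
print(with_space)
print(without_space)
```

Leaving the space out of the alphabet drops every multi-word utterance, which is why the note above recommends including it.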
- sdp.processors.DropOnAttribute[source]#
Drops utterances if attribute is set to True/False.
- Parameters:
key (str) – which key to use for dropping utterances.
drop_if_false (bool) – whether to drop if value is False. Defaults to dropping if True.
- Returns:
The same data as in the input manifest with some entries dropped.
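As an illustration, the following sketch filters on a boolean field such as the is_gold_transcript field produced for VoxPopuli. It assumes the attribute holds a plain boolean; the function name `drop_on_attribute` and the sample entries are hypothetical.

```python
def drop_on_attribute(entries, key, drop_if_false=False):
    """Drop entries whose boolean attribute equals the value selected for dropping."""
    value_to_drop = not drop_if_false  # default: drop entries where the value is True
    return [e for e in entries if bool(e[key]) != value_to_drop]


entries = [
    {"text": "verified", "is_gold_transcript": True},
    {"text": "unverified", "is_gold_transcript": False},
]
gold_only = drop_on_attribute(entries, "is_gold_transcript", drop_if_false=True)
non_gold = drop_on_attribute(entries, "is_gold_transcript")
print(gold_only, non_gold)
```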
ASR-based processors#
Note
All processors in this section depend on the sdp.processors.ASRInference. So make sure to include it in the config at some prior stage with an applicable ASR model.
Note
All processors in this section accept additional parameters text_key (defaults to “text”) and pred_text_key (defaults to “pred_text”) to control which fields contain transcription and ASR model predictions.
Data modifications#
- sdp.processors.SoxConvert[source]#
Processor for converting audio files from one format to another using Sox, and updating the dataset with the path to the converted audio files.
- Parameters:
converted_audio_dir (str) – Directory to store the converted audio files.
input_audio_file_key (str) – Field in the dataset representing the path to input audio files.
output_audio_file_key (str) – Field to store the path to the converted audio files in the dataset.
output_format (str) – Format of the output audio files (e.g., ‘wav’, ‘mp3’).
**kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
- sdp.processors.InsIfASRInsertion[source]#
Processor that adds substrings to transcription if they are present in ASR predictions.
Will insert substrings into data[self.text_key] if it is present at that location in data[self.pred_text_key]. It is useful if words are systematically missing from ground truth transcriptions.
- Parameters:
insert_words (list[str]) – list of strings that will be inserted into data[self.text_key] if there is an insertion (containing only that string) in data[self.pred_text_key].
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.
Note
Because this processor looks for an exact match in the insertion, we recommend including variations with different spaces in insert_words, e.g. [' nemo', 'nemo ', ' nemo '].
- Returns:
The same data as in the input manifest with <text_key> field changed.
- sdp.processors.SubIfASRSubstitution[source]#
Processor that substitutes substrings in the transcription if they are present in ASR predictions.
Will convert a substring in data[self.text_key] to a substring in data[self.pred_text_key] if both are located in the same place (i.e. are part of a ‘substitution’ operation) and if the substrings correspond to key-value pairs in sub_words. This is useful if words are systematically incorrect in ground truth transcriptions.
Before starting to look for substitution, this processor adds spaces at the beginning and end of data[self.text_key] and data[self.pred_text_key], to ensure that an argument like sub_words = {"nmo ": "nemo "} would cause a substitution to be made even if the original data[self.text_key] ends with "nmo" and data[self.pred_text_key] ends with "nemo".
- Parameters:
sub_words (dict) – dictionary where a key is a string that might be in data[self.text_key] and the value is the string that might be in data[self.pred_text_key]. If both are located in the same place (i.e. are part of a ‘substitution’ operation) then the key string will be converted to the value string in data[self.text_key].
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.
Note
This processor looks for exact string matches of substitutions, so you may need to be careful with spaces in sub_words. E.g. it is recommended to do sub_words = {"nmo ": "nemo "} instead of sub_words = {"nmo" : "nemo"}.
- Returns:
The same data as in the input manifest with <text_key> field changed.
Data filtering#
- sdp.processors.PreserveByValue[source]#
Processor for preserving dataset entries based on a specified condition involving a target value and an input field.
- Parameters:
input_value_key (str) – The field in the dataset entries to be evaluated.
target_value (Union[int, str]) – The value to compare with the input field.
operator (str) – (Optional) The operator to apply for comparison. Options: “lt” (less than), “le” (less than or equal to), “eq” (equal to), “ne” (not equal to), “ge” (greater than or equal to), “gt” (greater than). Defaults to “eq”.
**kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
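The operator names listed above correspond directly to functions in Python's standard `operator` module, so the filtering rule can be sketched as a lookup table. This is an illustration only; the function name `preserve_by_value`, its `operator_name` argument and the sample entries are hypothetical.

```python
import operator

# Map the documented operator names to the standard comparison functions.
OPERATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "gt": operator.gt,
}


def preserve_by_value(entries, input_value_key, target_value, operator_name="eq"):
    """Keep entries for which `entry[input_value_key] <op> target_value` holds."""
    compare = OPERATORS[operator_name]
    return [e for e in entries if compare(e[input_value_key], target_value)]


entries = [{"duration": 0.5}, {"duration": 3.0}, {"duration": 25.0}]
short_enough = preserve_by_value(entries, "duration", 20, operator_name="le")
print(short_enough)
```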
- sdp.processors.DropASRError[source]#
Drops utterances if there is a sufficiently long ASR mismatch anywhere in the utterance.
- Parameters:
consecutive_words_threshold (int) – will drop if there is a mismatch of at least this many words in a row.
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.
- Returns:
The same data as in the input manifest with some entries dropped.
- sdp.processors.DropASRErrorBeginningEnd[source]#
Drops utterances if there is a sufficiently long ASR mismatch at the beginning or end of the utterance.
- Parameters:
beginning_error_char_threshold (int) – if there is an insertion or deletion at the beginning of the utterance that has more characters than this number, then the utterance will be dropped. If there is a substitution at the beginning of the utterance, then the utterance will be dropped if abs(len(deletion) - len(insertion)) > beginning_error_char_threshold.
end_error_char_threshold (int) – if there is an insertion or deletion at the end of the utterance that has more characters than this number, then the utterance will be dropped. If there is a substitution at the end of the utterance, then the utterance will be dropped if abs(len(deletion) - len(insertion)) > end_error_char_threshold.
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.
- Returns:
The same data as in the input manifest with some entries dropped.
- sdp.processors.DropIfSubstringInInsertion[source]#
Drops utterances if a substring matches an ASR insertion.
Insertions are checked between data[self.text_key] and data[self.pred_text_key].
Note
We check for exact matches, so you need to be mindful of spaces, e.g. you may wish to do substrings_in_insertion = ["nemo "] instead of substrings_in_insertion = ["nemo"].
- Parameters:
substrings_in_insertion (list[str]) – a list of strings which might be inserted in predicted ASR text. If the insertion matches a string exactly, the utterance will be dropped.
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.
- Returns:
The same data as in the input manifest with some entries dropped.
- sdp.processors.DropHighCER[source]#
Drops utterances if there is a sufficiently high character-error-rate (CER).
CER is measured between data[self.text_key] and data[self.pred_text_key].
Note
We only drop the utterance if CER > threshold (i.e. strictly greater than) so that if we set the threshold to 0, we will not remove utterances with CER == 0.
- Parameters:
cer_threshold (float) – CER threshold above which the utterance will be dropped.
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.
- Returns:
The same data as in the input manifest with some entries dropped.
- sdp.processors.DropHighWER[source]#
Drops utterances if there is a sufficiently high word-error-rate (WER).
WER is measured between data[self.text_key] and data[self.pred_text_key].
Note
We only drop the utterance if WER > threshold (i.e. strictly greater than) so that if we set the threshold to 0, we will not remove utterances with WER == 0.
- Parameters:
wer_threshold (float) – WER threshold above which the utterance will be dropped.
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.
- Returns:
The same data as in the input manifest with some entries dropped.
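To make the strict comparison concrete, here is a minimal sketch that computes WER with a plain word-level Levenshtein distance and applies the WER > threshold rule. The function names are hypothetical, and expressing WER as a percentage is an assumption made for the example; SDP's own computation may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer_percent(text, pred_text):
    """WER as a percentage (assumption: SDP may use a different scale)."""
    ref = text.split()
    hyp = pred_text.split()
    # max(..., 1) avoids dividing by zero for an empty reference.
    return 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)


def should_drop(entry, wer_threshold, text_key="text", pred_text_key="pred_text"):
    """Drop only if WER is strictly greater than the threshold."""
    return wer_percent(entry[text_key], entry[pred_text_key]) > wer_threshold


perfect = {"text": "hello world", "pred_text": "hello world"}
one_error = {"text": "hello world", "pred_text": "hello word"}
```

With a threshold of 0, the perfect entry is kept because its WER equals the threshold, while the entry with one substituted word is dropped.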
- sdp.processors.DropLowWordMatchRate[source]#
Drops utterances if there is a sufficiently low word-match-rate (WMR).
WMR is measured between data[self.text_key] and data[self.pred_text_key].
Note
We only drop the utterance if WMR < threshold (i.e. strictly lower than) so that if we set the threshold to 100, we will not remove utterances with WMR == 100.
- Parameters:
wmr_threshold (float) – WMR threshold below which the utterance will be dropped.
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.
- Returns:
The same data as in the input manifest with some entries dropped.
- sdp.processors.DropHighLowCharrate[source]
Drops utterances if their character rate is too low or too high.
Character rate = (num of characters in self.text_key) / (duration of audio). A too-low or too-high character rate often implies that the ground truth transcription might be inaccurate.
- Parameters:
high_charrate_threshold (float) – upper character rate threshold. If the character rate of an utterance is higher than this number, the utterance will be dropped.
low_charrate_threshold (float) – lower character rate threshold. If the character rate of an utterance is lower than this number, the utterance will be dropped.
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
- Returns:
The same data as in the input manifest with some entries dropped.
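The character-rate formula above can be illustrated with a short sketch. It assumes the duration is stored under a "duration" key and that every character, including spaces, is counted; neither detail is confirmed here, and the helper names are hypothetical.

```python
def char_rate(entry, text_key="text"):
    """Characters per second. Assumes spaces count and a 'duration' key exists."""
    return len(entry[text_key]) / entry["duration"]


def should_drop(entry, high_charrate_threshold, low_charrate_threshold, text_key="text"):
    """Drop if the rate falls outside [low, high]."""
    rate = char_rate(entry, text_key)
    return rate > high_charrate_threshold or rate < low_charrate_threshold


# Made-up entry: 11 characters spoken in 1 second.
entry = {"text": "hello world", "duration": 1.0}
```

The word-rate check described for DropHighLowWordrate follows the same pattern, with the number of words in place of the number of characters.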
- sdp.processors.DropHighLowWordrate[source]#
Drops utterances if their word rate is too low or too high.
Word rate = (num of words in self.text_key) / (duration of audio). A too-low or too-high word rate often implies that the ground truth transcription might be inaccurate.
- Parameters:
high_wordrate_threshold (float) – upper word rate threshold. If the word rate of an utterance is higher than this number, the utterance will be dropped.
low_wordrate_threshold (float) – lower word rate threshold. If the word rate of an utterance is lower than this number, the utterance will be dropped.
text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.
- Returns:
The same data as in the input manifest with some entries dropped.
- sdp.processors.DropHighLowDuration[source]#
Drops utterances if their duration is too low or too high.
- Parameters:
high_duration_threshold (float) – upper duration threshold (in seconds). If the duration of an utterance’s audio is higher than this number, the utterance will be dropped.
low_duration_threshold (float) – lower duration threshold (in seconds). If the duration of an utterance’s audio is lower than this number, the utterance will be dropped.
duration_key (str) – a string indicating which key of the data entries should be used to find the utterance duration. Defaults to “duration”.
- Returns:
The same data as in the input manifest with some entries dropped.
- sdp.processors.DropRepeatedFields[source]#
Drops utterances from the current manifest if their text fields are present in other manifests.
This class processes multiple manifest files and removes entries from the current manifest if the text field matches any entry in the other manifests. It allows for optional punctuation removal from the text fields before performing the check.
Note
It is better to process Test/Dev/Train and then Other.tsv.
- Parameters:
manifests_paths (list[str]) – List of paths to the manifest files to check against.
current_manifest_file (str) – Path to the current manifest file to be processed.
punctuations (str) – (Optional): String of punctuation characters to be removed from the text fields before checking for duplicates. Defaults to None.
text_key (str) – The key in the manifest entries that contains the text field. Defaults to “text”.
- Returns:
The same data as in the input manifest with some entries dropped.
Miscellaneous#
- sdp.processors.AddConstantFields[source]#
This processor adds constant fields to all manifest entries.
E.g., it can be useful to add a fixed label: <language> field for downstream language identification model training.
- Parameters:
fields – dictionary with any additional information to add. E.g.:
fields = {
    "label": "en",
    "metadata": "mcv-11.0-2022-09-21",
}
- Returns:
The same data as in the input manifest with added fields as specified in the fields input dictionary.
- sdp.processors.CombineSources[source]#
Can be used to create a single field from two alternative sources.
E.g.:
_target_: sdp.processors.CombineSources
sources:
  - field: text_pc
    origin_label: original
  - field: text_pc_pred
    origin_label: synthetic
  - field: text
    origin_label: no_pc
target: text
will populate the text field with data from the text_pc field if it’s present and not equal to n/a (can be customized). If text_pc is not available, it will populate text from the text_pc_pred field, following the same rules. If neither is available, it will fall back to the text field itself. In all cases it will specify which source was used in the text_origin field by using the label from the origin_label field. If none of the sources is available, it will populate both the target and the origin fields with n/a.
- Parameters:
sources (list[dict]) – list of the sources to use in order of preference. Each element in the list should be in the following format:
{
    field: <which field to take the data from>
    origin_label: <what to write in the "<target>_origin" field>
}
target (str) – target field that we are populating.
na_indicator (str) – if any source field has text equal to the na_indicator it will be considered as not available. If none of the sources are present, this will also be used as the value for the target and origin fields. Defaults to n/a.
- Returns:
The same data as in the input manifest enhanced with the following fields:
<target>: <populated with data from either <source1> or <source2> or with <na_indicator> if none are available>
<target>_origin: <label that marks where the data came from>
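The order-of-preference fallback described for CombineSources can be sketched as follows. This is an illustration of the documented behaviour under stated assumptions, not the SDP source; the function name combine_sources and the sample entries are hypothetical.

```python
def combine_sources(entry, sources, target, na_indicator="n/a"):
    """Fill `target` from the first available source, in order of preference."""
    result = dict(entry)
    for source in sources:
        value = entry.get(source["field"])
        # A source counts as available if it is present and not the NA marker.
        if value is not None and value != na_indicator:
            result[target] = value
            result[f"{target}_origin"] = source["origin_label"]
            return result
    # No source was available: mark both fields with the NA indicator.
    result[target] = na_indicator
    result[f"{target}_origin"] = na_indicator
    return result


# Same sources as in the YAML example above.
sources = [
    {"field": "text_pc", "origin_label": "original"},
    {"field": "text_pc_pred", "origin_label": "synthetic"},
    {"field": "text", "origin_label": "no_pc"},
]

entry = {"text_pc": "n/a", "text_pc_pred": "Hello, world.", "text": "hello world"}
combined = combine_sources(entry, sources, target="text")
```

Here text_pc is skipped because it equals the NA indicator, so text is taken from text_pc_pred and text_origin is set to "synthetic".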
- sdp.processors.DuplicateFields[source]#
This processor duplicates fields in all manifest entries.
It is useful for when you want to do downstream processing of a variant of the entry. E.g. make a copy of “text” called “text_no_pc”, and remove punctuation from “text_no_pc” in downstream processors.
- Parameters:
duplicate_fields (dict) – dictionary where keys are the original fields to be copied and their values are the new names of the duplicate fields.
- Returns:
The same data as in the input manifest with duplicated fields as specified in the duplicate_fields input dictionary.
Example
- _target_: sdp.processors.modify_manifest.common.DuplicateFields
  input_manifest_file: ${workspace_dir}/test1.json
  output_manifest_file: ${workspace_dir}/test2.json
  duplicate_fields: {"text":"answer"}
- sdp.processors.RenameFields[source]#
This processor renames fields in all manifest entries.
- Parameters:
rename_fields – dictionary where keys are the fields to be renamed and their values are the new names of the fields.
- Returns:
The same data as in the input manifest with renamed fields as specified in the rename_fields input dictionary.
- sdp.processors.SplitOnFixedDuration[source]#
This processor splits audio into fixed-length segments.
It does not actually create different audio files, but simply adds corresponding offset and duration fields. These fields can be automatically processed by NeMo to split audio on the fly during training.
- Parameters:
segment_duration (float) – fixed desired duration of each segment.
drop_last (bool) – whether to drop the last segment if total duration is not divisible by desired segment duration. If False, the last segment will be of a different length which is < segment_duration. Defaults to True.
drop_text (bool) – whether to drop text from entries as it is most likely inaccurate after the split on duration. Defaults to True.
- Returns:
The same data as in the input manifest but all audio that’s longer than the segment_duration will be duplicated multiple times with additional offset and duration fields. If drop_text=True, the text field will also be dropped from all entries.
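As a rough illustration of the offset/duration bookkeeping, the sketch below splits one entry into fixed-length segments. It is not the SDP implementation: the function name is hypothetical, and how SDP treats entries shorter than segment_duration is not specified here, so the sketch simply applies the same rule to them.

```python
def split_on_fixed_duration(entry, segment_duration, drop_last=True, drop_text=True):
    """Return one copy of `entry` per segment, with offset and duration set."""
    total = entry["duration"]
    n_full = int(total // segment_duration)

    # (offset, duration) pairs for all full-length segments.
    spans = [(i * segment_duration, segment_duration) for i in range(n_full)]

    # Optionally keep the shorter trailing segment.
    remainder = total - n_full * segment_duration
    if not drop_last and remainder > 0:
        spans.append((n_full * segment_duration, remainder))

    segments = []
    for offset, duration in spans:
        segment = dict(entry)
        segment["offset"] = offset
        segment["duration"] = duration
        if drop_text:
            segment.pop("text", None)
        segments.append(segment)
    return segments


# Made-up 25-second entry split into 10-second segments.
entry = {"audio_filepath": "a.wav", "duration": 25.0, "text": "some transcript"}
dropped_tail = split_on_fixed_duration(entry, 10.0)
kept_tail = split_on_fixed_duration(entry, 10.0, drop_last=False, drop_text=False)
```

With the defaults, the 25-second entry yields two 10-second segments without text; with drop_last=False a third, 5-second segment is added.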
- sdp.processors.ChangeToRelativePath[source]#
This processor changes the audio filepaths to be relative.
- Parameters:
base_dir – typically a folder where manifest file is going to be stored. All paths will be relative to that folder.
- Returns:
The same data as in the input manifest with the audio_filepath key changed to contain the path relative to base_dir.
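A minimal sketch of the path rewrite, using os.path.relpath from the standard library. Whether SDP uses this exact call is an assumption, and the helper name and sample paths are made up.

```python
import os


def to_relative(entry, base_dir):
    """Return a copy of `entry` with audio_filepath made relative to base_dir."""
    updated = dict(entry)
    updated["audio_filepath"] = os.path.relpath(entry["audio_filepath"], base_dir)
    return updated


entry = {"audio_filepath": "/data/mcv/audio/a.wav", "duration": 3.2}
relative = to_relative(entry, "/data/mcv")
```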
- sdp.processors.SortManifest[source]#
Processor which will sort the manifest by some specified attribute.
- Parameters:
attribute_sort_by (str) – the attribute by which the manifest will be sorted.
descending (bool) – if set to False, attribute will be in ascending order. If True, attribute will be in descending order. Defaults to True.
- Returns:
The same entries as in the input manifest, but sorted based on the provided parameters.
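The effect of the two parameters can be shown with Python's built-in sorted. This is only a sketch with a hypothetical function name; note that, per the description above, descending defaults to True.

```python
def sort_manifest(entries, attribute_sort_by, descending=True):
    """Sort manifest entries by one attribute (descending by default)."""
    return sorted(entries, key=lambda e: e[attribute_sort_by], reverse=descending)


entries = [{"duration": 2.0}, {"duration": 7.5}, {"duration": 0.4}]
longest_first = sort_manifest(entries, "duration")
shortest_first = sort_manifest(entries, "duration", descending=False)
```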
- sdp.processors.KeepOnlySpecifiedFields[source]#
Saves a copy of a manifest but only with a subset of the fields.
Typically will be the final processor to save only relevant fields in the desired location.
- Parameters:
fields_to_keep (list[str]) – list of the fields in the input manifest that we want to retain. The output file will only contain these fields.
- Returns:
The same data as in the input manifest, but re-saved in the new location with only fields_to_keep fields retained.
- sdp.processors.GetAudioDuration[source]#
Processor that computes the duration of the file in audio_filepath_key (using soundfile) and saves the duration in duration_key. If there is an error computing the duration, the value at duration_key will be updated with the value -1.0.
- Parameters:
audio_filepath_key (str) – Key to get path to wav file.
duration_key (str) – Key under which to store the audio duration.
- Returns:
All the same fields as in the input manifest plus duration_key.
- sdp.processors.FfmpegConvert[source]#
Processor for converting video or audio files to audio using FFmpeg and updating the dataset with the path to the resampled audio. If id_key is not None, the output file path will be <resampled_audio_dir>/<id_key>.wav. If id_key is None, the output file path will be <resampled_audio_dir>/<input file name without extension>.wav.
Note
id_key can be used to create subdirectories inside resampled_audio_dir (by using forward slashes /). E.g. if id_key takes the form dir_name1/dir_name2/filename, the output file path will be <resampled_audio_dir>/dir_name1/dir_name2/filename.wav.
- Parameters:
converted_audio_dir (str) – The directory to store the resampled audio files.
input_file_key (str) – The field in the dataset representing the path to the input video or audio files.
output_file_key (str) – The field in the dataset representing the path to the resampled audio files with output_format. If id_key is None, the output file path will be <resampled_audio_dir>/<input file name without extension>.wav.
id_key (str) – (Optional) The field in the dataset representing the unique ID or identifier for each entry. If id_key is not None, the output file path will be <resampled_audio_dir>/<id_key>.wav. Defaults to None.
output_format (str) – (Optional) Format of the output audio files. Defaults to wav.
target_samplerate (int) – (Optional) The target sampling rate for the resampled audio. Defaults to 16000.
target_nchannels (int) – (Optional) The target number of channels for the resampled audio. Defaults to 1.
**kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
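The output-path rule described above can be sketched as below. Only the path construction is shown, with no FFmpeg call; the function name is hypothetical and the resampled_audio_dir argument simply mirrors the placeholder used in the description.

```python
import os


def output_path(entry, input_file_key, resampled_audio_dir, id_key=None, output_format="wav"):
    """Build the output file path following the id_key rule described above."""
    if id_key is not None:
        # The ID may contain forward slashes, which create subdirectories.
        stem = entry[id_key]
    else:
        # Fall back to the input file name without its extension.
        stem = os.path.splitext(os.path.basename(entry[input_file_key]))[0]
    return os.path.join(resampled_audio_dir, f"{stem}.{output_format}")


# Made-up entry for illustration.
entry = {"source": "/raw/videos/clip_01.mp4", "id": "dir_name1/dir_name2/filename"}
by_name = output_path(entry, "source", "out")
by_id = output_path(entry, "source", "out", id_key="id")
```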
- sdp.processors.CreateInitialManifestByExt[source]#
Processor for creating an initial dataset manifest by saving filepaths with a common extension to the field specified in output_file_key.
- Parameters:
raw_data_dir (str) – The root directory of the files to be added to the initial manifest. This processor will recursively look for files with the extension ‘extension’ inside this directory.
output_file_key (str) – The key to store the paths to the files in the dataset.
extension (str) – The file extension of the files to be added to the manifest.
**kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
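The recursive search can be illustrated with pathlib. This sketch assumes the extension is given without a leading dot, which is not confirmed here; the function name and the file names are made up.

```python
import tempfile
from pathlib import Path


def create_initial_manifest(raw_data_dir, output_file_key, extension):
    """Recursively collect files ending in .<extension> into manifest entries."""
    paths = sorted(Path(raw_data_dir).rglob(f"*.{extension}"))
    return [{output_file_key: str(path)} for path in paths]


# Build a small throwaway directory tree to search.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.mp3", "sub/b.mp3", "notes.txt"):
        (root / name).touch()

    manifest = create_initial_manifest(tmp, "source_filepath", "mp3")
    found = [Path(item["source_filepath"]).name for item in manifest]
```

Only the two .mp3 files are picked up, including the one in the subdirectory.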
- sdp.processors.ApplyInnerJoin[source]#
Applies inner join to two manifests, i.e. creates a manifest from records that have matching values in both manifests. For more information, please refer to the Pandas merge function documentation: https://pandas.pydata.org/docs/reference/api/pandas.merge.html#pandas.merge
- Parameters:
column_id (Union[str, List[str], None]) – Field names to join on. These must be found in both manifests. If column_id is None then this defaults to the intersection of the columns in both manifests. Defaults to None.
left_manifest_file (Optional[str]) – path to the left manifest. Defaults to input_manifest_file.
right_manifest_file (str) – path to the right manifest.
- Returns:
Inner join of two manifests.
Base classes#
This section lists all the base classes you might need to know about if you want to add new SDP processors.
BaseProcessor#
- class sdp.processors.base_processor.BaseProcessor(output_manifest_file: str, input_manifest_file: str | None = None)[source]#
Bases: ABC
Abstract class for SDP processors.
All processor classes inherit from the BaseProcessor class. This is a simple abstract class which has 2 empty methods: process() and test(). These serve to remind us that SDP essentially just runs .test() on all processors (to implement run-time tests), and then .process() on all processors.
- Parameters:
output_manifest_file (str) – path of where the output manifest file will be located. Cannot have the same value as input_manifest_file.
input_manifest_file (str) – path of where the input manifest file is located. This arg is optional - some processors may not take in an input manifest because they need to create an initial manifest from scratch (i.e. from some transcript file that is in a format different to the NeMo manifest format). Cannot have the same value as output_manifest_file.
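The "test everything, then process everything" order can be sketched with a small stand-in class. SketchProcessor, LoggingProcessor and run_pipeline are hypothetical names used only for illustration; they are not part of SDP.

```python
from abc import ABC, abstractmethod


class SketchProcessor(ABC):
    """Hypothetical stand-in for BaseProcessor (not the SDP class itself)."""

    @abstractmethod
    def process(self):
        """Do the actual work."""

    def test(self):
        """Run-time checks; does nothing by default."""


def run_pipeline(processors):
    # First run all run-time tests, so nothing is processed if a test fails.
    for processor in processors:
        processor.test()
    # Then run the processors in order.
    for processor in processors:
        processor.process()


calls = []


class LoggingProcessor(SketchProcessor):
    def __init__(self, name):
        self.name = name

    def test(self):
        calls.append(("test", self.name))

    def process(self):
        calls.append(("process", self.name))


run_pipeline([LoggingProcessor("a"), LoggingProcessor("b")])
```

After the run, calls records both test() calls before either process() call.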
BaseParallelProcessor#
- class sdp.processors.base_processor.BaseParallelProcessor(max_workers: int = -1, chunksize: int = 100, in_memory_chunksize: int = 1000000, test_cases: List[Dict] | None = None, **kwargs)[source]#
Bases: BaseProcessor
Processor class which allows operations on each utterance to be parallelized.
Parallelization is done using tqdm.contrib.concurrent.process_map inside the process() method. Actual processing should be defined on a per-example basis inside the process_dataset_entry() method.
See the documentation of all the methods for more details.
- Parameters:
max_workers (int) – maximum number of workers that will be spawned during the parallel processing.
chunksize (int) – the size of the chunks that will be sent to worker processes during the parallel processing.
in_memory_chunksize (int) – the maximum number of input data entries that will be read, processed and saved at a time.
test_cases (list[dict]) – an optional list of dicts containing test cases for checking that the processor makes the changes that we are expecting. The dicts must have a key input, the value of which is a dictionary containing data which is our test’s input manifest line, and a key output, the value of which is a dictionary containing data which is the expected output manifest line.
- process()[source]#
Parallelized implementation of the data processing.
The execution flow of this method is the following.
1. prepare() is called. It’s empty by default but can be used to e.g. download the initial data files or compute some aggregates required for subsequent processing.
2. A for-loop begins that loops over all manifest_chunk lists yielded by the _chunk_manifest() method. _chunk_manifest() reads data entries yielded by read_manifest() and yields lists containing in_memory_chunksize data entries.
   Inside the for-loop:
   - process_dataset_entry() is called in parallel on each element of the manifest_chunk list.
   - All metrics are aggregated.
   - All output data-entries are added to the contents of output_manifest_file.
   Note:
   - The default implementation of read_manifest() reads an input manifest file and returns a list of dictionaries, one per line (we assume a standard NeMo format of one json per line).
   - process_dataset_entry() is called in parallel on each element of the list created in the previous step. Note that you cannot create any new counters or modify the attributes of this class in any way inside that function as this will lead to undefined behavior.
   - Each call to process_dataset_entry() returns a list of DataEntry objects that are then aggregated together. DataEntry simply defines data and metrics keys.
   - If data is set to None, the objects are ignored (metrics are still collected).
3. All metrics keys that were collected in the for-loop above are passed over to finalize() for any desired metric aggregation and reporting.
Here is a diagram outlining the execution flow of this method:
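In code, the flow described above could be sketched roughly as follows. This is a simplified, sequential stand-in written for illustration: SDP runs the per-entry step in parallel, whereas the sketch uses the built-in map, leaves out prepare() and finalize(), and collects output in a list instead of writing to output_manifest_file. The names chunk_manifest, process and drop_short are hypothetical.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Any, Dict, Optional


@dataclass
class DataEntry:
    data: Optional[Dict]  # None means "drop this entry"
    metrics: Any = None


def chunk_manifest(entries, in_memory_chunksize):
    """Yield lists of at most in_memory_chunksize entries."""
    iterator = iter(entries)
    while True:
        chunk = list(islice(iterator, in_memory_chunksize))
        if not chunk:
            return
        yield chunk


def process(entries, process_dataset_entry, in_memory_chunksize=2):
    output, metrics = [], []
    for manifest_chunk in chunk_manifest(entries, in_memory_chunksize):
        # Sequential stand-in for the parallel per-entry step.
        for results in map(process_dataset_entry, manifest_chunk):
            for data_entry in results:
                metrics.append(data_entry.metrics)  # always collected
                if data_entry.data is not None:     # dropped entries are skipped
                    output.append(data_entry.data)
    # In SDP, the collected metrics would then be passed to finalize().
    return output, metrics


def drop_short(entry):
    """Example per-entry function: drop entries shorter than one second."""
    if entry["duration"] < 1.0:
        return [DataEntry(data=None, metrics=entry["duration"])]
    return [DataEntry(data=entry, metrics=0.0)]


entries = [{"duration": 0.5}, {"duration": 2.0}, {"duration": 3.0}]
output, metrics = process(entries, drop_short, in_memory_chunksize=2)
```

The dropped entry does not appear in output, but its metric is still collected, matching the note that metrics are gathered even when data is None.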
- prepare()[source]#
Can be used in derived classes to prepare the processing in any way.
E.g., download data or compute some aggregates. Will be called before starting processing the data.
- read_manifest()[source]#
Reading the input manifest file.
Note
This function should be overridden in the “initial” class creating manifest to read from the original source of data.
- abstract process_dataset_entry(data_entry) → List[DataEntry][source]#
Needs to be implemented in the derived classes.
Each returned value should be a DataEntry object that will hold a dictionary (or anything else that can be json-serialized) with the actual data + any additional metrics required for statistics reporting. Those metrics can be used in finalize() to prepare for final reporting.
DataEntry is a simple dataclass defined in the following way:
@dataclass
class DataEntry:
    # can be None to drop the entry
    data: Optional[Dict]
    # anything - you'd need to aggregate all
    # values in the finalize method manually
    metrics: Any = None
Note
This method should always return a list of objects to allow a one-to-many mapping. E.g., if you want to cut an utterance into multiple smaller parts, you can return a list of all the produced utterances and they will be handled correctly.
The many-to-one mapping is not currently supported by design of this method (but can still be done if you don’t inherit from this class and process the data sequentially).
- Parameters:
data_entry – most often, data_entry will be a dictionary containing items which represent the JSON manifest entry. Sometimes, such as in sdp.processors.CreateInitialManifestMLS, it will be a string containing a line for that utterance from the original raw MLS transcript. In general it is an element of the list returned from the read_manifest() method.
- finalize(metrics: List)[source]#
Can be used to output statistics about the processed data.
By default outputs new number of entries/hours.
- Parameters:
metrics (list) – a list containing all metrics keys from the data entries returned from the process_dataset_entry() method.
Runtime tests#
Before running the specified processors, SDP runs processor.test() on all specified processors.
A test method is provided in sdp.processors.base_processor.BaseParallelProcessor.test(), which
checks that for a given input data entry, the output data entry/entries produced by the processor
will match the expected output data entry/entries. Note that this essentially only checks that the
impact on the data manifest will be as expected. If you want to do some other checks, you will need
to override this test method.
The input data entry and the expected output data entry/entries for
sdp.processors.base_processor.BaseParallelProcessor.test() are specified inside the optional list
of test_cases that were provided in the object constructor.
This means you can provide test cases in the YAML config file, and the
dataset will only be processed if the test cases pass.
This is helpful to (a) make sure that the rules you wrote have the effect you desired, and (b) demonstrate why you wrote those rules. An example of test cases we could include in the YAML config file:
- _target_: sdp.processors.DropIfRegexMatch
  regex_patterns:
    - "(\\D ){5,20}" # looks for 5 to 20 consecutive non-digit characters that are each followed by a space
  test_cases:
    - {input: {text: "some s p a c e d out letters"}, output: null}
    - {input: {text: "normal words only"}, output: {text: "normal words only"}}
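To see why the two test cases above should pass, the regex can be checked directly in Python. The sketch assumes the processor drops an entry whenever the pattern is found anywhere in the text (re.search semantics), which is an assumption about DropIfRegexMatch rather than something stated here; drop_if_regex_match is a hypothetical helper.

```python
import re

# The YAML string "(\\D ){5,20}" corresponds to this raw Python pattern:
# 5 to 20 repetitions of "one non-digit character followed by a space".
pattern = re.compile(r"(\D ){5,20}")

test_cases = [
    {"input": {"text": "some s p a c e d out letters"}, "output": None},
    {"input": {"text": "normal words only"}, "output": {"text": "normal words only"}},
]


def drop_if_regex_match(entry, text_key="text"):
    """Return None (drop) if the pattern is found, otherwise the entry itself."""
    return None if pattern.search(entry[text_key]) else entry


results = [drop_if_regex_match(case["input"]) == case["output"] for case in test_cases]
```

The spaced-out text matches because "e s p a c e d " contains seven such character-plus-space pairs in a row, while normal words never produce five in a row.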