API#

Available processors#

Here is the full list of all available processors and their supported arguments.

Note

All SDP processors optionally accept input_manifest_file and output_manifest_file keys. See Special fields section for more details.

Dataset-specific processors#

MCV#

sdp.processors.CreateInitialManifestMCV[source]#

Processor to create initial manifest for the Mozilla Common Voice (MCV) dataset.

Dataset link: https://commonvoice.mozilla.org/

Extracts raw MCV data for the specified language and creates an initial manifest using the transcripts provided in the raw data.

Parameters:
  • raw_data_dir (str) – the path to the directory containing the raw data archive file. Needs to be manually downloaded from https://commonvoice.mozilla.org/.

  • extract_archive_dir (str) – directory where the extracted data will be saved.

  • resampled_audio_dir (str) – directory where the resampled audio will be saved.

  • data_split (str) – “train”, “dev” or “test”.

  • language_id (str) – the ID of the language of the data. E.g., “en”, “es”, “it”, etc.

  • already_extracted (bool) – if True, we will not try to extract the raw data. Defaults to False.

  • target_samplerate (int) – sample rate (Hz) to use for resampling. Defaults to 16000.

  • target_nchannels (int) – number of channels to create during resampling process. Defaults to 1.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription (with capitalization and punctuation)>,
}
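
A manifest with these fields is usually stored as JSON lines, one utterance per line. The sketch below assumes that layout (this page does not state the on-disk format) and uses made-up paths and values; `write_manifest` and `read_manifest` are hypothetical helpers, not part of SDP.

```python
import io
import json


def write_manifest(entries, fp):
    """Write one JSON object per line (assumed manifest layout)."""
    for entry in entries:
        fp.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_manifest(fp):
    """Read the entries back, skipping blank lines."""
    return [json.loads(line) for line in fp if line.strip()]


# Illustrative entry with the three fields listed above (values are invented).
entries = [
    {"audio_filepath": "clips/sample_0001.wav", "duration": 3.2, "text": "Hello, world."},
]

buf = io.StringIO()  # stands in for a file opened in text mode
write_manifest(entries, buf)
buf.seek(0)
loaded = read_manifest(buf)
```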

MLS#

sdp.processors.CreateInitialManifestMLS[source]#

Processor to create initial manifest for the Multilingual LibriSpeech (MLS) dataset.

Dataset link: https://www.openslr.org/94/

Downloads and unzips raw MLS data for the specified language, and creates an initial manifest using the transcripts provided in the raw data.

Parameters:
  • raw_data_dir (str) – the directory where the downloaded data will be/is saved. This is also where the extracted and processed data will be.

  • language (str) – the language of the data you wish to be downloaded. This will be used to format the URL from which we attempt to download the data. E.g., “english”, “italian”, “spanish”, etc.

  • data_split (str) – “train”, “dev” or “test”.

  • resampled_audio_dir (str or None) – if specified, the directory where the resampled wav files will be stored. If not specified, the audio will not be resampled and the parameters target_samplerate and target_nchannels will be ignored.

  • target_samplerate (int) – sample rate (Hz) to use for resampling. This parameter will be ignored if resampled_audio_dir is None. Defaults to 16000.

  • target_nchannels (int) – number of channels to create during resampling process. This parameter will be ignored if resampled_audio_dir is None. Defaults to 1.

  • use_opus_archive (bool) – if True, will use the version of the archive file which contains audio files saved in the OPUS format instead of FLAC. The OPUS files take up less disk space than the FLAC files, at the cost of lower audio quality. If True, the parameter resampled_audio_dir must be None, as resampling OPUS audio files is currently not supported. Defaults to False.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription>,
}

sdp.processors.RestorePCForMLS[source]#

Recovers original text from the MLS Librivox texts.

This processor can be used to restore punctuation and capitalization for the MLS data. Uses the original data in https://dl.fbaipublicfiles.com/mls/lv_text.tar.gz. Saves recovered text in restored_text_field field. If text was not recovered, restored_text_field will be equal to n/a.

Parameters:
  • language_long (str) – the full name of the language, used for choosing the folder of the contents of “https://dl.fbaipublicfiles.com/mls/lv_text.tar.gz”. E.g., “english”, “spanish”, “italian”, etc.

  • language_short (str or None) – the short name of the language, used for specifying the normalizer we want to use. E.g., “en”, “es”, “it”, etc. If set to None, we will not try to normalize the provided Librivox text.

  • lv_text_dir (str) – the directory where the contents of https://dl.fbaipublicfiles.com/mls/lv_text.tar.gz will be saved.

  • submanifests_dir (str) – the directory where submanifests (one for each combo of speaker + book) will be stored.

  • restored_submanifests_dir (str) – the directory where restored submanifests (one for each combo of speaker + book) will be stored.

  • restored_text_field (str) – the field where the recovered text will be stored.

  • n_jobs (int) – number of jobs to use for parallel processing. Defaults to -1.

  • show_conversion_breakdown (bool) – whether to show how much of each submanifest was restored. Defaults to True.

Returns:

All the same data as in the input manifest with an additional key:

<restored_text_field>: <restored text or n/a if match was not found>

VoxPopuli#

sdp.processors.CreateInitialManifestVoxpopuli[source]#

Processor to create initial manifest for the VoxPopuli dataset.

Dataset link: https://github.com/facebookresearch/voxpopuli

Downloads and unzips raw VoxPopuli data for the specified language, and creates an initial manifest using the transcripts provided in the raw data.

Note

This processor will install a couple of Python packages, including PyTorch, so it might be a good idea to run it in an isolated Python environment.

Parameters:
  • raw_data_dir (str) – the directory where the downloaded data will be/is saved.

  • language_id (str) – the language of the data you wish to be downloaded. E.g., “en”, “es”, “it”, etc.

  • data_split (str) – “train”, “dev” or “test”.

  • resampled_audio_dir (str) – the directory where the resampled wav files will be stored.

  • target_samplerate (int) – sample rate (Hz) to use for resampling. Defaults to 16000.

  • target_nchannels (int) – number of channels to create during resampling process. Defaults to 1.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription (with provided normalization)>,
    "raw_text": <original transcription (without normalization)>,
    "speaker_id": <speaker id>,
    "gender": <speaker gender>,
    "age": <speaker age>,
    "is_gold_transcript": <whether the transcript has been verified>,
    "accent": <speaker accent, if known>,
}

sdp.processors.NormalizeFromNonPCTextVoxpopuli[source]#

Tries to restore punctuation and capitalization from the un-normalized text version.

VoxPopuli contains two versions of each transcription: the original (non-normalized, but with punctuation and capitalization) and a normalized one (without punctuation or capitalization, but with digits and other forms normalized). This processor can be used to map between the two versions and produce a normalized version with restored punctuation and capitalization.

Note

The current mapping logic is highly heuristic and might not work for all languages. The processor will return n/a for any text it was not able to restore, so make sure you check how much data was removed and consider updating the heuristics to retain more data.

Parameters:
  • restored_text_field (str) – the field where the recovered text (or n/a) will be stored. Defaults to “text”.

  • raw_text_key (str) – which field contains the original text without normalization. Defaults to “raw_text”.

  • norm_text_key (str) – which field contains the normalized text. Defaults to “provided_norm_text”.

Returns:

All the same data as in the input manifest with an additional key:

<restored_text_field>: <restored text or n/a if mapping failed>

CORAAL#

sdp.processors.CreateInitialManifestCORAAL[source]#

Processor to create initial manifest for the Corpus of Regional African American Language (CORAAL) dataset.

Dataset link: https://oraal.uoregon.edu/coraal/

Will download all files, extract tars and split wav files based on the provided durations in the transcripts.

Parameters:
  • raw_data_dir (str) – where to put raw downloaded data.

  • resampled_audio_dir (str) – where to put re-sampled and trimmed wav files.

  • target_samplerate (int) – sample rate to resample to. Defaults to 16000.

  • target_nchannels (int) – target number of channels. Defaults to 1.

  • drop_pauses (bool) – if True, will drop all transcriptions that contain only silence (indicated by (pause X) in the transcript). Defaults to True.

  • group_duration_threshold (float) – can be used to group consecutive utterances from the same speaker into longer segments. Set to 0 to disable this grouping (but note that many utterances are only a few seconds long, so grouping is generally advised). Defaults to 20.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription>,
    "original_file": <name of the original file in the dataset this audio came from>,
    "speaker": <speaker id>,
    "is_interviewee": <whether this is an interviewee (accented speech)>,
    "gender": <speaker gender>,
    "age": <speaker age>,
    "education": <speaker education>,
    "occupation": <speaker occupation>,
}

sdp.processors.TrainDevTestSplitCORAAL[source]#

Custom train-dev-test split for CORAAL dataset.

Split is done speaker-wise, so the same speakers don’t appear in different splits.

Parameters:

  • data_split (str) – “train”, “dev” or “test”.

Returns:

All the same fields as in the input manifest, but only a subset of the data is retained.

Librispeech#

sdp.processors.CreateInitialManifestLibrispeech[source]#

Processor to create initial manifest for the Librispeech dataset.

Dataset link: https://openslr.org/12

Will download all files, extract tars, and create a manifest file with the “audio_filepath” and “text” fields.

Parameters:
  • splits (list[str]) –

    Which datasets or their combinations should be processed. Options are:

    • ["dev-clean"]

    • ["dev-other"]

    • ["test-clean"]

    • ["test-other"]

    • ["train-clean-100"]

    • ["train-clean-360"]

    • ["train-other-500"]

    • ["all"] (for all datasets available)

  • raw_data_dir (str) – Path to the folder where the data archive should be downloaded and extracted.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "text": <transcription>,
}

SLR83#

sdp.processors.CreateInitialManifestSLR83[source]#

Processor to create initial manifest for the SLR83 dataset.

This is a dataset introduced in Open-source Multi-speaker Corpora of the English Accents in the British Isles.

Parameters:
  • raw_data_dir (str) – where to put raw downloaded data.

  • dialect (str) –

    should be one of the following:

    • irish_english_male

    • midlands_english_female

    • midlands_english_male

    • northern_english_female

    • northern_english_male

    • scottish_english_female

    • scottish_english_male

    • southern_english_female

    • southern_english_male

    • welsh_english_female

    • welsh_english_male

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription>,
}

sdp.processors.CustomDataSplitSLR83[source]#

Splits SLR83 data into train, dev or test subset.

The original paper does not provide train/dev/test splits, so we include a custom processor that can be used as a standardized split to compare results. For more details on this data split, see Damage Control During Domain Adaptation for Transducer Based Automatic Speech Recognition.

Note

All data dropping has to be done before the split. We check that the total number of files matches what is expected in the reference split. If you add any custom pre-processing that changes the duration or number of files, your splits will likely be different.

Parameters:
Returns:

All the same fields as in the input manifest, but only a subset of the data is retained.

MTEDx#

sdp.processors.CreateInitialManifestMTEDX[source]#

Processor to create initial manifest for the Multilingual TEDx (MTEDx) dataset.

Dataset link: https://www.openslr.org/100/

Downloads dataset for the specified language and creates initial manifest with the provided audio and vtt files.

Parameters:
  • raw_data_dir (str) – the directory where the downloaded data will be/is saved. This is also where the extracted and processed data will be.

  • data_split (str) – “train”, “dev” or “test”.

  • language_id (str) – the ID of the language of the data. E.g., “en”, “es”, “it”, etc.

  • target_samplerate (int) – sample rate (Hz) to use for resampling.

  • already_extracted (bool) – if True, we will not try to extract the raw data. Defaults to False.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "vtt_filepath": <path to the corresponding vtt file>,
    "duration": <duration of the audio in seconds>,
}

CORAA#

sdp.processors.CreateInitialManifestCORAA[source]#

Processor to create initial manifest for the CORAA ASR dataset.

Dataset link: https://github.com/nilc-nlp/CORAA

Parameters:
  • raw_data_dir (str) – the path to the directory in which all the data will be downloaded.

  • extract_archive_dir (str) – directory where the extracted data will be saved.

  • data_split (str) – “train”, “dev” or “test”.

  • resampled_audio_dir (str) – the directory where the resampled wav files will be stored.

  • already_extracted (bool) – if True, we will not try to extract the raw data. Defaults to False.

  • already_downloaded (bool) – if True, we will not try to download files.

  • target_samplerate (int) – sample rate (Hz) to use for resampling. Defaults to 16000.

  • target_nchannels (int) – number of channels to create during resampling process. Defaults to 1.

  • exclude_dataset (list) – list of the dataset names that will be excluded when creating the initial manifest. Options: ‘SP2010’, ‘C-ORAL-BRASIL I’, ‘NURC-Recife’, ‘TEDx Talks’, ‘ALIP’.

FLEURS#

sdp.processors.CreateInitialManifestFleurs[source]#

Processor to create initial manifest for the FLEURS dataset.

Dataset link: https://huggingface.co/datasets/google/fleurs

Will download all files, extract them, and create a manifest file with the “audio_filepath” and “text” fields.

Parameters:
  • lang (str) –

    Language to be processed, identified by a combination of ISO 639-1 and ISO 3166-1 alpha-2 codes. Examples are:

    • "hy_am" for Armenian

    • "ko_kr" for Korean

  • split (str) –

    Which dataset splits to process. Options are:

    • "test"

    • "train"

    • "dev"

  • raw_data_dir (str) – Path to the folder where the data archive should be downloaded and extracted.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "text": <transcription>,
}

Lhotse processors#

The following processors leverage Lhotse, a speech data handling library that contains data preparation recipes for 80+ publicly available datasets. Lhotse has its own data manifest format that can be largely mapped into NeMo’s format.

sdp.processors.LhotseImport[source]#

Processor to create an initial manifest imported from a Lhotse CutSet. The input_manifest_file is expected to point to a Lhotse CutSet manifest, which usually has cuts in its name and a .jsonl or .jsonl.gz extension.

Lhotse is a library for speech data processing and loading. It can be installed using pip install lhotse.

Caution

Currently we only support the importing of cut sets that represent single-channel, single-audio-file-per-utterance datasets.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription (with capitalization and punctuation)>,
}

Data enrichment#

The following processors can be used to add additional attributes to the data by running different NeMo models (e.g., ASR predictions). These attributes are typically used in the downstream processing for additional enhancement or filtering.

sdp.processors.ASRInference[source]#

This processor performs ASR inference on each utterance of the input manifest.

ASR predictions will be saved in the pred_text key.

Parameters:
  • pretrained_model (str) – the name of the pretrained NeMo ASR model which will be used to do inference.

  • batch_size (int) – the batch size to use for ASR inference. Defaults to 32.

Returns:

The same data as in the input manifest with an additional field pred_text containing ASR model’s predictions.

sdp.processors.PCInference[source]#

Adds predictions of a text-based punctuation and capitalization (P&C) model.

Operates on the text in the input_text_field, and saves predictions in the output_text_field.

Parameters:
  • input_text_field (str) – the text field that will be the input to the P&C model.

  • output_text_field (str) – the text field where the output of the P&C model will be saved.

  • batch_size (int) – the batch size used by the P&C model.

  • device (str) – the device used by the P&C model. Can be skipped to auto-select.

  • pretrained_name (str) – the pretrained_name of the P&C model.

  • model_path (str) – the model path to the P&C model.

Note

Either pretrained_name or model_path has to be specified.

Returns:

The same data as in the input manifest with an additional field <output_text_field> containing P&C model’s predictions.

sdp.processors.ASRWhisper[source]#

A simple example of transcribing audio with an ASR Whisper model from HuggingFace. There are many ways to improve it: use batch inference, split long files, return the predicted language, etc.

Parameters:
  • pretrained_model (str) – name of pretrained model on HuggingFace.

  • output_text_field (str) – field to save transcription result.

  • pad_or_trim_length (int) – audio length to pad or trim to, in number of samples. Computed as sample_rate * n_seconds, e.g., 16000 * 30 = 480000.

  • device (str) – Inference device.
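
The pad_or_trim_length arithmetic can be illustrated with a small sketch. It assumes a 16 kHz sample rate and a 30-second window, as in the example value above; `pad_or_trim` here is a hypothetical stand-in written for this page and operates on a plain Python list, not on the tensors a real model would use.

```python
def pad_or_trim(samples, length):
    """Return `samples` cut or zero-padded to exactly `length` items."""
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


sample_rate = 16000  # samples per second (assumed)
n_seconds = 30       # target window in seconds (assumed)
pad_or_trim_length = sample_rate * n_seconds

short_clip = [0.1] * (sample_rate * 2)   # 2 seconds of fake audio
long_clip = [0.1] * (sample_rate * 45)   # 45 seconds of fake audio

padded = pad_or_trim(short_clip, pad_or_trim_length)   # zeros appended
trimmed = pad_or_trim(long_clip, pad_or_trim_length)   # tail cut off
```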

sdp.processors.ASRTransformers[source]#

Processor to transcribe using ASR Transformers model from HuggingFace.

Parameters:
  • pretrained_model (str) – name of pretrained model on HuggingFace.

  • output_text_key (str) – Key to save transcription result.

  • input_audio_key (str) – Key to read audio file. Defaults to “audio_filepath”.

  • input_duration_key (str) – Audio duration key. Defaults to “duration”.

  • device (str) – Inference device.

  • batch_size (int) – Inference batch size. Defaults to 1.

  • torch_dtype (str) – Tensor data type. Defaults to “float32”.

Text-only processors#

Note

All processors in this section accept additional parameter text_key (defaults to “text”) to control which field is used for modifications/filtering.

sdp.processors.ReadTxtLines[source]#

The text file whose path is stored in the input_file_key field will be read, and each line in it will be added as a separate entry in the output manifest, saved in the field text_key.

Parameters:
  • input_file_key (str) – The key in the manifest containing the input txt file path.

  • text_key (str) – The key to store the read text lines in the manifest.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.

Data modifications#

sdp.processors.SubRegex[source]#

Replaces each regex match with a string, as defined by the entries of regex_params_list.

Before applying regex changes, we will add a space character to the beginning and end of the text and pred_text keys for each data entry. After the regex changes, the extra spaces are removed. This includes the spaces in the beginning and end of the text, as well as any double spaces "  ".

Parameters:
  • regex_params_list (list[dict]) – list of dicts. Each dict must contain a pattern and a repl key, and optionally a count key (by default, count will be 0). This processor will go through the list in order, and apply a re.sub operation on the input text in data_entry[self.text_key], feeding in the specified pattern, repl and count parameters to re.sub.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

Returns:

The same data as in the input manifest with <text_key> field changed.
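
The padding-and-substitution behavior described above can be sketched with the standard `re` module. This is an illustration under stated assumptions, not the processor's source: `apply_regex_params` is a hypothetical helper, and collapsing every run of spaces to a single space is one plausible reading of "any double spaces".

```python
import re


def apply_regex_params(text, regex_params_list):
    """Apply each {pattern, repl, count} dict in order to a space-padded text."""
    text = " " + text + " "  # padding lets patterns match on word boundaries via spaces
    for params in regex_params_list:
        text = re.sub(
            params["pattern"],
            params["repl"],
            text,
            count=params.get("count", 0),  # 0 means "replace all matches"
        )
    # Remove the padding and collapse repeated spaces (assumed cleanup step).
    return re.sub(r" +", " ", text).strip()


entry = {"text": "mr smith  went home"}
entry["text"] = apply_regex_params(
    entry["text"],
    [
        {"pattern": " mr ", "repl": " mister "},
        {"pattern": "home", "repl": "away", "count": 1},
    ],
)
```

Because of the padding, a pattern such as " mr " also matches at the very start of the utterance.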

sdp.processors.SubMakeLowercase[source]#

Processor to convert text to lowercase.

Parameters:
  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

Returns:

The same data as in the input manifest with <text_key> field changed.

sdp.processors.MakeLettersUppercaseAfterPeriod[source]#

Can be used to replace the character that follows a punctuation mark with its upper-case version.

Parameters:
  • punctuation (str) – string with all punctuation characters to consider. Defaults to “.!?”.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

Returns:

The same data as in the input manifest with <text_key> field changed.
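
A minimal sketch of this kind of capitalization, assuming the letter to capitalize is separated from the punctuation mark by whitespace (the page does not say how the processor treats spacing). `uppercase_after_punctuation` is a hypothetical name used only for illustration.

```python
import re


def uppercase_after_punctuation(text, punctuation=".!?"):
    """Upper-case the first word character after any of the given punctuation marks."""
    # Group 1: punctuation mark plus following whitespace; group 2: the next character.
    pattern = re.compile(r"([" + re.escape(punctuation) + r"]\s+)(\w)")
    return pattern.sub(lambda m: m.group(1) + m.group(2).upper(), text)


result = uppercase_after_punctuation("hello. how are you? fine! ok")
only_periods = uppercase_after_punctuation("a. b? c", punctuation=".")
```

Note that the first letter of the utterance is left untouched, since nothing precedes it.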

sdp.processors.SplitLineBySentence[source]#

Processor for splitting lines of text into sentences based on a specified pattern. One line containing N sentences will be transformed into N lines containing one sentence.

Parameters:
  • text_key (str) – The field containing the text lines in the dataset.

  • end_pattern (str) – The regular expression pattern to identify sentence boundaries.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
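
The one-line-to-N-lines transformation can be sketched as follows. The default end_pattern shown here and the choice to keep the end mark with its sentence are assumptions made for the example; `split_by_sentence` is a hypothetical helper, not the processor itself.

```python
import re


def split_by_sentence(entry, text_key="text", end_pattern=r"[.!?]"):
    """Turn one entry holding N sentences into N entries holding one sentence each."""
    text = entry[text_key]
    sentences = []
    start = 0
    for match in re.finditer(end_pattern, text):
        sentence = text[start:match.end()].strip()  # keep the end mark with the sentence
        if sentence:
            sentences.append(sentence)
        start = match.end()
    tail = text[start:].strip()  # text after the last end mark, if any
    if tail:
        sentences.append(tail)
    # Copy all other fields unchanged into every new entry.
    return [{**entry, text_key: s} for s in sentences]


entry = {"id": 1, "text": "Hi there. How are you? Fine"}
split_entries = split_by_sentence(entry)
```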

sdp.processors.CountNumWords[source]#

Processor for counting the number of words in the text_key field and saving the number in num_words_key.

Parameters:
  • text_key (str) – The field containing the input text in the dataset.

  • num_words_key (str) – The field to store the number of words in the dataset.

  • alphabet (str) – Characters to be used to count words. Any other characters are substituted by whitespace and not taken into account.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
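
The effect of the alphabet parameter can be sketched in a few lines. This assumes no case folding and that words are separated by whitespace after the substitution; `count_num_words` is a hypothetical helper written for this example.

```python
def count_num_words(text, alphabet):
    """Count whitespace-separated words after blanking out non-alphabet characters."""
    cleaned = "".join(ch if ch in alphabet else " " for ch in text)
    return len(cleaned.split())


latin = "abcdefghijklmnopqrstuvwxyz"
entry = {"text": "hello, world! 42 times"}
# Digits and punctuation are replaced by spaces, so "42" does not count as a word.
entry["num_words"] = count_num_words(entry["text"], latin)
```

With this reading, an apostrophe that is not in the alphabet splits a word in two, so "it's" would count as two words.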

Data filtering#

sdp.processors.DropIfRegexMatch[source]#

Drops utterances if text matches a regex pattern.

Before applying regex checks, we will add a space character to the beginning and end of the text and pred_text keys for each data entry. After the regex checks, assuming the utterance isn’t dropped, the extra spaces are removed. This includes the spaces in the beginning and end of the text, as well as any double spaces "  ".

Parameters:
  • regex_patterns (list[str]) – a list of strings. The list will be traversed in order. If data_entry.data[self.text_key] matches the regex, the entry will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropIfNoneOfRegexMatch[source]#

Drops utterances if data[self.text_key] does not match any of regex_patterns.

Before applying regex checks, we will add a space character to the beginning and end of the text and pred_text keys for each data entry. After the regex checks, assuming the utterance isn’t dropped, the extra spaces are removed. This includes the spaces in the beginning and end of the text, as well as any double spaces "  ".

Parameters:
  • regex_patterns (list[str]) – If data_entry[self.text_key] does not match any of the regex patterns in the list, that utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

Returns:

The same data as in the input manifest with some entries dropped.
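
The two regex filters in this section can be contrasted with a short sketch. It assumes patterns are tested with `re.search` against the space-padded text; both functions are hypothetical helpers written for illustration.

```python
import re


def _padded(text):
    """Add the leading and trailing space described above."""
    return " " + text + " "


def drop_if_regex_match(entries, regex_patterns, text_key="text"):
    """Keep entries whose text matches none of the patterns."""
    return [
        e for e in entries
        if not any(re.search(p, _padded(e[text_key])) for p in regex_patterns)
    ]


def drop_if_none_of_regex_match(entries, regex_patterns, text_key="text"):
    """Keep entries whose text matches at least one pattern."""
    return [
        e for e in entries
        if any(re.search(p, _padded(e[text_key])) for p in regex_patterns)
    ]


entries = [
    {"text": "hello world"},
    {"text": "price is 5 dollars"},
    {"text": "world"},
]
no_digits = drop_if_regex_match(entries, [r"\d"])
has_world = drop_if_none_of_regex_match(entries, [" world "])
```

The padding is what lets " world " match an utterance that consists of the single word "world".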

sdp.processors.DropNonAlphabet[source]#

Drops utterances if they contain characters that are not in the alphabet.

Parameters:
  • alphabet (str) – a string containing all of the characters in our alphabet. If an utterance contains at least one character that is not in the alphabet, then that utterance will be dropped.

  • text_key (str) –

    a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

    Note

    Don’t forget to include spaces in your alphabet, unless you want to make sure none of the utterances contain spaces.

Returns:

The same data as in the input manifest with some entries dropped.
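
The note about spaces is easy to see in a small sketch. `drop_non_alphabet` below is a hypothetical helper that treats the alphabet as a set of allowed characters; the sample texts are invented.

```python
def drop_non_alphabet(entries, alphabet, text_key="text"):
    """Keep only entries whose text uses characters from the alphabet exclusively."""
    allowed = set(alphabet)
    return [e for e in entries if set(e[text_key]) <= allowed]


entries = [
    {"text": "hello world"},
    {"text": "héllo"},
    {"text": "it costs 5"},
]
# Space included in the alphabet: multi-word utterances can survive.
with_space = drop_non_alphabet(entries, "abcdefghijklmnopqrstuvwxyz ")
# Space left out: every utterance containing a space is dropped.
without_space = drop_non_alphabet(entries, "abcdefghijklmnopqrstuvwxyz")
```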

sdp.processors.DropOnAttribute[source]#

Drops utterances if attribute is set to True/False.

Parameters:
  • key (str) – which key to use for dropping utterances.

  • drop_if_false (bool) – whether to drop if value is False. Defaults to dropping if True.

Returns:

The same data as in the input manifest with some entries dropped.

ASR-based processors#

Note

All processors in this section depend on sdp.processors.ASRInference, so make sure to include it in the config at some prior stage with an applicable ASR model.

Note

All processors in this section accept additional parameters text_key (defaults to “text”) and pred_text_key (defaults to “pred_text”) to control which fields contain transcription and ASR model predictions.

Data modifications#

sdp.processors.SoxConvert[source]#

Processor for converting audio files from one format to another using Sox, and updating the dataset with the path to the converted audio files.

Parameters:
  • converted_audio_dir (str) – Directory to store the converted audio files.

  • input_audio_file_key (str) – Field in the dataset representing the path to input audio files.

  • output_audio_file_key (str) – Field to store the path to the converted audio files in the dataset.

  • output_format (str) – Format of the output audio files (e.g., ‘wav’, ‘mp3’).

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.

sdp.processors.InsIfASRInsertion[source]#

Processor that adds substrings to transcription if they are present in ASR predictions.

Will insert substrings into data[self.text_key] if they are present at that location in data[self.pred_text_key]. It is useful if words are systematically missing from ground truth transcriptions.

Parameters:
  • insert_words (list[str]) – list of strings that will be inserted into data[self.text_key] if there is an insertion (containing only that string) in data[self.pred_text_key].

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) –

    a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

    Note

    Because this processor looks for an exact match in the insertion, we recommend including variations with different spaces in insert_words, e.g. [' nemo', 'nemo ', ' nemo '].

Returns:

The same data as in the input manifest with <text_key> field changed.

sdp.processors.SubIfASRSubstitution[source]#

Processor that substitutes substrings in the transcription if they are present in ASR predictions.

Will convert a substring in data[self.text_key] to a substring in data[self.pred_text_key] if both are located in the same place (i.e., are part of a ‘substitution’ operation) and if the substrings correspond to key-value pairs in sub_words. This is useful if words are systematically incorrect in ground truth transcriptions.

Before starting to look for substitution, this processor adds spaces at the beginning and end of data[self.text_key] and data[self.pred_text_key], to ensure that an argument like sub_words = {"nmo ": "nemo "} would cause a substitution to be made even if the original data[self.text_key] ends with "nmo" and data[self.pred_text_key] ends with "nemo".

Parameters:
  • sub_words (dict) – dictionary where a key is a string that might be in data[self.text_key] and the value is the string that might be in data[self.pred_text_key]. If both are located in the same place (i.e. are part of a ‘substitution’ operation) then the key string will be converted to the value string in data[self.text_key].

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) –

    a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

    Note

    This processor looks for exact string matches of substitutions, so you may need to be careful with spaces in sub_words. E.g. it is recommended to do sub_words = {"nmo ": "nemo "} instead of sub_words = {"nmo" : "nemo"}.

Returns:

The same data as in the input manifest with <text_key> field changed.

Data filtering#

sdp.processors.PreserveByValue[source]#

Processor for preserving dataset entries based on a specified condition involving a target value and an input field.

Parameters:
  • input_value_key (str) – The field in the dataset entries to be evaluated.

  • target_value (Union[int, str]) – The value to compare with the input field.

  • operator (str) – (Optional) The operator to apply for comparison. Options: “lt” (less than), “le” (less than or equal to), “eq” (equal to), “ne” (not equal to), “ge” (greater than or equal to), “gt” (greater than). Defaults to “eq”.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
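
The operator names listed above map directly onto functions in Python's standard `operator` module. The sketch below assumes the entry's value is the left operand and target_value the right one; `preserve_by_value` is a hypothetical helper, and the durations are invented.

```python
import operator

# Operator names from the parameter description mapped to stdlib comparison functions.
OPERATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "gt": operator.gt,
}


def preserve_by_value(entries, input_value_key, target_value, operator_name="eq"):
    """Keep entries for which `entry[input_value_key] <op> target_value` holds."""
    compare = OPERATORS[operator_name]
    return [e for e in entries if compare(e[input_value_key], target_value)]


entries = [{"duration": 1.5}, {"duration": 12.0}, {"duration": 25.0}]
short_enough = preserve_by_value(entries, "duration", 20, "le")
exactly_twelve = preserve_by_value(entries, "duration", 12.0)
```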

sdp.processors.DropASRError[source]#

Drops utterances if there is a sufficiently long ASR mismatch anywhere in the utterance.

Parameters:
  • consecutive_words_threshold (int) – will drop if there is a mismatch of at least this many words in a row.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

Returns:

The same data as in the input manifest with some entries dropped.
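
One way to picture the consecutive-word criterion is a word-level diff. The sketch below uses `difflib.SequenceMatcher` from the standard library, which is a swapped-in technique chosen for illustration and may align text differently from the processor's own diffing; both function names are hypothetical.

```python
from difflib import SequenceMatcher


def longest_word_mismatch(text, pred_text):
    """Length, in words, of the longest non-matching run between the two texts."""
    ref, hyp = text.split(), pred_text.split()
    longest = 0
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count the longer side of the replace/insert/delete block.
            longest = max(longest, i2 - i1, j2 - j1)
    return longest


def drop_asr_error(entries, consecutive_words_threshold,
                   text_key="text", pred_text_key="pred_text"):
    """Keep entries whose longest mismatch is shorter than the threshold."""
    return [
        e for e in entries
        if longest_word_mismatch(e[text_key], e[pred_text_key]) < consecutive_words_threshold
    ]


entries = [
    {"text": "the cat sat", "pred_text": "the cat sat"},
    {"text": "the cat sat", "pred_text": "the bat sat"},
    {"text": "the cat sat on the mat", "pred_text": "the dog ran over the mat"},
]
kept = drop_asr_error(entries, consecutive_words_threshold=2)
```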

sdp.processors.DropASRErrorBeginningEnd[source]#

Drops utterances if there is a sufficiently long ASR mismatch at the beginning or end of the utterance.

Parameters:
  • beginning_error_char_threshold (int) – if there is an insertion or deletion at the beginning of the utterance that has more characters than this number, then the utterance will be dropped. If there is a substitution at the beginning of the utterance, then the utterance will be dropped if abs(len(deletion) - len(insertion)) > beginning_error_char_threshold.

  • end_error_char_threshold (int) – if there is an insertion or deletion at the end of the utterance that has more characters than this number, then the utterance will be dropped. If there is a substitution at the end of the utterance, then the utterance will be dropped if abs(len(deletion) - len(insertion)) > end_error_char_threshold.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropIfSubstringInInsertion[source]#

Drops utterances if a substring matches an ASR insertion.

Insertions are checked between data[self.text_key] and data[self.pred_text_key].

Note

We check for exact matches, so you need to be mindful of spaces, e.g. you may wish to do substrings_in_insertion = ["nemo "] instead of substrings_in_insertion = ["nemo"].

Parameters:
  • substrings_in_insertion (list[str]) – a list of strings which might be inserted in predicted ASR text. If the insertion matches a string exactly, the utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropHighCER[source]#

Drops utterances if there is a sufficiently high character-error-rate (CER).

CER is measured between data[self.text_key] and data[self.pred_text_key].

Note

We only drop the utterance if CER > threshold (i.e. strictly greater than) so that if we set the threshold to 0, we will not remove utterances with CER == 0.

Parameters:
  • cer_threshold (float) – CER threshold above which the utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

Returns:

The same data as in the input manifest with some entries dropped.
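
The strict comparison described in the note can be illustrated with a minimal, self-contained sketch. It does not use SDP's own code: `edit_distance` and `should_drop_high_cer` are hypothetical helpers, and expressing CER as a percentage of the reference length is an assumption.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def should_drop_high_cer(entry: dict, cer_threshold: float,
                         text_key: str = "text",
                         pred_text_key: str = "pred_text") -> bool:
    """Return True if the entry would be dropped (hypothetical helper)."""
    ref, hyp = entry[text_key], entry[pred_text_key]
    # Assumption: CER is expressed as a percentage of the reference length.
    cer = 100.0 * edit_distance(ref, hyp) / max(len(ref), 1)
    # Strictly greater than, so a threshold of 0 keeps perfect matches.
    return cer > cer_threshold
```

With `cer_threshold=0`, an entry whose prediction equals the transcript is kept, while any single character error causes a drop.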

sdp.processors.DropHighWER[source]#

Drops utterances if there is a sufficiently high word-error-rate (WER).

WER is measured between data[self.text_key] and data[self.pred_text_key].

Note

We only drop the utterance if WER > threshold (i.e. strictly greater than) so that if we set the threshold to 0, we will not remove utterances with WER == 0.

Parameters:
  • wer_threshold (float) – WER threshold above which the utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropLowWordMatchRate[source]#

Drops utterances if there is a sufficiently low word-match-rate (WMR).

WMR is measured between data[self.text_key] and data[self.pred_text_key].

Note

We only drop the utterance if WMR < threshold (i.e. strictly lower than) so that if we set the threshold to 100, we will not remove utterances with WMR == 100.

Parameters:
  • wmr_threshold (float) – WMR threshold below which the utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropHighLowCharrate[source]

Drops utterances if their character rate is too low or too high.

Character rate = (num of characters in self.text_key) / (duration of audio). A too-low or too-high character rate often implies that the ground truth transcription might be inaccurate.

Parameters:
  • high_charrate_threshold (float) – upper character rate threshold. If the character rate of an utterance is higher than this number, the utterance will be dropped.

  • low_charrate_threshold (float) – lower character rate threshold. If the character rate of an utterance is lower than this number, the utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

Returns:

The same data as in the input manifest with some entries dropped.
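
As a rough illustration of the character rate formula above, the following sketch applies both thresholds to a single entry. The helper name `should_drop_by_charrate` and the use of a `duration` key are assumptions made for this example, not part of the SDP API.

```python
def should_drop_by_charrate(entry: dict,
                            high_charrate_threshold: float,
                            low_charrate_threshold: float,
                            text_key: str = "text") -> bool:
    """Return True if the entry's character rate is outside the allowed range."""
    # Assumption: the audio duration (in seconds) is stored under "duration".
    charrate = len(entry[text_key]) / entry["duration"]
    return charrate > high_charrate_threshold or charrate < low_charrate_threshold
```

For example, 11 characters spoken over 0.1 seconds gives 110 characters per second, which would usually indicate a transcript that does not belong to the audio.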

sdp.processors.DropHighLowWordrate[source]#

Drops utterances if their word rate is too low or too high.

Word rate = (num of words in self.text_key) / (duration of audio). A too-low or too-high word rate often implies that the ground truth transcription might be inaccurate.

Parameters:
  • high_wordrate_threshold (float) – upper word rate threshold. If the word rate of an utterance is higher than this number, the utterance will be dropped.

  • low_wordrate_threshold (float) – lower word rate threshold. If the word rate of an utterance is lower than this number, the utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropHighLowDuration[source]#

Drops utterances if their duration is too low or too high.

Parameters:
  • high_duration_threshold (float) – upper duration threshold (in seconds). If the duration of an utterance’s audio is higher than this number, the utterance will be dropped.

  • low_duration_threshold (float) – lower duration threshold (in seconds). If the duration of an utterance’s audio is lower than this number, the utterance will be dropped.

  • duration_key (str) – a string indicating which key of the data entries should be used to find the utterance duration. Defaults to “duration”.

Returns:

The same data as in the input manifest with some entries dropped.

Miscellaneous#

sdp.processors.AddConstantFields[source]#

This processor adds constant fields to all manifest entries.

E.g., it can be useful to add a fixed label: <language> field for downstream language identification model training.

Parameters:

fields

dictionary with any additional information to add. E.g.:

fields = {
    "label": "en",
    "metadata": "mcv-11.0-2022-09-21",
}

Returns:

The same data as in the input manifest with added fields as specified in the fields input dictionary.

sdp.processors.CombineSources[source]#

Can be used to create a single field from two alternative sources.

E.g.:

_target_: sdp.processors.CombineSources
sources:
    - field: text_pc
      origin_label: original
    - field: text_pc_pred
      origin_label: synthetic
    - field: text
      origin_label: no_pc
target: text

will populate the text field with data from the text_pc field if it’s present and not equal to n/a (can be customized). If text_pc is not available, it will populate text from the text_pc_pred field, following the same rules. If neither is available, it will fall back to the text field itself. In all cases it will specify which source was used in the text_origin field by using the label from the origin_label field. If none of the sources is available, it will populate both the target and the origin fields with n/a.

Parameters:
  • sources (list[dict]) –

    list of the sources to use in order of preference. Each element in the list should be in the following format:

    {
        field: <which field to take the data from>
        origin_label: <what to write in the "<target>_origin" field>
    }
    

  • target (str) – target field that we are populating.

  • na_indicator (str) – if any source field has text equal to the na_indicator it will be considered as not available. If none of the sources are present, this will also be used as the value for the target and origin fields. Defaults to n/a.

Returns:

The same data as in the input manifest enhanced with the following fields:

<target>: <populated with data from either <source1> or <source2> or with <na_indicator> if none are available>
<target>_origin: <label that marks where the data came from>
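
The fallback order described above can be sketched in plain Python. This is only an illustration of the documented behaviour for a single manifest entry; the function name `combine_sources` is hypothetical and the actual processor may differ in details.

```python
def combine_sources(entry: dict, sources: list, target: str,
                    na_indicator: str = "n/a") -> dict:
    """Fill `target` from the first available source, in order of preference."""
    result = dict(entry)
    for source in sources:
        value = entry.get(source["field"])
        # A source counts as available if it is present and not the NA marker.
        if value is not None and value != na_indicator:
            result[target] = value
            result[f"{target}_origin"] = source["origin_label"]
            return result
    # No source was available: mark both fields with the NA indicator.
    result[target] = na_indicator
    result[f"{target}_origin"] = na_indicator
    return result


SOURCES = [
    {"field": "text_pc", "origin_label": "original"},
    {"field": "text_pc_pred", "origin_label": "synthetic"},
    {"field": "text", "origin_label": "no_pc"},
]
```

With these sources, an entry whose text_pc is n/a but whose text_pc_pred is set ends up with text_origin equal to synthetic.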

sdp.processors.DuplicateFields[source]#

This processor duplicates fields in all manifest entries.

It is useful for when you want to do downstream processing of a variant of the entry. E.g. make a copy of “text” called “text_no_pc”, and remove punctuation from “text_no_pc” in downstream processors.

Parameters:

duplicate_fields (dict) – dictionary where keys are the original fields to be copied and their values are the new names of the duplicate fields.

Returns:

The same data as in the input manifest with duplicated fields as specified in the duplicate_fields input dictionary.
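
A minimal sketch of what duplicating a field means for one manifest entry is shown below. The helper name `apply_duplicate_fields` is hypothetical and only mirrors the documented `duplicate_fields` mapping.

```python
def apply_duplicate_fields(entry: dict, duplicate_fields: dict) -> dict:
    """Copy each original field to a new field name, keeping the original."""
    result = dict(entry)
    for original_name, copy_name in duplicate_fields.items():
        result[copy_name] = entry[original_name]
    return result


entry = {"audio_filepath": "a.wav", "text": "Hello, world."}
copied = apply_duplicate_fields(entry, {"text": "text_no_pc"})
# `copied` now holds both "text" and "text_no_pc" with the same value;
# a downstream processor could then strip punctuation from "text_no_pc" only.
```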

sdp.processors.RenameFields[source]#

This processor renames fields in all manifest entries.

Parameters:

rename_fields – dictionary where keys are the fields to be renamed and their values are the new names of the fields.

Returns:

The same data as in the input manifest with renamed fields as specified in the rename_fields input dictionary.

sdp.processors.SplitOnFixedDuration[source]#

This processor splits audio into fixed-length segments.

It does not actually create different audio files, but simply adds corresponding offset and duration fields. These fields can be automatically processed by NeMo to split audio on the fly during training.

Parameters:
  • segment_duration (float) – fixed desired duration of each segment.

  • drop_last (bool) – whether to drop the last segment if total duration is not divisible by desired segment duration. If False, the last segment will be of a different length which is < segment_duration. Defaults to True.

  • drop_text (bool) – whether to drop text from entries as it is most likely inaccurate after the split on duration. Defaults to True.

Returns:

The same data as in the input manifest but all audio that’s longer than the segment_duration will be duplicated multiple times with additional offset and duration fields. If drop_text=True will also drop text field from all entries.
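
The offset and duration bookkeeping can be sketched as follows. This is an illustration under assumptions, not the processor's code: the function name is hypothetical, and the handling of audio shorter than `segment_duration` (dropped here when `drop_last=True`) is an assumption.

```python
def split_on_fixed_duration(entry: dict, segment_duration: float,
                            drop_last: bool = True,
                            drop_text: bool = True) -> list:
    """Return one manifest entry per segment, with offset and duration set."""
    total = entry["duration"]
    n_full = int(total // segment_duration)
    # (offset, duration) pairs for all full-length segments.
    segments = [(i * segment_duration, segment_duration) for i in range(n_full)]
    remainder = total - n_full * segment_duration
    if remainder > 0 and not drop_last:
        # The last, shorter segment is kept only if drop_last is False.
        segments.append((n_full * segment_duration, remainder))

    output = []
    for offset, duration in segments:
        new_entry = {k: v for k, v in entry.items()
                     if not (drop_text and k == "text")}
        new_entry["offset"] = offset
        new_entry["duration"] = duration
        output.append(new_entry)
    return output
```

For a 25-second file and `segment_duration=10.0`, this yields two entries by default and a third 5-second entry when `drop_last=False`.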

sdp.processors.ChangeToRelativePath[source]#

This processor changes the audio filepaths to be relative.

Parameters:

base_dir – typically a folder where the manifest file is going to be stored. All paths will be relative to that folder.

Returns:

The same data as in the input manifest with audio_filepath key changed to contain relative path to the base_dir.
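
The effect on a single entry can be sketched with the standard library's `os.path.relpath`. Whether the processor uses this exact call is an assumption, and the helper name `change_to_relative_path` is hypothetical.

```python
import os


def change_to_relative_path(entry: dict, base_dir: str) -> dict:
    """Rewrite audio_filepath so that it is relative to base_dir."""
    result = dict(entry)
    result["audio_filepath"] = os.path.relpath(entry["audio_filepath"], base_dir)
    return result


entry = {"audio_filepath": "/data/mcv/audio/clip_0001.wav", "duration": 3.2}
relative = change_to_relative_path(entry, "/data/mcv")
```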

sdp.processors.SortManifest[source]#

Processor which will sort the manifest by some specified attribute.

Parameters:
  • attribute_sort_by (str) – the attribute by which the manifest will be sorted.

  • descending (bool) – if set to False, attribute will be in ascending order. If True, attribute will be in descending order. Defaults to True.

Returns:

The same entries as in the input manifest, but sorted based on the provided parameters.
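
In plain Python, the sorting behaviour corresponds roughly to the sketch below. The function name `sort_manifest` is hypothetical; only the two documented parameters are taken from the description above.

```python
def sort_manifest(entries: list, attribute_sort_by: str,
                  descending: bool = True) -> list:
    """Return the entries sorted by the given attribute."""
    return sorted(entries, key=lambda e: e[attribute_sort_by], reverse=descending)


entries = [{"duration": 2.0}, {"duration": 5.5}, {"duration": 1.0}]
longest_first = sort_manifest(entries, "duration")
shortest_first = sort_manifest(entries, "duration", descending=False)
```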

sdp.processors.KeepOnlySpecifiedFields[source]#

Saves a copy of a manifest but only with a subset of the fields.

Typically will be the final processor to save only relevant fields in the desired location.

Parameters:

fields_to_keep (list[str]) – list of the fields in the input manifest that we want to retain. The output file will only contain these fields.

Returns:

The same data as in input manifest, but re-saved in the new location with only fields_to_keep fields retained.

sdp.processors.GetAudioDuration[source]#

Processor that computes the duration of the file in audio_filepath_key (using soundfile) and saves the duration in duration_key. If there is an error computing the duration, the value at duration_key will be updated with the value -1.0.

Parameters:
  • audio_filepath_key (str) – Key to get path to wav file.

  • duration_key (str) – Key under which the audio duration will be stored.

Returns:

All the same fields as in the input manifest plus duration_key

sdp.processors.FfmpegConvert[source]#

Processor for converting video or audio files to audio using FFmpeg and updating the dataset with the path to the resampled audio. If id_key is not None, the output file path will be <resampled_audio_dir>/<id_key>.wav. If id_key is None, the output file path will be <resampled_audio_dir>/<input file name without extension>.wav.

Note

id_key can be used to create subdirectories inside resampled_audio_dir (by using forward slashes /). e.g. if id_key takes the form dir_name1/dir_name2/filename, the output file path will be

<resampled_audio_dir>/dir_name1/dir_name2/filename.wav.

Parameters:
  • converted_audio_dir (str) – The directory to store the resampled audio files.

  • input_file_key (str) – The field in the dataset representing the path to the input video or audio files.

  • output_file_key (str) – The field in the dataset representing the path to the resampled audio files with output_format. If id_key is None, the output file path will be <resampled_audio_dir>/<input file name without extension>.wav.

  • id_key (str) – (Optional) The field in the dataset representing the unique ID or identifier for each entry. If id_key is not None, the output file path will be <resampled_audio_dir>/<id_key>.wav. Defaults to None.

  • output_format (str) – (Optional) Format of the output audio files. Defaults to wav.

  • target_samplerate (int) – (Optional) The target sampling rate for the resampled audio. Defaults to 16000.

  • target_nchannels (int) – (Optional) The target number of channels for the resampled audio. Defaults to 1.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
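
The output path rules described above can be sketched as follows. This covers only the path construction, not the FFmpeg call itself; the helper name `build_output_path` is hypothetical, and the directory argument is assumed to be the `converted_audio_dir` parameter.

```python
import os


def build_output_path(entry: dict, converted_audio_dir: str,
                      input_file_key: str, id_key=None,
                      output_format: str = "wav") -> str:
    """Compute where the converted audio for this entry would be written."""
    if id_key is not None:
        # The ID may contain forward slashes, which create subdirectories.
        stem = entry[id_key]
    else:
        # Fall back to the input file name without its extension.
        stem = os.path.splitext(os.path.basename(entry[input_file_key]))[0]
    return os.path.join(converted_audio_dir, f"{stem}.{output_format}")


entry = {"video": "/raw/talks/session1.mp4", "id": "dir_name1/dir_name2/filename"}
by_name = build_output_path(entry, "out", "video")
by_id = build_output_path(entry, "out", "video", id_key="id")
```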

sdp.processors.CreateInitialManifestByExt[source]#

Processor for creating an initial dataset manifest by saving filepaths with a common extension to the field specified in output_file_key.

Parameters:
  • raw_data_dir (str) – The root directory of the files to be added to the initial manifest. This processor will recursively look for files with the extension ‘extension’ inside this directory.

  • output_file_key (str) – The key to store the paths to the files in the dataset.

  • extension (str) – The file extension of the files to be added to the manifest.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.

Base classes#

This section lists all the base classes you might need to know about if you want to add new SDP processors.

BaseProcessor#

class sdp.processors.base_processor.BaseProcessor(output_manifest_file: str, input_manifest_file: str | None = None)[source]#

Bases: ABC

Abstract class for SDP processors.

All processor classes inherit from the BaseProcessor class. This is a simple abstract class which has 2 empty methods: process() and test().

These serve to remind us that SDP essentially just runs .test() on all processors (to implement run-time tests), and then .process() on all processors.

Parameters:
  • output_manifest_file (str) – path of where the output manifest file will be located. Cannot have the same value as input_manifest_file.

  • input_manifest_file (str) – path of where the input manifest file is located. This arg is optional: some processors may not take in an input manifest because they need to create an initial manifest from scratch (i.e., from some transcript file that is in a format different from the NeMo manifest format). Cannot have the same value as output_manifest_file.

abstract process()[source]#

Should be overridden by the child classes to implement some data processing.

test()[source]#

This method can be used to perform “runtime” tests.

These can be any kind of self-consistency tests, but they usually take the form of checking that provided input test data entries match provided output test data entries.

There are no tests by default.

BaseParallelProcessor#

class sdp.processors.base_processor.BaseParallelProcessor(max_workers: int = -1, chunksize: int = 100, in_memory_chunksize: int = 1000000, test_cases: List[Dict] | None = None, **kwargs)[source]#

Bases: BaseProcessor

Processor class which allows operations on each utterance to be parallelized.

Parallelization is done using tqdm.contrib.concurrent.process_map inside the process() method. Actual processing should be defined on a per-example basis inside the process_dataset_entry() method.

See the documentation of all the methods for more details.

Parameters:
  • max_workers (int) – maximum number of workers that will be spawned during the parallel processing.

  • chunksize (int) – the size of the chunks that will be sent to worker processes during the parallel processing.

  • in_memory_chunksize (int) – the maximum number of input data entries that will be read, processed and saved at a time.

  • test_cases (list[dict]) – an optional list of dicts containing test cases for checking that the processor makes the changes that we are expecting. The dicts must have a key input, the value of which is a dictionary containing data which is our test’s input manifest line, and a key output, the value of which is a dictionary containing data which is the expected output manifest line.

process()[source]#

Parallelized implementation of the data processing.

The execution flow of this method is the following.

  1. prepare() is called. It’s empty by default but can be used to e.g. download the initial data files or compute some aggregates required for subsequent processing.

  2. A for-loop begins that loops over all manifest_chunk lists yielded by the _chunk_manifest() method. _chunk_manifest() reads data entries yielded by read_manifest() and yields lists containing in_memory_chunksize data entries.

    Inside the for-loop:

    1. process_dataset_entry() is called in parallel on each element of the manifest_chunk list.

    2. All metrics are aggregated.

    3. All output data-entries are added to the contents of output_manifest_file.

    Note:

    • The default implementation of read_manifest() reads an input manifest file and returns a list of dictionaries for each line (we assume a standard NeMo format of one json per line).

    • process_dataset_entry() is called in parallel on each element of the list created in the previous step. Note that you cannot create any new counters or modify the attributes of this class in any way inside that function, as this will lead to undefined behavior. Each call to process_dataset_entry() returns a list of DataEntry objects that are then aggregated together. DataEntry simply defines data and metrics keys.

    • If data is set to None, the objects are ignored (metrics are still collected).

  3. All metrics keys that were collected in the for-loop above are passed over to finalize() for any desired metric aggregation and reporting.
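
The execution flow above can be approximated by the following sequential sketch. It is a simplification under stated assumptions: the real method runs the per-entry step in parallel and writes to `output_manifest_file`, whereas here `run_processing`, `chunk_manifest` and `drop_short` are hypothetical stand-ins that keep everything in memory.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional


@dataclass
class DataEntry:
    data: Optional[Dict]
    metrics: Any = None


def chunk_manifest(entries: Iterable, in_memory_chunksize: int) -> Iterator[List]:
    """Yield lists of at most in_memory_chunksize entries."""
    chunk = []
    for entry in entries:
        chunk.append(entry)
        if len(chunk) == in_memory_chunksize:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def run_processing(entries: Iterable,
                   process_dataset_entry: Callable[[Dict], List[DataEntry]],
                   in_memory_chunksize: int = 2):
    """Sequential stand-in for the chunked, parallel loop."""
    output, metrics = [], []
    for manifest_chunk in chunk_manifest(entries, in_memory_chunksize):
        for entry in manifest_chunk:  # done in parallel in the real method
            for data_entry in process_dataset_entry(entry):
                metrics.append(data_entry.metrics)  # metrics are always kept
                if data_entry.data is not None:     # None means "drop"
                    output.append(data_entry.data)
    return output, metrics  # metrics would then be passed to finalize()


def drop_short(entry: Dict) -> List[DataEntry]:
    """Example per-entry rule: drop entries shorter than one second."""
    if entry["duration"] < 1.0:
        return [DataEntry(data=None, metrics=entry["duration"])]
    return [DataEntry(data=entry, metrics=0.0)]
```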

Here is a diagram outlining the execution flow of this method:

prepare()[source]#

Can be used in derived classes to prepare the processing in any way.

E.g., download data or compute some aggregates. Will be called before starting processing the data.

_chunk_manifest()[source]#

Splits the manifest into smaller chunks defined by in_memory_chunksize.

read_manifest()[source]#

Reading the input manifest file.

Note

This function should be overridden in the “initial” class creating manifest to read from the original source of data.

abstract process_dataset_entry(data_entry) → List[DataEntry][source]#

Needs to be implemented in the derived classes.

Each returned value should be a DataEntry object that will hold a dictionary (or anything else that can be json-serialized) with the actual data + any additional metrics required for statistics reporting. Those metrics can be used in finalize() to prepare for final reporting.

DataEntry is a simple dataclass defined in the following way:

@dataclass
class DataEntry:
    # can be None to drop the entry
    data: Optional[Dict]
    # anything - you'd need to aggregate all
    # values in the finalize method manually
    metrics: Any = None

Note

This method should always return a list of objects to allow a one-to-many mapping. E.g., if you want to cut an utterance into multiple smaller parts, you can return a list of all the produced utterances and they will be handled correctly.

The many-to-one mapping is not currently supported by design of this method (but can still be done if you don’t inherit from this class and process the data sequentially).
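
A one-to-many mapping could look like the sketch below, written as a standalone function rather than a method so that it runs without SDP installed. The splitting rule (cutting the text on a `|` character) and the word-count metric are purely illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class DataEntry:
    # can be None to drop the entry
    data: Optional[Dict]
    metrics: Any = None


def process_dataset_entry(data_entry: Dict) -> List[DataEntry]:
    """Split one entry into several entries, one per '|'-separated part."""
    parts = [p.strip() for p in data_entry["text"].split("|") if p.strip()]
    if not parts:
        # Nothing left: drop the entry but still report a metric.
        return [DataEntry(data=None, metrics=0)]
    return [
        DataEntry(data={**data_entry, "text": part}, metrics=len(part.split()))
        for part in parts
    ]
```

Each returned `DataEntry` carries its own metric (here, the number of words), which could later be aggregated in finalize().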

Parameters:

data_entry – most often, data_entry will be a dictionary containing items which represent the JSON manifest entry. Sometimes, such as in sdp.processors.CreateInitialManifestMLS, it will be a string containing a line for that utterance from the original raw MLS transcript. In general it is an element of the list returned from the read_manifest() method.

finalize(metrics: List)[source]#

Can be used to output statistics about the processed data.

By default outputs new number of entries/hours.

Parameters:

metrics (list) – a list containing all metrics keys from the data entries returned from the process_dataset_entry() method.

test()[source]#

Applies processing to “test_cases” and raises an error in case of mismatch.

Runtime tests#

Before running the specified processors, SDP runs processor.test() on all specified processors. A test method is provided in sdp.processors.base_processor.BaseParallelProcessor.test(), which checks that for a given input data entry, the output data entry/entries produced by the processor will match the expected output data entry/entries. Note that this essentially only checks that the impact on the data manifest will be as expected. If you want to do some other checks, you will need to override this test method.

The input data entry and the expected output data entry/entries for sdp.processors.base_processor.BaseParallelProcessor.test() are specified inside the optional list of test_cases that were provided in the object constructor. This means you can provide test cases in the YAML config file, and the dataset will only be processed if the test cases pass.

This is helpful to (a) make sure that the rules you wrote have the effect you desired, and (b) demonstrate why you wrote those rules. An example of test cases we could include in the YAML config file:

- _target_: sdp.processors.DropIfRegexMatch
  regex_patterns:
      - "(\\D ){5,20}" # looks for between 4 and 19 characters surrounded by spaces
  test_cases:
      - {input: {text: "some s p a c e d out letters"}, output: null}
      - {input: {text: "normal words only"}, output: {text: "normal words only"}}
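
To see how such test cases behave, the check can be imitated in plain Python. This is a sketch only: `drop_if_regex_match` and `run_test_cases` are hypothetical simplifications, and the real DropIfRegexMatch processor may preprocess the text differently before matching.

```python
import re

# The YAML string "(\\D ){5,20}" corresponds to this raw Python pattern.
REGEX_PATTERNS = [r"(\D ){5,20}"]

TEST_CASES = [
    {"input": {"text": "some s p a c e d out letters"}, "output": None},
    {"input": {"text": "normal words only"}, "output": {"text": "normal words only"}},
]


def drop_if_regex_match(entry: dict, regex_patterns: list, text_key: str = "text"):
    """Return None (drop) if any pattern matches the text, else the entry."""
    for pattern in regex_patterns:
        if re.search(pattern, entry[text_key]):
            return None
    return entry


def run_test_cases(test_cases: list, regex_patterns: list) -> None:
    """Raise an error if any produced output differs from the expected one."""
    for case in test_cases:
        produced = drop_if_regex_match(case["input"], regex_patterns)
        if produced != case["output"]:
            raise RuntimeError(
                f"Test failed for input {case['input']}: "
                f"expected {case['output']}, got {produced}"
            )
```

Calling `run_test_cases(TEST_CASES, REGEX_PATTERNS)` returns without error here, which mirrors the idea that processing only starts once all test cases pass.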