API#

Available processors#

Here is the full list of all available processors and their supported arguments.

Note

All SDP processors optionally accept input_manifest_file and output_manifest_file keys. See Special fields section for more details.

Dataset-specific processors#

MCV#

sdp.processors.CreateInitialManifestMCV[source]#

Processor to create initial manifest for the Mozilla Common Voice (MCV) dataset.

Dataset link: https://commonvoice.mozilla.org/

Extracts raw MCV data for the specified language and creates an initial manifest using the transcripts provided in the raw data.

Parameters:
  • raw_data_dir (str) – the path to the directory containing the raw data archive file. Needs to be manually downloaded from https://commonvoice.mozilla.org/.

  • extract_archive_dir (str) – directory where the extracted data will be saved.

  • resampled_audio_dir (str) – directory where the resampled audio will be saved.

  • data_split (str) – “train”, “dev” or “test”.

  • language_id (str) – the ID of the language of the data. E.g., “en”, “es”, “it”, etc.

  • already_extracted (bool) – if True, we will not try to extract the raw data. Defaults to False.

  • target_samplerate (int) – sample rate (Hz) to use for resampling. Defaults to 16000.

  • target_nchannels (int) – number of channels to create during resampling process. Defaults to 1.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription (with capitalization and punctuation)>,
}
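
Example (a hypothetical configuration sketch; paths and values are placeholders):

- _target_: sdp.processors.CreateInitialManifestMCV
  raw_data_dir: ${workspace_dir}/raw_data
  extract_archive_dir: ${workspace_dir}/extracted
  resampled_audio_dir: ${workspace_dir}/audio
  data_split: train
  language_id: it
  target_samplerate: 16000
  target_nchannels: 1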

MLS#

sdp.processors.CreateInitialManifestMLS[source]#

Processor to create initial manifest for the Multilingual LibriSpeech (MLS) dataset.

Dataset link: https://www.openslr.org/94/

Downloads and unzips raw MLS data for the specified language, and creates an initial manifest using the transcripts provided in the raw data.

Parameters:
  • raw_data_dir (str) – the directory where the downloaded data will be/is saved. This is also where the extracted and processed data will be.

  • language (str) – the language of the data you wish to download. This is used to format the URL from which we attempt to download the data. E.g., “english”, “italian”, “spanish”, etc.

  • data_split (str) – “train”, “dev” or “test”.

  • resampled_audio_dir (str or None) – if specified, the directory where the resampled wav files will be stored. If not specified, the audio will not be resampled and the parameters target_samplerate and target_nchannels will be ignored.

  • target_samplerate (int) – sample rate (Hz) to use for resampling. This parameter will be ignored if resampled_audio_dir is None. Defaults to 16000.

  • target_nchannels (int) – number of channels to create during resampling process. This parameter will be ignored if resampled_audio_dir is None. Defaults to 1.

  • use_opus_archive (bool) – if True, will use the version of the archive file which contains audio files saved in the OPUS format, instead of FLAC. The OPUS files take up less memory than the FLAC files, at the cost of the OPUS files being lower quality than the FLAC files. If True, the parameter resampled_audio_dir must be None, as resampling OPUS audio files is currently not supported. Defaults to False.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription>,
}
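
Example (a hypothetical configuration sketch; paths and values are placeholders):

- _target_: sdp.processors.CreateInitialManifestMLS
  raw_data_dir: ${workspace_dir}/raw_data
  language: spanish
  data_split: train
  resampled_audio_dir: ${workspace_dir}/audio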

sdp.processors.RestorePCForMLS[source]#

Recovers original text from the MLS Librivox texts.

This processor can be used to restore punctuation and capitalization for the MLS data. Uses the original data in https://dl.fbaipublicfiles.com/mls/lv_text.tar.gz. Saves recovered text in restored_text_field field. If text was not recovered, restored_text_field will be equal to n/a.

Parameters:
  • language_long (str) – the full name of the language, used for choosing the folder of the contents of “https://dl.fbaipublicfiles.com/mls/lv_text.tar.gz”. E.g., “english”, “spanish”, “italian”, etc.

  • language_short (str or None) – the short name of the language, used for specifying the normalizer we want to use. E.g., “en”, “es”, “it”, etc. If set to None, we will not try to normalize the provided Librivox text.

  • lv_text_dir (str) – the directory where the contents of https://dl.fbaipublicfiles.com/mls/lv_text.tar.gz will be saved.

  • submanifests_dir (str) – the directory where submanifests (one for each combo of speaker + book) will be stored.

  • restored_submanifests_dir (str) – the directory where restored submanifests (one for each combo of speaker + book) will be stored.

  • restored_text_field (str) – the field where the recovered text will be stored.

  • n_jobs (int) – number of jobs to use for parallel processing. Defaults to -1.

  • show_conversion_breakdown (bool) – whether to show how much of each submanifest was restored. Defaults to True.

Returns:

All the same data as in the input manifest with an additional key:

<restored_text_field>: <restored text or n/a if match was not found>
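
Example (a hypothetical configuration sketch; paths and the field name are placeholders):

- _target_: sdp.processors.RestorePCForMLS
  language_long: spanish
  language_short: es
  lv_text_dir: ${workspace_dir}/lv_text
  submanifests_dir: ${workspace_dir}/submanifests
  restored_submanifests_dir: ${workspace_dir}/restored_submanifests
  restored_text_field: text_pc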

VoxPopuli#

sdp.processors.CreateInitialManifestVoxpopuli[source]#

Processor to create initial manifest for the VoxPopuli dataset.

Dataset link: https://github.com/facebookresearch/voxpopuli

Downloads and unzips raw VoxPopuli data for the specified language, and creates an initial manifest using the transcripts provided in the raw data.

Note

This processor will install a couple of Python packages, including PyTorch, so it might be a good idea to run it in an isolated Python environment.

Parameters:
  • raw_data_dir (str) – the directory where the downloaded data will be/is saved.

  • language_id (str) – the language of the data you wish to download. E.g., “en”, “es”, “it”, etc.

  • data_split (str) – “train”, “dev” or “test”.

  • resampled_audio_dir (str) – the directory where the resampled wav files will be stored.

  • target_samplerate (int) – sample rate (Hz) to use for resampling. Defaults to 16000.

  • target_nchannels (int) – number of channels to create during resampling process. Defaults to 1.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription (with provided normalization)>,
    "raw_text": <original transcription (without normalization)>,
    "speaker_id": <speaker id>,
    "gender": <speaker gender>,
    "age": <speaker age>,
    "is_gold_transcript": <whether the transcript has been verified>,
    "accent": <speaker accent, if known>,
}
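
Example (a hypothetical configuration sketch; paths and values are placeholders):

- _target_: sdp.processors.CreateInitialManifestVoxpopuli
  raw_data_dir: ${workspace_dir}/raw_data
  language_id: it
  data_split: train
  resampled_audio_dir: ${workspace_dir}/audio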

sdp.processors.NormalizeFromNonPCTextVoxpopuli[source]#

Tries to restore punctuation and capitalization from the un-normalized text version.

VoxPopuli contains two versions of the transcription: the original (non-normalized, with punctuation and capitalization) and the normalized version (without punctuation or capitalization, but with digits and other forms normalized). This processor can be used to map the normalized and non-normalized versions and produce a normalized version with restored punctuation and capitalization.

Note

The current map logic is highly heuristical and might not work for all languages. The processor will return n/a for any text it was not able to restore, so make sure you check how much data was removed and consider updating the heuristics to retain more data.

Parameters:
  • restored_text_field (str) – the field where the recovered text (or n/a) will be stored. Defaults to “text”.

  • raw_text_key (str) – which field contains the original text without normalization. Defaults to “raw_text”.

  • norm_text_key (str) – which field contains the normalized text. Defaults to “provided_norm_text”.

Returns:

All the same data as in the input manifest with an additional key:

<restored_text_field>: <restored text or n/a if mapping failed>
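
Example (a hypothetical configuration sketch; field names follow the documented defaults):

- _target_: sdp.processors.NormalizeFromNonPCTextVoxpopuli
  restored_text_field: text
  raw_text_key: raw_text
  norm_text_key: provided_norm_text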

CORAAL#

sdp.processors.CreateInitialManifestCORAAL[source]#

Processor to create initial manifest for the Corpus of Regional African American Language (CORAAL) dataset.

Dataset link: https://oraal.github.io/coraal

Will download all files, extract tars and split wav files based on the provided durations in the transcripts.

Parameters:
  • raw_data_dir (str) – where to put raw downloaded data.

  • resampled_audio_dir (str) – where to put re-sampled and trimmed wav files.

  • target_samplerate (int) – sample rate to resample to. Defaults to 16000.

  • target_nchannels (int) – target number of channels. Defaults to 1.

  • drop_pauses (bool) – if True, will drop all transcriptions that contain only silence (indicated by (pause X) in the transcript). Defaults to True.

  • group_duration_threshold (float) – can be used to group consecutive utterances from the same speaker into longer segments of up to this duration. Set to 0 to disable grouping (but note that many utterances are only a few seconds long, so grouping is generally advised). Defaults to 20.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription>,
    "original_file": <name of the original file in the dataset this audio came from>,
    "speaker": <speaker id>,
    "is_interviewee": <whether this is an interviewee (accented speech)>,
    "gender": <speaker gender>,
    "age": <speaker age>,
    "education": <speaker education>,
    "occupation": <speaker occupation>,
}
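
Example (a hypothetical configuration sketch; paths are placeholders):

- _target_: sdp.processors.CreateInitialManifestCORAAL
  raw_data_dir: ${workspace_dir}/raw_data
  resampled_audio_dir: ${workspace_dir}/audio
  drop_pauses: true
  group_duration_threshold: 20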

sdp.processors.TrainDevTestSplitCORAAL[source]#

Custom train-dev-test split for CORAAL dataset.

Split is done speaker-wise, so the same speakers don’t appear in different splits.

Parameters:

data_split (str) – train, dev or test.

Returns:

All the same fields as in the input manifest, but only a subset of the data is retained.
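
Example (a hypothetical configuration sketch):

- _target_: sdp.processors.TrainDevTestSplitCORAAL
  data_split: train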

Librispeech#

sdp.processors.CreateInitialManifestLibrispeech[source]#

Processor to create initial manifest for the Librispeech dataset.

Dataset links: https://openslr.org/12 and https://openslr.org/31

Will download all files, extract tars, and create a manifest file with the “audio_filepath” and “text” fields.

Parameters:
  • split (str) –

    Which datasets or their combinations should be processed. Options are:

    • "dev-clean"

    • "dev-other"

    • "test-clean"

    • "test-other"

    • "train-clean-100"

    • "train-clean-360"

    • "train-other-500"

    • "dev-clean-2"

    • "train-clean-5"

  • raw_data_dir (str) – Path to the folder where the data archive should be downloaded and extracted.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "text": <transcription>,
}
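
Example (a hypothetical configuration sketch; the path is a placeholder):

- _target_: sdp.processors.CreateInitialManifestLibrispeech
  split: dev-clean
  raw_data_dir: ${workspace_dir}/raw_data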

SLR83#

sdp.processors.CreateInitialManifestSLR83[source]#

Processor to create initial manifest for the SLR83 dataset.

This is a dataset introduced in Open-source Multi-speaker Corpora of the English Accents in the British Isles.

Parameters:
  • raw_data_dir (str) – where to put raw downloaded data.

  • dialect (str) –

    should be one of the following:

    • irish_english_male

    • midlands_english_female

    • midlands_english_male

    • northern_english_female

    • northern_english_male

    • scottish_english_female

    • scottish_english_male

    • southern_english_female

    • southern_english_male

    • welsh_english_female

    • welsh_english_male

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription>,
}
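
Example (a hypothetical configuration sketch; the path is a placeholder and the dialect value is one of the documented options):

- _target_: sdp.processors.CreateInitialManifestSLR83
  raw_data_dir: ${workspace_dir}/raw_data
  dialect: irish_english_male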

sdp.processors.CustomDataSplitSLR83[source]#

Splits SLR83 data into train, dev or test subset.

The original paper does not provide train/dev/test splits, so we include a custom processing that can be used as a standardized split to compare results. For more details on this data split see Damage Control During Domain Adaptation for Transducer Based Automatic Speech Recognition.

Note

All data dropping has to be done before the split. We check that the total number of files matches what is expected in the reference split, but if you add any custom pre-processing that changes the duration or number of files, your splits will likely differ.

Parameters:

data_split (str) – “train”, “dev” or “test”.
Returns:

All the same fields as in the input manifest, but only a subset of the data is retained.

MTEDx#

sdp.processors.CreateInitialManifestMTEDX[source]#

Processor to create initial manifest for the Multilingual TEDx (mTEDx) dataset.

Dataset link: https://www.openslr.org/100/

Downloads dataset for the specified language and creates initial manifest with the provided audio and vtt files.

Parameters:
  • raw_data_dir (str) – the directory where the downloaded data will be/is saved. This is also where the extracted and processed data will be.

  • data_split (str) – “train”, “dev” or “test”.

  • language_id (str) – the ID of the language of the data. E.g., “en”, “es”, “it”, etc.

  • target_samplerate (int) – sample rate (Hz) to use for resampling.

  • already_extracted (bool) – if True, we will not try to extract the raw data. Defaults to False.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "vtt_filepath": <path to the corresponding vtt file>,
    "duration": <duration of the audio in seconds>,
}
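
Example (a hypothetical configuration sketch; paths and values are placeholders):

- _target_: sdp.processors.CreateInitialManifestMTEDX
  raw_data_dir: ${workspace_dir}/raw_data
  data_split: train
  language_id: es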

CORAA#

sdp.processors.CreateInitialManifestCORAA[source]#

Processor to create initial manifest for the CORAA ASR dataset.

Dataset link: https://github.com/nilc-nlp/CORAA

Parameters:
  • raw_data_dir (str) – the path to the directory in which all the data will be downloaded.

  • extract_archive_dir (str) – directory where the extracted data will be saved.

  • data_split (str) – “train”, “dev” or “test”.

  • resampled_audio_dir (str) – the directory where the resampled wav files will be stored.

  • already_extracted (bool) – if True, we will not try to extract the raw data. Defaults to False.

  • already_downloaded (bool) – if True, we will not try to download files.

  • target_samplerate (int) – sample rate (Hz) to use for resampling. Defaults to 16000.

  • target_nchannels (int) – number of channels to create during resampling process. Defaults to 1.

  • exclude_dataset (list) – list of dataset names to exclude when creating the initial manifest. Options: ‘SP2010’, ‘C-ORAL-BRASIL I’, ‘NURC-Recife’, ‘TEDx Talks’, ‘ALIP’.
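
Example (a hypothetical configuration sketch; paths are placeholders):

- _target_: sdp.processors.CreateInitialManifestCORAA
  raw_data_dir: ${workspace_dir}/raw_data
  extract_archive_dir: ${workspace_dir}/extracted
  data_split: train
  resampled_audio_dir: ${workspace_dir}/audio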

FLEURS#

sdp.processors.CreateInitialManifestFleurs[source]#

Processor to create initial manifest for the FLEURS dataset.

Dataset link: https://huggingface.co/datasets/google/fleurs

Will download all files, extract them, and create a manifest file with the “audio_filepath” and “text” fields.

Parameters:
  • lang (str) –

    Language to be processed, identified by a combination of ISO 639-1 and ISO 3166-1 alpha-2 codes. Examples are:

    • "hy_am" for Armenian

    • "ko_kr" for Korean

  • split (str) –

    Which dataset splits to process. Options are:

    • "test"

    • "train"

    • "dev"

  • raw_data_dir (str) – Path to the folder where the data archive should be downloaded and extracted.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "text": <transcription>,
}
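
Example (a hypothetical configuration sketch; the path is a placeholder):

- _target_: sdp.processors.CreateInitialManifestFleurs
  lang: hy_am
  split: train
  raw_data_dir: ${workspace_dir}/raw_data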

UzbekVoice#

sdp.processors.CreateInitialManifestUzbekvoice[source]#

Processor to create initial manifest for the Uzbekvoice dataset.

Will download all files, extract them, and create a manifest file with the “audio_filepath”, “text” and “duration” fields.

Parameters:

raw_data_dir (str) – Path to the folder where the data archive should be downloaded and extracted.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription>,
}
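
Example (a hypothetical configuration sketch; the path is a placeholder):

- _target_: sdp.processors.CreateInitialManifestUzbekvoice
  raw_data_dir: ${workspace_dir}/raw_data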

Earnings21/22#

sdp.processors.datasets.earnings.CreateInitialAudioAndManifest[source]#

Create initial audio manifest from Earnings21/22 dataset files.

This processor creates the initial manifest for Earnings21/22 datasets by discovering audio files and creating manifest entries with duration information. Audio format conversion should be handled by a separate FfmpegConvert processor in the pipeline.

Parameters:
  • dataset_root (str) – Path to the root directory of the dataset.

  • raw_audio_source_dir (str) – Path to the directory containing raw audio files.

  • output_manifest_file (str) – Path where the output manifest will be saved.

  • dataset_type (str) – Type of dataset (“earnings21” or “earnings22”). Defaults to “earnings21”.

  • subset (str) – Dataset subset (“full” or “eval10” for earnings21 only). Defaults to “full”.

  • test_mode (bool) – If True, process only 2 files for testing. Defaults to False.

Returns:

Manifest entries with audio_filepath, duration, text (placeholder), and file_id fields. Use FfmpegConvert processor afterwards for audio format standardization.

Example

- _target_: sdp.processors.datasets.earnings.CreateInitialAudioAndManifest
  dataset_root: /path/to/earnings21
  raw_audio_source_dir: ${dataset_root}/media
  output_manifest_file: ${output_dir}/01_initial_manifest.json
  dataset_type: earnings21
  subset: full
  test_mode: false
sdp.processors.datasets.earnings.CreateFullAudioManifestEarnings21[source]#

Add ground truth text from NLP token files to audio manifest.

This processor reconstructs the complete transcribed text for each audio file by reading the corresponding NLP token files and combining tokens with proper spacing and punctuation. It preserves the original punctuation and capitalization from the dataset.

Parameters:
  • input_manifest_file (str) – Path to the input manifest file.

  • dataset_root (str) – Path to the root directory of the dataset.

  • output_manifest_file (str) – Path where the output manifest will be saved.

  • dataset_type (str) – Type of dataset (“earnings21” or “earnings22”). Defaults to “earnings21”.

  • preserve_punctuation (bool) – Whether to preserve punctuation marks. Defaults to True.

  • preserve_capitalization (bool) – Whether to preserve original capitalization. Defaults to True.

Returns:

Manifest entries with the original fields plus populated text field containing the complete reconstructed transcript for each audio file.

Example

- _target_: sdp.processors.datasets.earnings.CreateFullAudioManifestEarnings21
  input_manifest_file: ${output_dir}/01_initial_manifest.json
  dataset_root: /path/to/earnings21
  output_manifest_file: ${output_dir}/02_manifest_with_text.json
  dataset_type: earnings21
  preserve_punctuation: true
  preserve_capitalization: true
sdp.processors.datasets.earnings.SpeakerSegmentedManifest[source]#

Create speaker-level segments based on speaker changes in NLP files.

This processor creates segments where each segment corresponds to continuous speech from a single speaker. It reads NLP token files to detect speaker changes and creates separate manifest entries for each speaker segment without timing calculations.

Parameters:
  • input_manifest_file (str) – Path to the input manifest file.

  • dataset_root (str) – Path to the root directory of the dataset.

  • output_manifest_file (str) – Path where the output manifest will be saved.

  • dataset_type (str) – Type of dataset (“earnings21” or “earnings22”). Defaults to “earnings21”.

  • preserve_punctuation (bool) – Whether to preserve punctuation marks. Defaults to True.

  • preserve_capitalization (bool) – Whether to preserve original capitalization. Defaults to True.

  • include_speaker_info (bool) – Whether to include speaker information. Defaults to True.

  • include_tags (bool) – Whether to include entity tags (earnings21 only). Defaults to False.

  • use_speaker_metadata_csv (bool) – Whether to use speaker metadata CSV for name mapping. Defaults to False.

Returns:

Manifest entries segmented by speaker with audio_filepath, duration (set to 0), text, file_id, segment_id, and optionally speaker and tags fields.

Example

- _target_: sdp.processors.datasets.earnings.SpeakerSegmentedManifest
  input_manifest_file: ${output_dir}/02_manifest_with_text.json
  dataset_root: /path/to/earnings21
  output_manifest_file: ${output_dir}/06_speaker_segments.json
  dataset_type: earnings21
  include_speaker_info: true
  include_tags: false
sdp.processors.datasets.earnings.CreateSentenceSegmentedManifest[source]#

Create sentence-level segments from word-level CTM alignment files.

This processor reads CTM (Conversation Time Mark) files generated by forced alignment and creates sentence-level segments based on punctuation patterns. It intelligently segments on sentence-ending punctuation while excluding abbreviations and numbers.

Parameters:
  • input_manifest_file (str) – Path to the input manifest file.

  • ctm_dir (str) – Path to the directory containing CTM files with word-level alignments.

  • output_manifest_file (str) – Path where the output manifest will be saved.

Returns:

Manifest entries with sentence-level segments containing audio_filepath, duration (calculated from CTM), text, file_id, segment_id, offset, end_time, and alignment fields with word-level timing information.

Example

- _target_: sdp.processors.datasets.earnings.CreateSentenceSegmentedManifest
  input_manifest_file: ${output_dir}/04_aligned_manifest.json
  ctm_dir: ${output_dir}/forced_alignment_output/ctm/words
  output_manifest_file: ${output_dir}/05_sentence_segments.json
sdp.processors.datasets.earnings.NeMoForcedAligner[source]#

Apply NeMo Forced Aligner to generate word-level timing alignments.

This processor uses NeMo’s forced alignment capabilities to generate precise word-level timing information by aligning ground truth text with audio files. It produces CTM files containing word-level timestamps and updates the manifest with alignment information.

Parameters:
  • input_manifest_file (str) – Path to the input manifest file.

  • output_manifest_file (str) – Path where the output manifest will be saved.

  • output_dir (str) – Directory where CTM files and other outputs will be saved.

  • pretrained_name (str) – Name or path of the NeMo ASR model to use for alignment.

  • device (str) – Device for computation (“cuda” or “cpu”). Defaults to “cuda”.

  • nemo_path (str) – Optional path to NeMo installation directory.

Returns:

Manifest entries with added alignment field containing word-level timing information and updated duration based on alignment results.

Example

- _target_: sdp.processors.datasets.earnings.NeMoForcedAligner
  input_manifest_file: ${output_dir}/03_cleaned_manifest.json
  output_manifest_file: ${output_dir}/04_aligned_manifest.json
  output_dir: ${output_dir}/forced_alignment_output
  pretrained_name: nvidia/parakeet-tdt_ctc-1.1b
  device: cuda
  batch_size: 1
sdp.processors.datasets.earnings.ApplyEarnings21Normalizations[source]#

Apply text normalizations using Earnings21 dataset normalization files.

This processor reads normalization files provided with the Earnings21 dataset and applies text normalizations based on probability scores. It can use the highest probability normalization candidate or fallback to original text.

Parameters:
  • earnings21_root (str) – Path to the root directory of Earnings21 dataset.

  • use_top_candidate (bool) – Whether to use the highest probability candidate. Defaults to True.

  • fallback_to_original (bool) – Whether to fallback to original text if no normalization available. Defaults to True.

  • preserve_entity_tags (bool) – Whether to preserve entity tags during normalization. Defaults to True.

Returns:

Manifest entries with normalized text field based on the normalization files.

Example

- _target_: sdp.processors.datasets.earnings.ApplyEarnings21Normalizations
  earnings21_root: /path/to/earnings21
  use_top_candidate: true
  fallback_to_original: true
  preserve_entity_tags: true

MASC#

sdp.processors.CreateInitialManifestMASC[source]#

Processor for creating initial manifest for Massive Arabic Speech Corpus (MASC).

Dataset link: https://ieee-dataport.org/open-access/masc-massive-arabic-speech-corpus. Prior to calling the processor, download the tarred dataset and store it under raw_dataset_dir/masc.tar.gz.

Creates manifest from samples in dataset_dir/subsets/data_split.csv. All meta information is kept.

Parameters:
  • raw_data_dir (str) – The root directory of the dataset.

  • extract_archive_dir (str) – Directory where the extracted data will be saved.

  • resampled_audios_dir (str) – Directory where the resampled audio will be saved.

  • data_split (str) – Dataset split type.

  • already_extracted (bool) – If True, we will not try to extract the raw data. Defaults to False.

  • target_samplerate (int) – Sample rate (Hz) to use for resampling. Defaults to 16000.

  • target_nchannels (int) – Number of channels to create during resampling process. Defaults to 1.

  • output_manifest_sample_id_key (str) – The field name to store sample ID. Defaults to “sample_id”.

  • output_manifest_vtt_filapath_key (str) – The field name to store vtt file path. Defaults to “vtt_filepath”.

  • output_manifest_audio_filapath_key (str) – The field name to store audio file path. Defaults to “audio_filepath”.

  • verbose (bool) – Set to True for more detailed logging.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "sample_id": <sample ID>,
    "audio_filepath": <path to the audio file>,
    "vtt_filepath": <path to the vtt file>,
    "category": <video category>,
    "video_duration": <video duration>,
    "channel_id": <video channel ID>,
    "country": <video country>,
    "dialect": <speaker dialect>,
    "gender": <speaker gender>,
    "transcript_duration": <transcript duration>,
}
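
Example (a hypothetical configuration sketch; paths are placeholders):

- _target_: sdp.processors.CreateInitialManifestMASC
  raw_data_dir: ${workspace_dir}/masc
  extract_archive_dir: ${workspace_dir}/extracted
  resampled_audios_dir: ${workspace_dir}/audio
  data_split: train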

MediaSpeech#

sdp.processors.CreateInitialManifestMediaSpeech[source]#

Processor for creating initial manifest for the MediaSpeech Arabic dataset. Dataset link: https://www.openslr.org/108/. Prior to calling the processor, download the tarred dataset and store it under raw_dataset_dir/AR.tgz or raw_dataset_dir/AR.tar.gz.

Parameters:
  • raw_data_dir (str) – The root directory of the dataset.

  • extract_archive_dir (str) – Directory where the extracted data will be saved.

  • resampled_audios_dir (str) – Directory where the resampled audio will be saved.

  • already_extracted (bool) – If True, we will not try to extract the raw data. Defaults to False.

  • target_samplerate (int) – Sample rate (Hz) to use for resampling. Defaults to 16000.

  • target_nchannels (int) – Number of channels to create during resampling process. Defaults to 1.

  • output_manifest_sample_id_key (str) – The field name to store sample ID. Defaults to “sample_id”.

  • output_manifest_audio_filapath_key (str) – The field name to store audio file path. Defaults to “audio_filepath”.

  • output_manifest_text_key (str) – The field name to store text. Defaults to “text”.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "text": <text>,
}
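
Example (a hypothetical configuration sketch; paths are placeholders):

- _target_: sdp.processors.CreateInitialManifestMediaSpeech
  raw_data_dir: ${workspace_dir}/mediaspeech
  extract_archive_dir: ${workspace_dir}/extracted
  resampled_audios_dir: ${workspace_dir}/audio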

HuggingFace Datasets#

sdp.processors.CreateInitialManifestHuggingFace[source]#

Processor to create initial manifest for HuggingFace dataset.

Downloads HuggingFace dataset and creates an initial manifest.

Parameters:
  • dataset_name (str) – the name of the dataset. E.g., “tarteel-ai/everyayah”

  • raw_data_dir (str) – the path to the directory containing the raw dataset files.

  • resampled_audio_dir (str) – directory where the resampled audio will be saved.

  • data_split (str) – “train”, “validation” or “test”.

  • already_downloaded (bool) – if True, we will not try to load dataset from HuggingFace. Defaults to False.

  • target_samplerate (int) – sample rate (Hz) to use for resampling. Defaults to 16000.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription (with capitalization and punctuation)>,
}
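
Example (a hypothetical configuration sketch; paths are placeholders, and the dataset name is the one cited above):

- _target_: sdp.processors.CreateInitialManifestHuggingFace
  dataset_name: tarteel-ai/everyayah
  raw_data_dir: ${workspace_dir}/raw_data
  resampled_audio_dir: ${workspace_dir}/audio
  data_split: train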

YTC Datasets#

sdp.processors.datasets.ytc.create_initial_manifest.CreateInitialManifestYTC[source]#

A processor class for creating initial manifest files for a TTS dataset.

It takes a manifest file containing audio file paths and resamples them to a target sample rate and format, while creating a new manifest file with the updated paths.

Parameters:
  • input_format (str) – Format of the input audio files

  • resampled_audio_dir (str) – Directory where resampled audio files will be saved

  • target_sample_rate (int) – Desired sample rate for the output audio files

  • target_format (str) – Desired format for the output audio files

  • target_nchannels (int) – Desired number of channels for the output audio files

Returns:

The same data as in the input manifest, but with resampled audio files and updated audio file paths.

Example

- _target_: sdp.processors.datasets.ytc.create_initial_manifest.CreateInitialManifestYTC
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_resampled.json

HiFiTTS-2#

sdp.processors.DownloadHiFiTTS2[source]#

Downloads the HiFiTTS-2 dataset to the local machine. Unsegmented audiobook chapters are first downloaded from LibriVox at a 48 kHz sample rate. Each chapter is then split into segmented utterance files based on precomputed offsets and durations.

To reduce disk use, the chapter files can be optionally deleted after they are segmented.

Metadata for chapters which fail to download due to network errors are stored in an output manifest file, which can be given as input to this processor to attempt the downloads again.

Parameters:
  • audio_dir (str) – Root directory where utterance files will be saved.

  • chapter_dir (str) – Root directory where audiobook chapter files will be saved.

  • sample_rate (int) – Sample rate to use for utterance files.

  • delete_chapter_files (bool) – Whether to delete each chapter file after it is done being processed.

  • exit_on_error (bool) – Whether to terminate the entire processor script if a single chapter download fails.

  • num_retries (int) – Number of times to retry chapter download after encountering intermittent HTTP errors.

Returns:

Utterance files are stored under ‘audio_dir’ and chapter files are downloaded under ‘chapter_dir’.

If exit_on_error is False, then an output manifest will be saved with manifest entries that fail to download, with error information stored under the ‘error_code’ and ‘error_reason’ fields.

Example

- _target_: sdp.processors.DownloadHiFiTTS2
  input_manifest_file: ${workspace_dir}/manifest_22khz.json
  output_manifest_file: ${workspace_dir}/errors_22khz.json
  audio_dir: ${workspace_dir}/audio_22khz
  chapter_dir: ${workspace_dir}/chapters
  max_workers: 8
sdp.processors.RemovedFailedChapters[source]#

Removes all utterances in the input chapter file from the input manifest. This processor is expected to be run on the file output by DownloadHiFiTTS2 that contains failed chapter downloads.

Parameters:

error_file (str) – Path to file with chapter download errors.

Returns:

This outputs a manifest which is the same as its input manifest but with utterances in ‘error_file’ removed.

Example

- _target_: sdp.processors.RemovedFailedChapters
  input_manifest_file: ${workspace_dir}/manifest_22khz.json
  output_manifest_file: ${workspace_dir}/manifest_filtered_22khz.json
  error_file: ${workspace_dir}/errors_22khz.json

Lhotse processors#

The following processors leverage Lhotse, a speech data handling library that contains data preparation recipes for 80+ publicly available datasets. Lhotse has its own data manifest format that can be largely mapped into NeMo’s format.

sdp.processors.LhotseImport[source]#

Processor to create an initial manifest imported from a Lhotse CutSet. The input_manifest_file is expected to point to a Lhotse CutSet manifest, which usually has cuts in its name and a .jsonl or .jsonl.gz extension.

Lhotse is a library for speech data processing and loading; see https://github.com/lhotse-speech/lhotse for details.

It can be installed using pip install lhotse.

Caution

Currently we only support the importing of cut sets that represent single-channel, single-audio-file-per-utterance datasets.

Returns:

This processor generates an initial manifest file with the following fields:

{
    "audio_filepath": <path to the audio file>,
    "duration": <duration of the audio in seconds>,
    "text": <transcription (with capitalization and punctuation)>,
}
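A minimal configuration sketch for this processor (the file names are placeholders; Lhotse cut set manifests typically use a .jsonl.gz extension):

```yaml
- _target_: sdp.processors.LhotseImport
  input_manifest_file: ${workspace_dir}/librispeech_cuts_train.jsonl.gz
  output_manifest_file: ${workspace_dir}/manifest.json
```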

Data enrichment#

The following processors can be used to add additional attributes to the data by running different NeMo models (e.g., ASR predictions). These attributes are typically used in the downstream processing for additional enhancement or filtering.

sdp.processors.ASRInference[source]#

This processor performs ASR inference on each utterance of the input manifest.

ASR predictions will be saved in the pred_text key.

Parameters:
  • pretrained_model (str, Optional) – the name or the filepath of the pretrained NeMo ASR model which will be used to do inference.

  • batch_size (int) – the batch size to use for ASR inference. Defaults to 32.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseProcessor.

Returns:

The same data as in the input manifest with an additional field pred_text containing ASR model’s predictions.
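A configuration sketch (the model name is illustrative; any NeMo ASR model name or a path to a local .nemo file should work here):

```yaml
- _target_: sdp.processors.ASRInference
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_pred.json
  pretrained_model: stt_en_conformer_transducer_large
  batch_size: 32
```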

sdp.processors.PCInference[source]#

Adds predictions of a text-based punctuation and capitalization (P&C) model.

Operates on the text in the input_text_field, and saves predictions in the output_text_field.

Parameters:
  • input_text_field (str) – the text field that will be the input to the P&C model.

  • output_text_field (str) – the text field where the output of the PC model will be saved.

  • batch_size (int) – the batch sized used by the P&C model.

  • device (str, Optional) – the device used by the P&C model. Can be skipped to auto-select.

  • pretrained_name (str, Optional) – the pretrained_name of the P&C model.

  • model_path (str, Optional) – the model path to the P&C model.

  • **kwargs – Additional keyword arguments to be passed to the base class PCInference.

Note

Either pretrained_name or model_path have to be specified.

Returns:

The same data as in the input manifest with an additional field <output_text_field> containing P&C model’s predictions.
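A configuration sketch (field names and the pretrained model name are illustrative; remember that either pretrained_name or model_path must be set):

```yaml
- _target_: sdp.processors.PCInference
  input_manifest_file: ${workspace_dir}/manifest_pred.json
  output_manifest_file: ${workspace_dir}/manifest_pred_pc.json
  input_text_field: pred_text
  output_text_field: pred_text_pc
  pretrained_name: punctuation_en_bert
  batch_size: 64
```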

sdp.processors.ASRTransformers[source]#

This processor transcribes audio files using HuggingFace ASR Transformer models.

It processes audio files from the manifest and adds transcriptions using the specified pre-trained model from HuggingFace.

Parameters:
  • pretrained_model (str) – Name of pretrained model on HuggingFace.

  • output_text_key (str) – Key to save transcription result in the manifest.

  • input_audio_key (str) – Key to read audio file paths from the manifest. Default: “audio_filepath”.

  • input_duration_key (str) – Key for audio duration in the manifest. Default: “duration”.

  • device (str) – Inference device (e.g., “cuda”, “cpu”). Default: None.

  • batch_size (int) – Inference batch size. Default: 1.

  • chunk_length_s (int) – Length of audio chunks in seconds. Default: 0.

  • torch_dtype (str) – Tensor data type for model inference. Default: “float32”.

  • generate_task (str) – Task type for generation. Default: “transcribe”.

  • generate_language (str) – Language for generation. Default: “english”.

  • max_new_tokens (int, Optional) – Maximum number of new tokens to generate. Default: None.

Returns:

A manifest with transcribed text added to each entry under the specified output_text_key.
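A configuration sketch (the HuggingFace model name and batch settings are illustrative):

```yaml
- _target_: sdp.processors.ASRTransformers
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_whisper.json
  pretrained_model: openai/whisper-large-v3
  output_text_key: pred_text
  batch_size: 4
  chunk_length_s: 30
```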

sdp.processors.tts.pyannote.PyAnnoteDiarizationAndOverlapDetection[source]#

This processor performs speaker diarization and overlap detection using PyAnnote.

It processes audio files to identify different speakers and detect overlapping speech segments using PyAnnote’s speaker diarization pipeline and VAD (Voice Activity Detection). The processor segments audio into speaker turns and identifies regions with overlapping speech.

Parameters:
  • hf_token (str) – HuggingFace authentication token for accessing pretrained models

  • segmentation_batch_size (int, Optional) – Batch size for segmentation. Defaults to 128

  • embedding_batch_size (int, Optional) – Batch size for speaker embeddings. Defaults to 128

  • min_length (float, Optional) – Minimum length of segments in seconds. Defaults to 0.5

  • max_length (float, Optional) – Maximum length of segments in seconds. Defaults to 40

  • device (str, Optional) – Device to run the models on (‘cuda’ or ‘cpu’). Defaults to “cuda”

Returns:

The same data as in the input manifest, but with speaker diarization and overlap detection information added to each segment.

Example

- _target_: sdp.processors.tts.pyannote.PyAnnoteDiarizationAndOverlapDetection
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_diarized.json
  hf_token: ${hf_token}
sdp.processors.tts.nemo_asr_align.NeMoASRAligner[source]#

This processor aligns text and audio using NeMo ASR models.

It uses a pre-trained ASR model to transcribe audio files and generate word-level alignments with timestamps. The processor supports both CTC and RNNT decoders and can process either full audio files or just specific segments.

Parameters:
  • model_name (str) – Name of pretrained model to use. Defaults to “nvidia/parakeet-tdt_ctc-1.1b”

  • model_path (str, Optional) – Path to local model file. If provided, overrides model_name

  • min_len (float) – Minimum length of audio segments to process in seconds. Defaults to 0.1

  • max_len (float) – Maximum length of audio segments to process in seconds. Defaults to 40

  • parakeet (bool) – Whether model is a Parakeet model. Affects time stride calculation. Defaults to True

  • ctc (bool) – Whether to use CTC decoding. Defaults to False

  • batch_size (int) – Batch size for processing. Defaults to 32

  • num_workers (int) – Number of workers for data loading. Defaults to 10

  • split_batch_size (int) – Maximum size for splitting large batches. Defaults to 5000

  • timestamp_type (str) – Type of timestamp to generate (“word” or “char”). Defaults to “word”

  • infer_segment_only (bool) – Whether to process only segments instead of full audio. Defaults to False

  • device (str) – Device to run the model on. Defaults to “cuda”

Returns:

The same data as in the input manifest, but with word-level alignments added to each segment.

Example

- _target_: sdp.processors.tts.nemo_asr_align.NeMoASRAligner
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_aligned.json
  parakeet: True
sdp.processors.tts.metrics.TorchSquimObjectiveQualityMetricsProcessor[source]#

This processor calculates Squim quality metrics for audio files.

It uses a pre-trained Squim model to calculate audio quality metrics like PESQ, STOI and SI-SDR for each audio segment in the manifest:

PESQ (Perceptual Evaluation of Speech Quality): a measure of overall speech quality. It was originally designed to detect codec distortions, but correlates highly with all kinds of distortion.

STOI (Short-Time Objective Intelligibility): a measure of speech intelligibility that essentially measures speech envelope integrity. A STOI value of 1.0 means 100% of the evaluated speech is intelligible on average.

SI-SDR (Scale-Invariant Signal-to-Distortion Ratio): a measure, in decibels, of how strong the speech signal is relative to all the distortion present in the audio. 0 dB means the energies of speech and distortion are the same; values between 15-20 dB are generally considered “clean enough” speech.

Parameters:

device (str, Optional) – Device to run the model on. Defaults to “cuda”.

Returns:

The same data as in the input manifest, but with quality metrics added to each segment’s metrics field.

Example

- _target_: sdp.processors.tts.metrics.TorchSquimObjectiveQualityMetricsProcessor
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_squim.json
sdp.processors.tts.metrics.BandwidthEstimationProcessor[source]#

This processor estimates audio bandwidth by analyzing power spectra.

It analyzes audio files to estimate their effective bandwidth by examining the power spectrum and determining the highest frequency with significant energy content above a threshold.

Parameters:
  • n_fft (int, Optional) – Size of FFT window. Defaults to 512

  • stride_seconds (float, Optional) – Time between successive FFT windows in seconds. Defaults to 0.01

  • top_db (float, Optional) – Maximum decibel value for power spectrum normalization. Defaults to 100.0

  • frequency_threshold (float, Optional) – Threshold in dB below peak for bandwidth estimation. Defaults to -50.0

Returns:

The same data as in the input manifest, but with bandwidth estimates added to each segment.

Example

- _target_: sdp.processors.tts.metrics.BandwidthEstimationProcessor
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_with_bandwidth.json
sdp.processors.FasterWhisperInference[source]#

Processor that performs parallel audio transcription using the FasterWhisper model.

This class reads a manifest of audio files, transcribes them using multiprocessing (each device or CPU thread handles a portion), and writes results in a NeMo-compatible manifest.

Parameters:
  • input_manifest_file (str) – Path to the input manifest.

  • output_manifest_file (Optional[str]) – Path to the output manifest (default: <output_dir>/predictions_all.json).

  • model_size_or_path (str) – Whisper model path or model name (e.g., ‘base’, ‘medium’).

  • device (str) – Device type to use (‘auto’, ‘cuda’, or ‘cpu’).

  • num_devices (int) – Number of workers/devices to use (-1 = all available).

  • model_download_root (Optional[str]) – Directory where model checkpoints will be downloaded.

  • output_dir (Optional[str]) – Directory to store output predictions and timestamps.

  • skip_corrupted_audios (bool) – Whether to skip audio files that raise exceptions.

  • save_timestamps_separately (bool) – If True, saves segment/word timestamps as separate files.

  • slice_by_offset (bool) – If True, slices audio using offset/duration before inference.

  • inference (Optional[Dict]) – Additional inference parameters for Whisper.

  • language_detection_only (bool) – If True, only perform language detection.

  • in_memory_chunksize (int) – Number of samples to load per worker at once.

  • audio_filepath_field (str) – Name of the field in manifest pointing to audio path.

Returns:

A final merged manifest file where each line corresponds to the transcription result of an input audio sample. The manifest is assembled from multiple per-worker (rank) manifest files, each produced by a separate device or process.

Each entry contains the following fields:

  • language (str, optional): Detected language (if language detection is enabled).

  • language_probability (float, optional): Confidence score of detected language.

  • pred_text (str): Final transcribed text obtained by concatenating all segment texts.

One of the following timestamp representations will also be included, depending on the value of save_timestamps_separately:

  • If save_timestamps_separately=False:
    • segments (List[Dict]): List of segment dictionaries with start/end timestamps and transcribed text.

  • If save_timestamps_separately=True:
    • segments (str): Path to a JSON file containing segment-level timestamps.

    • words (str, optional): Path to a JSON file containing word-level timestamps (if word_timestamps=True).

The final combined manifest is written to output_manifest_file, which defaults to <output_dir>/predictions_all.json.

Note

Make sure to install the following packages before using this processor:

pip install pytorch-lightning nvidia-cublas-cu12 nvidia-cudnn-cu12==9.* faster_whisper

Additionally, ensure that the dynamic libraries for cuBLAS and cuDNN are discoverable at runtime:

export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`

This is required for CUDA backend components to function correctly when using FasterWhisper with GPU acceleration.

For detailed configuration options and advanced usage of FasterWhisper, refer to the official repository: SYSTRAN/faster-whisper

Example

- _target_: sdp.processors.FasterWhisperInference
   input_manifest_file: /your/input/manifest.json
   output_manifest_file: /your/output/manifest.json
   model_size_or_path: base
sdp.processors.vLLMInference[source]#

A processor that performs inference using a vLLM model on entries from an input manifest.

This class supports three prompt configuration modes:

  • a static prompt template (prompt)

  • a field in each entry containing the prompt (prompt_field)

  • a YAML file containing the prompt structure (prompt_file)

The prompts are converted into chat-style input using a tokenizer chat template, passed to the vLLM engine for generation, and the results are written to an output manifest.

Parameters:
  • prompt (str, optional) – A fixed prompt used for all entries.

  • prompt_field (str, optional) – The key in each entry that holds the prompt template.

  • prompt_file (str, optional) – Path to a YAML file containing the prompt structure.

  • generation_field (str) – Name of the output field to store generated text. Default is ‘generation’.

  • model (dict) – Parameters to initialize the vLLM model.

  • inference (dict) – Sampling parameters passed to vLLM.SamplingParams.

  • apply_chat_template (dict) – Arguments passed to the tokenizer’s apply_chat_template method.

  • **kwargs – Passed to the BaseProcessor (includes input_manifest_file and output_manifest_file).

Raises:

ValueError – If none, or more than one, of the prompt configuration options (prompt, prompt_field, prompt_file) is provided.

Returns:

A line-delimited JSON manifest where each entry includes the original fields plus a field with the generated output.

Note

For detailed parameter options, refer to the vLLM documentation.

Make sure to install optree>=0.13.0 and vllm before using this processor:

pip install "optree>=0.13.0" vllm
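A configuration sketch using the static-prompt mode (the model name, prompt, and sampling parameters are illustrative placeholders):

```yaml
- _target_: sdp.processors.vLLMInference
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_gen.json
  prompt: "Summarize the following transcript in one sentence."
  generation_field: generation
  model:
    model: meta-llama/Meta-Llama-3-8B-Instruct
  inference:
    temperature: 0.0
    max_tokens: 128
```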

sdp.processors.AudioLid[source]#

Processor for language identification (LID) of audio files using a pre-trained LID model.

Parameters:
  • input_audio_key (str) – The key in the dataset containing the path to the audio files for language identification.

  • pretrained_model (str) – The name of the pre-trained ASR model for language identification.

  • output_lang_key (str) – The key to store the identified language for each audio file.

  • device (str) – The device to run the ASR model on (e.g., ‘cuda’, ‘cpu’). If None, it automatically selects the available GPU if present; otherwise, it uses the CPU.

  • segment_duration (float) – Random sample duration in seconds. Default is np.inf.

  • num_segments (int) – Number of segments of the file to use for a majority vote. Default is 1.

  • random_seed (int) – Seed for generating the starting position of the segment. Default is None.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseProcessor.

sdp.processors.CometoidWMTQualityEstimation[source]#

A processor for estimating translation quality using pretrained COMET-like models based on MarianNMT and the pymarian Evaluator.

This processor evaluates the quality of source-target text pairs (bitext) using COMETOID-style quality estimation and appends the resulting score to each dataset entry.

Parameters:
  • source_text_field (str) – The key in the data entry containing the source (original) text.

  • target_text_field (str) – The key in the data entry containing the target (translated) text.

  • model_name_or_path (str) – Hugging Face model name or path to local model checkpoint.

  • vocab_path (str, optional) – Path to the vocabulary file. If None and model is from HF, it will be downloaded.

  • save_model_to (str, optional) – Directory to download and cache the model and vocab.

  • mini_batch (int) – Mini-batch size for evaluation.

  • maxi_batch (int) – Maxi-batch size for evaluation.

  • output_field (str) – The name of the field where the quality score will be saved in the output manifest.

  • device_type (str) – Device type to use: ‘cpu’ or ‘gpu’.

  • num_devices (int) – Number of CPU threads or GPU devices to use. Use -1 to use all available.

  • chunksize (int) – Number of lines to process in each chunk.

Returns:

A manifest file where each entry has an added key (output_field) with the computed score.

Note

This processor uses MarianNMT models fine-tuned for quality estimation. See https://marian-nmt.github.io/.

Make sure to install pymarian before using this processor:

pip install pymarian
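A configuration sketch (the model name shown is a placeholder; substitute the actual Cometoid checkpoint name or a local path):

```yaml
- _target_: sdp.processors.CometoidWMTQualityEstimation
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_scored.json
  source_text_field: text
  target_text_field: translation
  model_name_or_path: cometoid-wmt23   # placeholder model name
  output_field: qe_score
  device_type: gpu
  num_devices: -1
```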

sdp.processors.FastTextLangIdClassifier[source]#

This processor supports language identification using pretrained FastText models. It classifies text and adds the predicted label and probability to the dataset entry. If needed, it downloads the model, loads it into memory, and performs prediction on the specified input text field.

Parameters:
  • model_name_or_path (str) – Path to a FastText model file or the name of a supported remote model (‘lid.176.bin’ or ‘lid.176.ftz’).

  • text_field (str) – The name of the field in the dataset entry that contains the input text for classification.

  • output_field (str) – The name of the field to store the predicted label. Defaults to “label”.

  • top_k (int) – The number of top predictions to return. Defaults to 1 (-1 for all).

  • cache_dir (str, optional) – Directory to store the downloaded model file. If not provided, a temporary directory is used.

  • **kwargs – Additional keyword arguments passed to BaseParallelProcessor.

Returns:

A manifest where each entry contains the original data fields plus
  • <output_field>: The predicted label (e.g., language code for lid.176.bin).

  • <output_field>_prob: The probability of the prediction.

Note

Make sure to install fasttext before using this processor:

pip install fasttext
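A configuration sketch using one of the supported remote models (lid.176.bin, as listed above):

```yaml
- _target_: sdp.processors.FastTextLangIdClassifier
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_lid.json
  model_name_or_path: lid.176.bin
  text_field: text
  output_field: lang
```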

Text-only processors#

Note

All processors in this section accept additional parameter text_key (defaults to “text”) to control which field is used for modifications/filtering.

sdp.processors.ReadTxtLines[source]#

The text file specified in source_filepath will be read, and each line in it will be added as a line in the output manifest, saved in the field text_key.

Parameters:
  • input_file_key (str) – The key in the manifest containing the input txt file path.

  • text_key (str) – The key to store the read text lines in the manifest.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.

Data modifications#

sdp.processors.SubRegex[source]

Applies a sequence of regex substitutions to the specified text field in each data entry.

This processor performs regex-based substitutions as defined in either a provided list of regex parameter dictionaries or a YAML configuration file. Each substitution is applied in the order specified.

Before substitutions are applied, a space is temporarily added to the beginning and end of the text to improve regex match consistency. After all substitutions, leading/trailing spaces and repeated spaces are removed.

Parameters:
  • regex_params_list (List[Dict], optional) –

    A list of dictionaries specifying the regex substitutions. Each dictionary must include:

    - "pattern": A regex pattern to match.
    - "repl": A replacement string.
    - "count" (optional): Maximum number of replacements to make. Defaults to 0 (replace all).
    

  • regex_params_yaml (str, optional) – Path to a YAML file that defines the same list of dictionaries as regex_params_list. Either regex_params_list or regex_params_yaml must be provided. If both are provided, regex_params_yaml takes precedence.

  • text_key (str) – The key in each data entry whose value will be modified. Defaults to “text”.

  • **kwargs – Additional arguments passed to the BaseParallelProcessor.

Example YAML format for regex_params_yaml:

# regex_params.yaml
- {"pattern": "♩", "repl": " "}
- {"pattern": "♭", "repl": " "}
- {"pattern": "\|", "repl": " "}
- {"pattern": ":", "repl": " "}
- {"pattern": "-", "repl": " "}
- {"pattern": "[^ €₽₴$£%?!',.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЬЮЯабвгдежзийклмнопрстуфхцчшщъьюя]", "repl": ""}
- {"pattern": "\s+\.", "repl": "."}
- {"pattern": "\?+", "repl": "?"}
- {"pattern": "\.+", "repl": "."}

Returns:

The same data as in the input manifest with <text_key> field changed.
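The substitution behavior described above can be sketched in plain Python (a simplified illustration of the padding, ordered substitution, and whitespace cleanup, not the processor’s actual implementation):

```python
import re

def sub_regex(text, regex_params_list):
    # Temporarily pad with spaces to improve boundary matches,
    # then apply each substitution in order.
    text = f" {text} "
    for params in regex_params_list:
        text = re.sub(params["pattern"], params["repl"], text,
                      count=params.get("count", 0))
    # Collapse repeated spaces and strip leading/trailing spaces.
    return re.sub(r" +", " ", text).strip()

rules = [
    {"pattern": r"\|", "repl": " "},
    {"pattern": r"\?+", "repl": "?"},
    {"pattern": r"\.+", "repl": "."},
]
print(sub_regex("Hello|world??  Done...", rules))  # Hello world? Done.
```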

sdp.processors.SubMakeLowercase[source]#

Processor to convert text to lowercase.

Parameters:
  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

Returns:

The same data as in the input manifest with <text_key> field changed.

sdp.processors.MakeLettersUppercaseAfterPeriod[source]#

Can be used to replace characters with upper-case version after punctuation.

Parameters:
  • punctuation (str) – string with all punctuation characters to consider. Defaults to “.!?”.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

Returns:

The same data as in the input manifest with <text_key> field changed.
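The idea can be sketched in plain Python (a simplified illustration that uppercases the letter following punctuation plus a space, not the processor’s actual implementation):

```python
import re

def uppercase_after_punct(text, punctuation=".!?"):
    # Match any listed punctuation character followed by a space and a
    # word character, and uppercase the matched span.
    pattern = rf"([{re.escape(punctuation)}] \w)"
    return re.sub(pattern, lambda m: m.group(1).upper(), text)

print(uppercase_after_punct("hello. world! how are you? fine"))
# hello. World! How are you? Fine
```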

sdp.processors.SplitLineBySentence[source]#

Processor for splitting lines of text into sentences based on a specified pattern. One line containing N sentences will be transformed into N lines containing one sentence.

Parameters:
  • text_key (str) – The field containing the text lines in the dataset.

  • end_pattern (str) – The regular expression pattern to identify sentence boundaries.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.

sdp.processors.CountNumWords[source]#

A processor that counts the number of words in the text_key field of each dataset entry and stores the result in num_words_key.

Before counting, the text is optionally cleaned using a custom alphabet: - If alphabet is provided, all characters not in the alphabet are replaced with whitespace. - Consecutive whitespace characters are collapsed into a single space. - The number of resulting space-separated tokens is counted as the number of words.

Parameters:
  • text_key (str) – The key in the input data entry containing the text to be analyzed.

  • num_words_key (str) – The key under which the word count will be stored in the output entry. Defaults to “num_words”.

  • alphabet (str, optional) – A string of allowed characters (e.g., lowercase letters). All characters not in this set will be replaced with whitespace before counting. If not provided, no filtering is applied.

  • **kwargs – Additional arguments passed to the BaseParallelProcessor.

Returns:

A manifest where each entry is the original data entry with an added field num_words_key (default: “num_words”), indicating the number of words in the text_key field.
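The counting logic can be sketched in plain Python (a simplified illustration of the alphabet filtering and whitespace collapsing, not the processor’s actual implementation):

```python
import re

def count_num_words(text, alphabet=None):
    # Replace characters outside the alphabet with whitespace, collapse
    # whitespace runs, and count the remaining space-separated tokens.
    if alphabet is not None:
        text = "".join(c if c in alphabet else " " for c in text)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text.split(" ")) if text else 0

# "don't" splits into two tokens because the apostrophe is filtered out.
print(count_num_words("don't stop -- now!", alphabet="abcdefghijklmnopqrstuvwxyz "))  # 4
```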

sdp.processors.NormalizeText[source]#

This processor applies text normalization (TN) to the text. I.e. converts text from written form into its verbalized form. E.g., “$123” is converted to “one hundred and twenty-three dollars.”

Parameters:
  • input_text_key (str) – the text field that will be the input to the Normalizer. Defaults to: text.

  • input_language (str) – language specifying the text normalization rules, in ISO 639-1 format. E.g., “en”, “es”, “it”, etc. Defaults to: English.

  • input_case (str) – input text capitalization, set to cased if text contains capital letters. This flag affects normalization rules applied to the text. Note, lower_cased won’t lower case input. Defaults to: cased.

  • output_text_key (str) – the text field that will be the output from the Normalizer. Defaults to: text.

Returns:

This processor normalizes the text in the input_text_key field and saves the normalized text in output_text_key field.

Raises:

NotImplementedError – when TN is not implemented for the requested language.

sdp.processors.InverseNormalizeText[source]#

This processor applies inverse text normalization (ITN) to the text. I.e. transforms spoken forms of numbers, dates, etc into their written equivalents. E.g., “one hundred and twenty-three dollars.” is converted to “$123”.

Parameters:
  • input_text_key (str) – the text field that will be the input to the InverseNormalizer. Defaults to: text.

  • input_language (str) – language specifying the inverse text normalization rules, in ISO 639-1 format. E.g., “en”, “es”, “it”, etc. Defaults to: English.

  • input_case (str) – input text capitalization, set to cased if text contains capital letters. This flag affects normalization rules applied to the text. Note, lower_cased won’t lower case input. Defaults to: cased.

  • output_text_key (str) – the text field that will be the output from the InverseNormalizer. Defaults to: text.

Returns:

This processor inverse normalizes the text in the input_text_key field and saves the inverse normalized text in output_text_key field.

Raises:

NotImplementedError – when ITN is not implemented for the requested language.

sdp.processors.LambdaExpression[source]#

A dataset processor that evaluates a Python expression on each data entry and either stores the result in a new field or uses it as a filtering condition.

This processor is useful for dynamic field computation or conditional filtering of entries based on configurable expressions. It leverages evaluate_expression, which safely evaluates expressions using the abstract syntax tree (AST).

Filtering behavior:

If filter=True, the expression is evaluated for each entry. Only entries for which the expression evaluates to True are kept; all others are filtered out (removed from the output). If filter=False, the result of the expression is stored in the field specified by new_field for each entry (no filtering occurs).

Examples:

# Example 1: Filtering entries where the duration is greater than 5.0 seconds
LambdaExpression(
    new_field="keep",  # This field is ignored when filter=True
    expression="entry['duration'] > 5.0",
    lambda_param_name="entry",
    filter=True
)
# Only entries with duration > 5.0 will be kept in the output manifest.

# Example 2: Adding a new field with the number of words in the text
LambdaExpression(
    new_field="num_words",
    expression="len(entry['text'].split())",
    lambda_param_name="entry",
    filter=False
)
# Each entry will have a new field 'num_words' with the word count of the 'text' field.

Supported operations:

The expression supports a safe subset of Python operations, including:

  • Arithmetic: +, -, *, /, //, %, **

  • Comparisons: ==, !=, <, <=, >, >=, is, is not

  • Logical: and, or, not

  • Bitwise: |, &, ^, ~, <<, >>

  • Indexing and slicing: entry['key'], entry[0], entry[1:3]

  • Conditional (ternary) expressions: a if cond else b

  • List and dict literals: [a, b], {k: v}

  • Attribute access: entry.attr

  • Function calls (limited): max, min, len, sum, abs, sorted

For the full list, see the OPERATORS and SAFE_FUNCTIONS in sdp.utils.apply_operators. See also: https://docs.python.org/3/library/operator.html

Parameters:
  • new_field (str) – The name of the field to store the result of the expression (ignored if filter=True).

  • expression (str) – A Python expression to evaluate. It can reference fields of the data entry using the name specified by lambda_param_name (default: ‘entry’).

  • lambda_param_name (str, optional) – The name to refer to the current data entry in the expression. Default is “entry”.

  • filter (bool, optional) – If True, the expression result is treated as a condition. The entry is kept only if the result is True. Default is False.

  • **kwargs – Additional keyword arguments passed to the BaseParallelProcessor class.

Returns:

A line-delimited JSON manifest, where each line is a processed entry. The result may contain fewer entries than the input if filter=True.

Return type:

str
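A configuration sketch of the filtering mode (field names and the threshold are illustrative):

```yaml
- _target_: sdp.processors.LambdaExpression
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_filtered.json
  new_field: keep        # ignored when filter is true
  expression: "entry['duration'] > 5.0"
  filter: true
```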

sdp.processors.ListToEntries[source]#

A dataset processor that transforms a single entry containing a list of items into multiple entries, one for each item in the list.

This is useful when a manifest field (e.g., “segments”) contains a list of sub-entries, and you want to flatten these into individual records for further processing.

Parameters:
  • field_with_list (str) – The name of the field in the input entry that contains a list.

  • output_field (str, optional) – The name of the output field to assign to items in the list if they are not dictionaries. Required if the list contains primitive types (e.g., strings).

  • **kwargs – Additional arguments passed to the BaseParallelProcessor.

Raises:
  • TypeError – If the specified list field is not of type list.

  • ValueError – If the list items are not dictionaries and output_field is not provided.

Returns:

A manifest where each entry corresponds to one item in the original list from the input entry. This effectively transforms a single input entry containing a list of items into multiple standalone entries, each suitable for further dataset processing.

Example 1 (list of dicts)

- _target_: sdp.processors.ListToEntries
  input_manifest_file: ${workspace_dir}/input_manifest.json
  output_manifest_file: ${workspace_dir}/output_manifest.json
  field_with_list: "segments"

Input:

{
    "audio_filepath": "sample.wav",
    "segments": [
        {"start": 0.0, "end": 1.5, "text": "Hello"},
        {"start": 1.6, "end": 3.0, "text": "World"}
    ]
}

Output:

[
    {
        "audio_filepath": "sample.wav",
        "start": 0.0,
        "end": 1.5,
        "text": "Hello"
    },
    {
        "audio_filepath": "sample.wav",
        "start": 1.6,
        "end": 3.0,
        "text": "World"
    }
]

Example 2 (list of primitives)

- _target_: sdp.processors.ListToEntries
  input_manifest_file: ${workspace_dir}/input_manifest.json
  output_manifest_file: ${workspace_dir}/output_manifest.json
  field_with_list: "text_chunks"
  output_field: "text"

Input:

{
    "audio_filepath": "sample.wav",
    "text_chunks": [
        "Hello",
        "World"
    ]
}

Output:

[
    {
        "audio_filepath": "sample.wav",
        "text": "Hello"
    },
    {
        "audio_filepath": "sample.wav",
        "text": "World"
    }
]
sdp.processors.EstimateBandwidth[source]#

Adds estimated bandwidth to each utterance in the input manifest file.

Parameters:
  • audio_dir (str) – Root directory where audio files are stored.

  • input_audio_key (str) – Manifest key with relative audio paths.

  • output_bandwidth_key (str) – Manifest key to store estimated bandwidth in.

  • max_seconds (float) – The maximum length of audio to use for bandwidth estimation. By default, uses the first 30 seconds.

  • sample_rate (int) – Sample rate to resample audio to before doing bandwidth estimation. Defaults to 44100, upsampling the input audio as needed.

  • n_fft (int) – Number of FFT bins to use for bandwidth estimation. Defaults to 512.

  • hop_length (int) – Audio frame hop length to use for bandwidth estimation. Defaults to 441, corresponding to 0.01 seconds for 44100 sample rate.

  • top_db (float) – top_db threshold to use for bandwidth estimation.

  • frequency_threshold (float) – Bandwidth estimation finds the highest frequency with mean power spectrum that is within ‘frequency_threshold’ dB of its peak power. Defaults to -50 dB.

Returns:

This processor estimates the bandwidth of the audio file in the input_audio_key field and saves the estimate in the output_bandwidth_key field.

Example

- _target_: sdp.processors.EstimateBandwidth
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_bandwidth.json
  audio_dir: ${workspace_dir}/audio_22khz
  max_workers: 8
sdp.processors.CharacterHistogramLangValidator[source]#

A processor that filters text based on character histogram similarity to trusted data in the target language.

This processor computes the ratio of characters in a given text that are found in a reference character histogram for a specific language. If this ratio is below a certain threshold, the text is likely mislabeled or noisy.

Histograms are sourced from the NLLB paper (https://arxiv.org/pdf/2207.04672), see page 30 for methodology. This technique is a lightweight language ID filter, designed to catch mismatches between text content and claimed language.

Reference implementation: facebookresearch/fairseq

Parameters:
  • text_field (str) – Key in the data entry containing the text to evaluate.

  • lang_field (str, optional) – Key in the data entry that identifies the language. Required if lang is not provided.

  • lang (str, optional) – Language code to use for all entries (overrides lang_field). Required if lang_field is not provided.

  • threshold (float) – Threshold ratio to determine if text matches the histogram. Used only externally (not enforced in this processor).

  • cache_dir (str, optional) – Directory where histograms are downloaded and cached.

  • threshold_char (str) – Character used to truncate the histogram file (default is ‘]’).

  • output_score_field (str) – Key name under which the computed character match ratio will be stored.

  • **kwargs – Additional keyword arguments passed to BaseParallelProcessor.

Raises:

ValueError – If both lang and lang_field are provided, or if neither is provided. Also raised if histogram for specified language is missing.

Returns:

A manifest where each entry includes the additional field output_score_field with the character match ratio.

Example:

{
    "text": "hello world",
    "lang": "en",
    "hist_token_ratio": 0.95
}
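
A minimal config sketch using the parameters documented above (paths and field values are illustrative):

- _target_: sdp.processors.CharacterHistogramLangValidator
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_scored.json
  text_field: "text"
  lang: "en"
  cache_dir: ${workspace_dir}/histograms
  output_score_field: "hist_token_ratio"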

Data filtering#

sdp.processors.DropIfRegexMatch[source]#

Drops utterances if text matches a regex pattern.

Before applying regex checks, we will add a space character to the beginning and end of the text and pred_text keys for each data entry. After the regex checks, assuming the utterance isn’t dropped, the extra spaces are removed. This includes the spaces in the beginning and end of the text, as well as any double spaces "  ".

Parameters:
  • regex_patterns (list[str]) – a list of strings. The list will be traversed in order. If data_entry.data[self.text_key] matches the regex, the entry will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

Returns:

The same data as in the input manifest with some entries dropped.
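
For example, a config that drops utterances containing bracketed annotations (the patterns are illustrative):

- _target_: sdp.processors.DropIfRegexMatch
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_filtered.json
  regex_patterns:
    - "\\[.*\\]"
    - "\\(laughter\\)"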

sdp.processors.DropIfNoneOfRegexMatch[source]#

Drops utterances if data[self.text_key] does not match any of regex_patterns.

Before applying regex checks, we will add a space character to the beginning and end of the text and pred_text keys for each data entry. After the regex checks, assuming the utterance isn’t dropped, the extra spaces are removed. This includes the spaces in the beginning and end of the text, as well as any double spaces "  ".

Parameters:
  • regex_patterns (list[str]) – If data_entry[self.text_key] does not match any of the regex patterns in the list, that utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropNonAlphabet[source]#

Drops utterances if they contain characters that are not in the alphabet.

Parameters:
  • alphabet (str) – a string containing all of the characters in our alphabet. If an utterance contains at least one character that is not in the alphabet, then that utterance will be dropped.

  • text_key (str) –

    a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

    Note

    Don’t forget to include spaces in your alphabet, unless you want to make sure none of the utterances contain spaces.

Returns:

The same data as in the input manifest with some entries dropped.
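
For example, to keep only utterances made up of lowercase English letters, apostrophes, and spaces (the alphabet shown is illustrative):

- _target_: sdp.processors.DropNonAlphabet
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_filtered.json
  alphabet: " 'abcdefghijklmnopqrstuvwxyz"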

sdp.processors.DropOnAttribute[source]#

Drops utterances if attribute is set to True/False.

Parameters:
  • key (str) – which key to use for dropping utterances.

  • drop_if_false (bool) – whether to drop if value is False. Defaults to dropping if True.

Returns:

The same data as in the input manifest with some entries dropped.

ASR-based processors#

Note

All processors in this section depend on the output of sdp.processors.ASRInference, so make sure to include it at a prior stage of the config with an applicable ASR model.

Note

All processors in this section accept additional parameters text_key (defaults to “text”) and pred_text_key (defaults to “pred_text”) to control which fields contain transcription and ASR model predictions.

sdp.utils.BootstrapProcessor[source]#

This processor evaluates ASR performance metrics using bootstrapped confidence intervals.

It calculates metrics such as Word Error Rate (WER), Character Error Rate (CER), Word Match Rate (WMR), character rate, and word rate. When calculate_pairwise is set to True, it also computes the Probability of Improvement (POI) between different ASR models.

This implementation leverages bootstrapping to provide robust confidence intervals for each metric, helping to understand the variability in metric estimates and the likelihood that one model performs better than another.

Reference: “Bootstrap estimates for confidence intervals in ASR performance evaluation”, https://ieeexplore.ieee.org/document/1326009

Parameters:
  • bootstrap_manifest_files (List[str]) – A list of file paths to manifest files (in JSON Lines format) used for metric calculation. Each manifest file contains the ground truth and predicted transcriptions.

  • raw_data_dir (str) – The directory containing the data files referenced in the manifests.

  • output_file (str) – Path to the output JSON file where results will be saved.

  • num_bootstraps (int) – The number of bootstrap iterations to perform, which determines the reliability of the confidence intervals (default: 1000).

  • bootstrap_sample_ratio (float) – Proportion of the dataset size used for each bootstrap sample, allowing sub-sampling or over-sampling (default: 1.0, meaning full dataset).

  • calculate_pairwise (bool) – Whether to calculate pairwise differences in metric values between models and compute the Probability of Improvement (default: True).

  • metric_type (str) – Specifies the metric to calculate. Options include ‘wer’, ‘cer’, ‘wmr’, ‘charrate’, and ‘wordrate’ (default: ‘wer’).

  • text_key (str) – Key in the manifest that contains the ground truth text (default: ‘text’).

  • pred_text_key (str) – Key in the manifest that contains the predicted text (default: ‘pred_text’).

  • ci_lower (float) – The lower bound percentile for the confidence intervals (default: 2.5).

  • ci_upper (float) – The upper bound percentile for the confidence intervals (default: 97.5).

  • random_state (int) – Sets a random state for reproducibility of bootstrap sampling.

Returns:

Results saved in a JSON file at the specified output_file path, containing individual metric computations for each manifest file and pairwise comparisons between each model if calculate_pairwise is enabled.

Data modifications#

sdp.processors.InsIfASRInsertion[source]#

Processor that adds substrings to transcription if they are present in ASR predictions.

Will insert a substring into data[self.text_key] if it is present at the corresponding location in data[self.pred_text_key]. It is useful if words are systematically missing from ground truth transcriptions.

Parameters:
  • insert_words (list[str]) – list of strings that will be inserted into data[self.text_key] if there is an insertion (containing only that string) in data[self.pred_text_key].

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) –

    a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

    Note

    Because this processor looks for an exact match in the insertion, we recommend including variations with different spaces in insert_words, e.g. [' nemo', 'nemo ', ' nemo '].

Returns:

The same data as in the input manifest with <text_key> field changed.
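
A minimal config sketch (the insert_words values are illustrative):

- _target_: sdp.processors.InsIfASRInsertion
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_inserted.json
  insert_words: [" nemo", "nemo ", " nemo "]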

sdp.processors.SubIfASRSubstitution[source]#

Processor that substitutes substrings to transcription if they are present in ASR predictions.

Will convert a substring in data[self.text_key] to a substring in data[self.pred_text_key] if both are located in the same place (i.e. are part of a ‘substitution’ operation) and if the substrings correspond to key-value pairs in sub_words. This is useful if words are systematically incorrect in ground truth transcriptions.

Before starting to look for substitution, this processor adds spaces at the beginning and end of data[self.text_key] and data[self.pred_text_key], to ensure that an argument like sub_words = {"nmo ": "nemo "} would cause a substitution to be made even if the original data[self.text_key] ends with "nmo" and data[self.pred_text_key] ends with "nemo".

Parameters:
  • sub_words (dict) – dictionary where a key is a string that might be in data[self.text_key] and the value is the string that might be in data[self.pred_text_key]. If both are located in the same place (i.e. are part of a ‘substitution’ operation) then the key string will be converted to the value string in data[self.text_key].

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) –

    a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

    Note

    This processor looks for exact string matches of substitutions, so you may need to be careful with spaces in sub_words. E.g. it is recommended to do sub_words = {"nmo ": "nemo "} instead of sub_words = {"nmo" : "nemo"}.

Returns:

The same data as in the input manifest with <text_key> field changed.
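
A minimal config sketch (the sub_words mapping is illustrative; note the trailing spaces, as recommended above):

- _target_: sdp.processors.SubIfASRSubstitution
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_substituted.json
  sub_words:
    "nmo ": "nemo "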

Files management#

sdp.processors.SoxConvert[source]#

Processor for Sox to convert audio files to specified format.

Parameters:
  • output_manifest_file (str) – Path to the output manifest file.

  • input_audio_file_key (str) – Key in the manifest file that contains the path to the input audio file.

  • output_audio_file_key (str) – Key in the manifest file that contains the path to the output audio file.

  • converted_audio_dir (str) – Path to the directory where the converted audio files will be stored.

  • output_format (str) – Format of the output audio file.

  • rate (int) – Sample rate of the output audio file.

  • channels (int) – Number of channels of the output audio file.

  • workspace_dir (str, Optional) – Path to the workspace directory. Defaults to None.
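
A minimal config sketch that converts audio to 16 kHz mono wav (paths and field names are illustrative):

- _target_: sdp.processors.SoxConvert
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_wav.json
  input_audio_file_key: "audio_filepath"
  output_audio_file_key: "audio_filepath"
  converted_audio_dir: ${workspace_dir}/audio_wav
  output_format: "wav"
  rate: 16000
  channels: 1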

sdp.processors.FfmpegConvert[source]#

Processor for converting video or audio files to audio using FFmpeg and updating the dataset with the path to the resampled audio. If id_key is not None, the output file path will be <resampled_audio_dir>/<id_key>.wav. If id_key is None, the output file path will be <resampled_audio_dir>/<input file name without extension>.wav.

Note

id_key can be used to create subdirectories inside resampled_audio_dir (by using forward slashes /), e.g. if id_key takes the form dir_name1/dir_name2/filename, the output file path will be <resampled_audio_dir>/dir_name1/dir_name2/filename.wav.

Parameters:
  • converted_audio_dir (str) – The directory to store the resampled audio files.

  • input_file_key (str) – The field in the dataset representing the path to the input video or audio files.

  • output_file_key (str) – The field in the dataset representing the path to the resampled audio files with output_format. If id_key is None, the output file path will be <resampled_audio_dir>/<input file name without extension>.wav.

  • id_key (str) – (Optional) The field in the dataset representing the unique ID or identifier for each entry. If id_key is not None, the output file path will be <resampled_audio_dir>/<id_key>.wav. Defaults to None.

  • output_format (str) – (Optional) Format of the output audio files. Defaults to wav.

  • target_samplerate (int) – (Optional) The target sampling rate for the resampled audio. Defaults to 16000.

  • target_nchannels (int) – (Optional) The target number of channels for the resampled audio. Defaults to 1.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
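
A minimal config sketch (paths and field names are illustrative):

- _target_: sdp.processors.FfmpegConvert
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_resampled.json
  converted_audio_dir: ${workspace_dir}/resampled_audio
  input_file_key: "video_filepath"
  output_file_key: "audio_filepath"
  target_samplerate: 16000
  target_nchannels: 1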

sdp.processors.ExtractTar[source]#

A processor that extracts .tar archives for each entry in a dataset.

This processor reads a filepath to a tar archive from a specific field in the dataset entry, extracts the contents into a specified directory, and optionally appends the extracted file paths or the extraction directory to the entry under a new field.

Parameters:
  • field_to_tar_filepath (str) – The field in the input entry that contains the path to the .tar file.

  • extraction_dir (str) – The base directory where extracted files should be placed.

  • remove_source_tar (bool) – If True, deletes the original .tar file after successful extraction.

  • skip_invalid_filepaths (bool) – If True, logs and skips invalid paths instead of raising exceptions.

  • filepath_prefix_field (str) – Optional field in the entry used as a subdirectory prefix under extraction_dir.

  • output_filepath_field (str) – Field name where the output (path or list of paths) will be stored.

  • get_extracted_filepaths (bool) – If True, collects and returns a list of all extracted file paths.

Returns:

A manifest where each entry is updated with the path to the extracted files or directory.

sdp.processors.RemoveFiles[source]#

A processor that removes files or directories from the filesystem based on a filepath specified in the input data entry.

This processor is typically used for cleanup tasks after processing files.

Parameters:
  • filepath_field (str) – The key in the data entry that holds the path to the file or directory to remove.

  • drop_filepath_field (bool) – Whether to remove the filepath field from the resulting data entry. Defaults to True.

  • recursive (bool) – Whether to recursively remove files from directories. Defaults to False.

  • **kwargs – Additional arguments passed to the BaseParallelProcessor.

Returns:

A manifest where each entry is the same as the input, optionally without the filepath field, and with the file or directory at the specified path removed from disk.

Example entry before processing:

{
    "id": "abc123",
    "path_to_remove": "/tmp/some_file.wav"
}

Example entry after processing (if drop_filepath_field=True):

{
    "id": "abc123"
}
sdp.processors.ConvertToTarredAudioDataset[source]#

A processor for converting audio manifests into tarred audio datasets.

This processor optionally splits data into duration-based buckets, and calls the create_tar_datasets utility to convert and shard audio data into tar files, with accompanying manifest files.

Parameters:
  • output_manifest_file (str) – Path to the final output manifest.

  • input_manifest_file (str) – Path to the input manifest to be tarred.

  • **cfg_kwargs – Additional keyword arguments passed to the configuration dataclass.

Returns:

Writes a tarred and sharded audio dataset to disk.

  • The dataset consists of multiple .tar archives with audio files.

  • A final manifest (JSON lines format) is written to output_manifest_file, referencing each sample, its path inside the tar, and other metadata.

  • If buckets_num > 1, each sample will include an additional bucket_id field.

Note

If buckets_num > 1, the input manifest is split into multiple duration buckets, and each bucket is processed independently. A bucket_id is added to each sample.

You may need to install the extra dependencies of Lhotse and NeMo for this processor to work correctly: pip install lhotse "nemo-toolkit[common]"

Data filtering#

sdp.processors.PreserveByValue[source]#

Processor for preserving dataset entries based on a specified condition involving a target value and an input field.

Parameters:
  • input_value_key (str) – The field in the dataset entries to be evaluated.

  • target_value (Union[int, str]) – The value to compare with the input field.

  • operator (str) – (Optional) The operator to apply for comparison. Options: “lt” (less than), “le” (less than or equal to), “eq” (equal to), “ne” (not equal to), “ge” (greater than or equal to), “gt” (greater than). Defaults to “eq”.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
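
For example, to keep only entries whose duration field is greater than 1 (field and values are illustrative):

- _target_: sdp.processors.PreserveByValue
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_filtered.json
  input_value_key: "duration"
  target_value: 1
  operator: "gt"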

sdp.processors.DropASRError[source]#

Drops utterances if there is a sufficiently long ASR mismatch anywhere in the utterance.

Parameters:
  • consecutive_words_threshold (int) – will drop if there is a mismatch of at least this many words in a row.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropASRErrorBeginningEnd[source]#

Drops utterances if there is a sufficiently long ASR mismatch at the beginning or end of the utterance.

Parameters:
  • beginning_error_char_threshold (int) – if there is an insertion or deletion at the beginning of the utterance that has more characters than this number, then the utterance will be dropped. If there is a substitution at the beginning of the utterance, then the utterance will be dropped if abs(len(deletion) - len(insertion)) > beginning_error_char_threshold.

  • end_error_char_threshold (int) – if there is an insertion or deletion at the end of the utterance that has more characters than this number, then the utterance will be dropped. If there is a substitution at the end of the utterance, then the utterance will be dropped if abs(len(deletion) - len(insertion)) > end_error_char_threshold.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropIfSubstringInInsertion[source]#

Drops utterances if a substring matches an ASR insertion.

Insertions are checked between data[self.text_key] and data[self.pred_text_key].

Note

We check for exact matches, so you need to be mindful of spaces, e.g. you may wish to do substrings_in_insertion = ["nemo "] instead of substrings_in_insertion = ["nemo"].

Parameters:
  • substrings_in_insertion (list[str]) – a list of strings which might be inserted in predicted ASR text. If the insertion matches a string exactly, the utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropHighCER[source]#

Drops utterances if there is a sufficiently high character-error-rate (CER).

CER is measured between data[self.text_key] and data[self.pred_text_key].

Note

We only drop the utterance if CER > threshold (i.e. strictly greater than) so that if we set the threshold to 0, we will not remove utterances with CER == 0.

Parameters:
  • cer_threshold (float) – CER threshold above which the utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

Returns:

The same data as in the input manifest with some entries dropped.
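
A minimal config sketch (the threshold value is illustrative):

- _target_: sdp.processors.DropHighCER
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_filtered.json
  cer_threshold: 30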

sdp.processors.DropHighWER[source]#

Drops utterances if there is a sufficiently high word-error-rate (WER).

WER is measured between data[self.text_key] and data[self.pred_text_key].

Note

We only drop the utterance if WER > threshold (i.e. strictly greater than) so that if we set the threshold to 0, we will not remove utterances with WER == 0.

Parameters:
  • wer_threshold (float) – WER threshold above which the utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropLowWordMatchRate[source]#

Drops utterances if there is a sufficiently low word-match-rate (WMR).

WMR is measured between data[self.text_key] and data[self.pred_text_key].

Note

We only drop the utterance if WMR < threshold (i.e. strictly lower than) so that if we set the threshold to 100, we will not remove utterances with WMR == 100.

Parameters:
  • wmr_threshold (float) – WMR threshold below which the utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

  • pred_text_key (str) – a string indicating which key of the data entries should be used to access the ASR predictions. Defaults to “pred_text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropHighLowCharrate[source]#

Drops utterances if their character rate is too low or too high.

Character rate = (num of characters in self.text_key) / (duration of audio). A too-low or too-high character rate often implies that the ground truth transcription might be inaccurate.

Parameters:
  • high_charrate_threshold (float) – upper character rate threshold. If the character rate of an utterance is higher than this number, the utterance will be dropped.

  • low_charrate_threshold (float) – lower character rate threshold. If the character rate of an utterance is lower than this number, the utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

Returns:

The same data as in the input manifest with some entries dropped.
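
A minimal config sketch (the thresholds, in characters per second, are illustrative):

- _target_: sdp.processors.DropHighLowCharrate
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_filtered.json
  high_charrate_threshold: 21
  low_charrate_threshold: 1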

sdp.processors.DropHighLowWordrate[source]#

Drops utterances if their word rate is too low or too high.

Word rate = (num of words in self.text_key) / (duration of audio). A too-low or too-high word rate often implies that the ground truth transcription might be inaccurate.

Parameters:
  • high_wordrate_threshold (float) – upper word rate threshold. If the word rate of an utterance is higher than this number, the utterance will be dropped.

  • low_wordrate_threshold (float) – lower word rate threshold. If the word rate of an utterance is lower than this number, the utterance will be dropped.

  • text_key (str) – a string indicating which key of the data entries should be used to find the utterance transcript. Defaults to “text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropHighLowDuration[source]#

Drops utterances if their duration is too low or too high.

Parameters:
  • high_duration_threshold (float) – upper duration threshold (in seconds). If the duration of an utterance’s audio is higher than this number, the utterance will be dropped.

  • low_duration_threshold (float) – lower duration threshold (in seconds). If the duration of an utterance’s audio is lower than this number, the utterance will be dropped.

  • duration_key (str) – a string indicating which key of the data entries should be used to find the utterance duration. Defaults to “duration”.

Returns:

The same data as in the input manifest with some entries dropped.
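
A minimal config sketch (the thresholds, in seconds, are illustrative):

- _target_: sdp.processors.DropHighLowDuration
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_filtered.json
  high_duration_threshold: 20.0
  low_duration_threshold: 1.0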

sdp.processors.DropRepeatedFields[source]#

Drops utterances from the current manifest if their text fields are present in other manifests.

This class processes multiple manifest files and removes entries from the current manifest if the text field matches any entry in the other manifests. It allows for optional punctuation removal from the text fields before performing the check.

Note

It is better to process the Test/Dev/Train splits first and only then Other.tsv.

Parameters:
  • manifests_paths (list[str]) – List of paths to the manifest files to check against.

  • current_manifest_file (str) – Path to the current manifest file to be processed.

  • punctuations (str) – (Optional): String of punctuation characters to be removed from the text fields before checking for duplicates. Defaults to None.

  • text_key (str) – The key in the manifest entries that contains the text field. Defaults to “text”.

Returns:

The same data as in the input manifest with some entries dropped.

sdp.processors.DropDuplicates[source]#

Processor that drops all non-unique utterances associated with the specified key, keeping only the first occurrence.

Parameters:
  • drop_key (str) – A string specifying the key in the data entries used to determine uniqueness. Defaults to “text”.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseProcessor.

Returns:

A list of unique data entries after removing duplicates.
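
A minimal config sketch (drop_key is shown with its default value):

- _target_: sdp.processors.DropDuplicates
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_dedup.json
  drop_key: "text"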

sdp.processors.AcceptIfWERLess[source]#

This processor accepts Toloka assignments if the Word Error Rate (WER) is below a threshold.

It evaluates the WER between ground truth and predicted text for each assignment and accepts those that meet the specified threshold criteria.

Parameters:
  • input_data_file (str) – Path to the input data file containing API configurations.

  • input_pool_file (str) – Path to the input pool file containing pool configurations.

  • threshold (float) – The WER threshold below which assignments are accepted. Default: 75.

  • config_file (str) – Path to the configuration file. Default: None.

  • API_KEY (str) – The API key for authenticating with Toloka’s API. Default: None.

  • platform (str) – The Toloka platform to use. Default: None.

  • pool_id (str) – The ID of the Toloka pool. Default: None.

Returns:

A manifest with accepted assignments from Toloka based on the WER threshold.

Example

- _target_: sdp.processors.toloka.accept_if.AcceptIfWERLess
  input_manifest_file: ${workspace_dir}/result_manifest_pred_clean.json
  output_manifest_file: ${workspace_dir}/result_manifest_pred_review.json
  input_data_file: ${workspace_dir}/data_file.json
  input_pool_file: ${workspace_dir}/taskpool.json
  threshold: 50

sdp.processors.CreateTolokaPool[source]#

Creates a Toloka pool for a given project based on user-provided configurations.

This class connects to Toloka, loads necessary settings, creates a new pool, and optionally sets up quality control mechanisms for worker submissions.

Parameters:
  • lang (str) – The language filter for the pool. Default: ‘HY’.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.

Returns:

A newly created pool on the Toloka platform, configured and ready for task assignment.

sdp.processors.CreateTolokaProject[source]#

Creates a Toloka project based on user-provided configurations.

This class connects to Toloka, configures a new project with a name, description, and instructions, and saves the created project details for future use.

Parameters:
  • project_name (str) – The name of the project to be created.

  • project_description (str) – A description shown to Toloka workers about the project.

  • project_instructions (str) – Instructions provided to workers on how to complete assigned tasks.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.

Returns:

A project created on the Toloka platform, configured and ready for task and pool setup.

sdp.processors.CreateSentenceSet[source]#

Creates a set of sentences from a DOCX file by splitting its content into individual sentences.

This processor reads a DOCX file, extracts the full text, splits it into sentences based on the Armenian period character, and wraps each sentence into a DataEntry.

Parameters:

**kwargs – Additional arguments passed to the base BaseParallelProcessor class.

Returns:

A list of DataEntry objects, each containing a single extracted sentence.

sdp.processors.CreateTolokaTaskSet[source]#

Creates a set of tasks in a Toloka pool based on user-provided configurations and input data.

This class reads data from a manifest file, loads the target pool configuration, and uses Toloka’s API to create and upload tasks into the specified pool.

Parameters:
  • input_data_file (str) – Path to the input data file containing API configurations.

  • input_pool_file (str) – Path to the input pool file containing pool configurations.

  • limit (float) – Percentage of tasks to load from the manifest file. Default: 100.

Returns:

A set of tasks created and uploaded to the specified Toloka pool.

sdp.processors.GetTolokaResults[source]#

Fetches and stores results from a specified Toloka pool based on user-configured conditions.

This class connects to Toloka, retrieves task results from a specified pool, filters them by assignment status, and stores the results in the given output directory.

Parameters:
  • input_data_file (str) – Path to the input data file containing API configurations.

  • input_pool_file (str) – Path to the input pool file containing pool configurations.

  • output_dir (str) – Directory where the results will be stored.

  • status (str) – Status filter for assignments to retrieve (default: ‘ACCEPTED’).

  • config_file (str) – Path to a configuration file. Default: None.

  • API_KEY (str) – The API key for authenticating with Toloka’s API. Default: None.

  • platform (str) – The Toloka environment to use (‘PRODUCTION’ or ‘SANDBOX’). Default: None.

  • pool_id (str) – The ID of the Toloka pool to retrieve results from. Default: None.

Returns:

A set of task results from Toloka, stored in the specified output directory.

sdp.processors.RejectIfBanned[source]#

Rejects Toloka assignments if the user is banned.

This class connects to Toloka, checks the user’s ban status, and rejects any assignments from users who are identified as banned.

Parameters:
  • input_data_file (str) – Path to the input data file containing API configurations.

  • input_pool_file (str) – Path to the input pool file containing pool configurations.

  • config_file (str) – Path to the configuration file. Default: None.

  • API_KEY (str) – The API key for authenticating with Toloka’s API. Default: None.

  • platform (str) – The Toloka environment to use (‘PRODUCTION’ or ‘SANDBOX’). Default: None.

  • pool_id (str) – The ID of the Toloka pool to retrieve assignments from. Default: None.

Returns:

A list of rejected assignments for users who are banned.

sdp.processors.DetectWhisperHallucinationFeatures[source]#

Computes hallucination-related features for ASR model outputs (e.g., Whisper transcripts).

This processor analyzes the transcript text and flags common hallucination patterns by computing boolean features such as:
  • Repeated or low-diversity n-grams (hall_repeated_ngrams)

    Example:

    yes yes yes yes yes yes yes yes yes yes yes yes

  • Unusually long or disproportionately long words (hall_long_word)

    Example:

    short mid reallyreallyreallyreallyreallyreallyreallylong

  • Matches with known hallucinated phrases (hall_frequent_single_word)

    Example:

    lorem ipsum dolor sit amet


It appends these features to each entry in the manifest for downstream filtering or analysis.

Parameters:
  • common_hall_file (str) – Path to a file with known hallucinated phrases, one per line.

  • unique_words_threshold (float) – Maximum allowed share of unique words before marking as repeated. Default is 0.4.

  • long_word_threshold (int) – Minimum character length for a word to be considered “long”. Default is 25.

  • long_word_rel_threshold (float) – Relative length ratio between the longest and second-longest word. Default is 3.

  • char_rate_threshold (float) – [Unused in current logic, retained for compatibility]. Default is 4.

  • text_field (str) – Key in the data entry that contains the transcript. Default is ‘text’.

  • **kwargs – Additional keyword arguments passed to BaseParallelProcessor.

Returns:

A manifest with additional boolean fields for hallucination detection.
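
For example, a configuration along these lines could be used (the common_hall_file path is illustrative; the threshold values shown are the documented defaults):

- _target_: sdp.processors.DetectWhisperHallucinationFeatures
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_hall.json
  common_hall_file: ${workspace_dir}/common_hallucinations.txt
  unique_words_threshold: 0.4
  long_word_threshold: 25
  long_word_rel_threshold: 3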

sdp.processors.CleanQwenGeneration[source]#

A processor that filters and post-processes model generations, replacing them with reference text if they are considered low quality based on character error rate (CER) and uppercase letter proportion.

This processor is typically used after running a generation model (e.g., Qwen) to clean up outputs and ensure alignment with reference transcriptions.

Parameters:
  • cer_threshold (float) – Maximum allowable character error rate (CER) between the normalized generation and reference text. If exceeded, the generation is replaced by the reference.

  • upper_case_threshold (float) – Threshold for the proportion of uppercase letters in the generation. If the ratio exceeds this value, the generation is replaced.

  • generation_field (str) – Key in the input manifest for the model-generated text.

  • text_field (str) – Key in the input manifest for the reference (target) text.

  • **kwargs – Additional arguments passed to the BaseParallelProcessor.

Returns:

A manifest where each entry contains the cleaned generation in the specified generation field. If a replacement occurred, it is recorded in the metrics.

Metrics:
  • 1 if the generation was replaced with the reference text.

  • 0 if the generation was left as-is.
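
A possible configuration (the threshold values shown are illustrative, not documented defaults):

- _target_: sdp.processors.CleanQwenGeneration
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_clean.json
  cer_threshold: 0.3
  upper_case_threshold: 0.5
  generation_field: "generation"
  text_field: "text"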

sdp.processors.GetRttmSegments[source]#

This processor extracts audio segments based on RTTM (Rich Transcription Time Marked) files.

The class reads an RTTM file specified by the rttm_key in the input data entry and generates a list of audio segment start times. It ensures that segments longer than a specified duration threshold are split into smaller segments. The resulting segments are stored in the output data entry under the output_file_key.

Parameters:
  • rttm_key (str) – The key in the manifest that contains the path to the RTTM file.

  • output_file_key (str, optional) – The key in the data entry where the list of audio segment start times will be stored. Defaults to “audio_segments”.

  • duration_key (str, optional) – The key in the data entry that contains the total duration of the audio file. Defaults to “duration”.

  • duration_threshold (float, optional) – The maximum duration for a segment before it is split. Segments longer than this threshold will be divided into smaller segments. Defaults to 20.0 seconds.

Returns:

A list containing a single DataEntry object with the updated data entry, which includes the output_file_key containing the sorted list of audio segment start times.
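
For example (the rttm_key value is illustrative; the remaining values are the documented defaults):

- _target_: sdp.processors.GetRttmSegments
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_segments.json
  rttm_key: "rttm_filepath"
  output_file_key: "audio_segments"
  duration_key: "duration"
  duration_threshold: 20.0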

sdp.processors.SplitAudioFile[source]#

This processor splits audio files into segments based on provided timestamps.

The class reads an audio file specified by the input_file_key and splits it into segments based on the timestamps provided in the segments_key field of the input data entry. The split audio segments are saved as individual WAV files in the specified splited_audio_dir directory. The output_file_key field of the data entry is updated with the path to the corresponding split audio file, and the duration_key field is updated with the duration of the split audio segment.

Parameters:
  • splited_audio_dir (str) – The directory where the split audio files will be saved.

  • segments_key (str, optional) – The key in the manifest that contains the list of timestamps for splitting the audio. Defaults to “audio_segments”.

  • duration_key (str, optional) – The key in the manifest where the duration of the split audio segment will be stored. Defaults to “duration”.

  • input_file_key (str, optional) – The key in the manifest that contains the path to the input audio file. Defaults to “source_filepath”.

  • output_file_key (str, optional) – The key in the manifest where the path to the split audio file will be stored. Defaults to “audio_filepath”.

Returns:

A list of data entries, where each entry represents a split audio segment with the corresponding file path and duration updated in the data entry.
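
For example (the paths are illustrative; the key names are the documented defaults):

- _target_: sdp.processors.SplitAudioFile
  input_manifest_file: ${workspace_dir}/manifest_segments.json
  output_manifest_file: ${workspace_dir}/manifest_split.json
  splited_audio_dir: ${workspace_dir}/split_audio
  segments_key: "audio_segments"
  input_file_key: "source_filepath"
  output_file_key: "audio_filepath"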

Miscellaneous#

sdp.processors.AddConstantFields[source]#

This processor adds constant fields to all manifest entries using the Dask-based BaseParallelProcessor. It is useful when you want to attach fixed information (e.g., a language label or metadata) to each entry for downstream tasks such as language identification model training.

Parameters:

fields (dict) –

A dictionary containing key-value pairs of fields to add to each manifest entry. For example:

{
    "label": "en",
    "metadata": "mcv-11.0-2022-09-21"
}

Returns:

The same data as in the input manifest with the added constant fields as specified in the fields dictionary.

Return type:

dict

Example

- _target_: sdp.processors.modify_manifest.common.AddConstantFields
  input_manifest_file: ${workspace_dir}/input_manifest.json
  output_manifest_file: ${workspace_dir}/output_manifest.json
  fields:
    label: "en"
    metadata: "mcv-11.0-2022-09-21"
sdp.processors.CombineSources[source]#

Can be used to create a single field from several alternative sources.

E.g.:

_target_: sdp.processors.CombineSources
sources:
    - field: text_pc
      origin_label: original
    - field: text_pc_pred
      origin_label: synthetic
    - field: text
      origin_label: no_pc
target: text

will populate the text field with data from the text_pc field if it's present and not equal to n/a (this indicator can be customized). If text_pc is not available, it will populate text from the text_pc_pred field, following the same rules. If neither is available, it will fall back to the text field itself. In all cases it will record which source was used in the text_origin field, using the label from the origin_label key. If none of the sources is available, it will populate both the target and the origin fields with n/a.

Parameters:
  • sources (list[dict]) –

    list of the sources to use in order of preference. Each element in the list should be in the following format:

    {
        field: <which field to take the data from>
        origin_label: <what to write in the "<target>_origin" field>
    }
    

  • target (str) – target field that we are populating.

  • na_indicator (str) – if any source field has text equal to the na_indicator it will be considered as not available. If none of the sources are present, this will also be used as the value for the target and origin fields. Defaults to n/a.

Returns:

The same data as in the input manifest enhanced with the following fields:

<target>: <populated with data from either <source1> or <source2> or with <na_indicator> if none are available>
<target>_origin: <label that marks where the data came from>

sdp.processors.DuplicateFields[source]#

This processor duplicates fields in all manifest entries.

It is useful for when you want to do downstream processing of a variant of the entry. E.g. make a copy of “text” called “text_no_pc”, and remove punctuation from “text_no_pc” in downstream processors.

Parameters:

duplicate_fields (dict) – dictionary where keys are the original fields to be copied and their values are the new names of the duplicate fields.

Returns:

The same data as in the input manifest with duplicated fields as specified in the duplicate_fields input dictionary.

Example

- _target_: sdp.processors.modify_manifest.common.DuplicateFields
  input_manifest_file: ${workspace_dir}/test1.json
  output_manifest_file: ${workspace_dir}/test2.json
  duplicate_fields: {"text":"answer"}
sdp.processors.RenameFields[source]#

This processor renames fields in all manifest entries.

Parameters:

rename_fields – dictionary where keys are the fields to be renamed and their values are the new names of the fields.

Returns:

The same data as in the input manifest with renamed fields as specified in the rename_fields input dictionary.
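
Example (the field names are illustrative):

- _target_: sdp.processors.RenameFields
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_renamed.json
  rename_fields: {"text_pc": "text"}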

sdp.processors.SplitOnFixedDuration[source]#

This processor splits audio into fixed-length segments.

It does not actually create different audio files, but simply adds corresponding offset and duration fields. These fields can be automatically processed by NeMo to split audio on the fly during training.

Parameters:
  • segment_duration (float) – fixed desired duration of each segment.

  • drop_last (bool) – whether to drop the last segment if the total duration is not divisible by the desired segment duration. If False, the last segment will be shorter than segment_duration. Defaults to True.

  • drop_text (bool) – whether to drop text from entries as it is most likely inaccurate after the split on duration. Defaults to True.

Returns:

The same data as in the input manifest but all audio that’s longer than the segment_duration will be duplicated multiple times with additional offset and duration fields. If drop_text=True will also drop text field from all entries.
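
Example (the segment duration is illustrative):

- _target_: sdp.processors.SplitOnFixedDuration
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_split.json
  segment_duration: 20.0
  drop_last: True
  drop_text: True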

sdp.processors.ChangeToRelativePath[source]#

This processor changes the audio filepaths to be relative.

Parameters:

base_dir – typically the folder where the manifest file is going to be stored. All paths will be made relative to that folder.

Returns:

The same data as in the input manifest with audio_filepath key changed to contain relative path to the base_dir.
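
Example (the base directory is illustrative):

- _target_: sdp.processors.ChangeToRelativePath
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_relative.json
  base_dir: ${workspace_dir}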

sdp.processors.SortManifest[source]#

Processor that sorts the manifest by a specified attribute.

Parameters:
  • attribute_sort_by (str) – the attribute by which the manifest will be sorted.

  • descending (bool) – if set to False, attribute will be in ascending order. If True, attribute will be in descending order. Defaults to True.

Returns:

The same entries as in the input manifest, but sorted based on the provided parameters.
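
Example (sorting by duration, longest entries first; paths are illustrative):

- _target_: sdp.processors.SortManifest
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_sorted.json
  attribute_sort_by: "duration"
  descending: True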

sdp.processors.KeepOnlySpecifiedFields[source]#

Saves a copy of a manifest but only with a subset of the fields.

Typically will be the final processor to save only relevant fields in the desired location.

Parameters:

fields_to_keep (list[str]) – list of the fields in the input manifest that we want to retain. The output file will only contain these fields.

Returns:

The same data as in input manifest, but re-saved in the new location with only fields_to_keep fields retained.
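
Example (the field list is illustrative):

- _target_: sdp.processors.KeepOnlySpecifiedFields
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/final_manifest.json
  fields_to_keep: ["audio_filepath", "text", "duration"]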

sdp.processors.GetAudioDuration[source]#

Processor that computes the duration of the file in audio_filepath_key (using soundfile) and saves the duration in duration_key. If there is an error computing the duration, the value at duration_key will be updated with the value -1.0.

Parameters:
  • audio_filepath_key (str) – Key to get path to wav file.

  • duration_key (str) – Key under which to store the audio duration.

Returns:

All the same fields as in the input manifest plus duration_key
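
Example (the key names follow the usual NeMo manifest convention and are illustrative):

- _target_: sdp.processors.GetAudioDuration
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_durations.json
  audio_filepath_key: "audio_filepath"
  duration_key: "duration"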

sdp.processors.CreateInitialManifestByExt[source]#

Processor for creating an initial dataset manifest by saving filepaths with a common extension to the field specified in output_file_key.

Parameters:
  • raw_data_dir (str) – The root directory of the files to be added to the initial manifest. This processor will recursively look for files with the extension ‘extension’ inside this directory.

  • output_file_key (str) – The key to store the paths to the files in the dataset.

  • extension (str) – The file extension of the files to be added to the manifest.

  • **kwargs – Additional keyword arguments to be passed to the base class BaseParallelProcessor.
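
Example (the extension and key name are illustrative):

- _target_: sdp.processors.CreateInitialManifestByExt
  raw_data_dir: ${workspace_dir}/raw_data
  output_manifest_file: ${workspace_dir}/initial_manifest.json
  extension: "wav"
  output_file_key: "audio_filepath"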

sdp.processors.ApplyInnerJoin[source]#

Applies inner join to two manifests, i.e. creates a manifest from records that have matching values in both manifests. For more information, please refer to the Pandas merge function documentation: https://pandas.pydata.org/docs/reference/api/pandas.merge.html#pandas.merge

Parameters:
  • column_id (Union[str, List[str], None]) – Field names to join on. These must be found in both manifests. If column_id is None then this defaults to the intersection of the columns in both manifests. Defaults to None.

  • left_manifest_file (Optional[str]) – path to the left manifest. Defaults to input_manifest_file.

  • right_manifest_file (str) – path to the right manifest.

Returns:

Inner join of two manifests.
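
Example (joining on a shared audio_filepath column; the paths are illustrative):

- _target_: sdp.processors.ApplyInnerJoin
  left_manifest_file: ${workspace_dir}/manifest_a.json
  right_manifest_file: ${workspace_dir}/manifest_b.json
  output_manifest_file: ${workspace_dir}/manifest_joined.json
  column_id: "audio_filepath"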

sdp.processors.CreateCombinedManifests[source]#

Reads JSON lines from specified files and creates a combined manifest.

This processor iterates over files listed in manifest_list, reads each file line by line, and yields the parsed JSON data from each line.

Parameters:
  • manifest_list (list[str]) – A list of file paths or directories to process. The processor will recursively read files within the directories and expect each file to contain JSON data.

  • **kwargs – Additional keyword arguments passed to the base class BaseParallelProcessor.

Returns:

A generator that yields parsed JSON data from each line in the files listed in manifest_list.
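
Example (the paths are illustrative):

- _target_: sdp.processors.CreateCombinedManifests
  output_manifest_file: ${workspace_dir}/combined_manifest.json
  manifest_list:
    - ${workspace_dir}/manifest1.json
    - ${workspace_dir}/manifest2.json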

sdp.processors.tts.split.SplitLongAudio[source]#

This processor splits long audio files into smaller segments.

It processes audio files that exceed a specified maximum length by splitting them into smaller segments at natural pauses in the audio to maintain speech coherence.

Parameters:
  • suggested_max_len (float) – Target maximum length for audio segments in seconds. Defaults to 3600

  • min_pause_len (float) – Minimum length of a pause to consider for splitting in seconds. Defaults to 1.0

  • min_len (float) – Minimum length for any split segment in seconds. Defaults to 1.0

Returns:

The same data as in the input manifest, but with long audio files split into multiple segments with updated paths and durations.

Example

- _target_: sdp.processors.tts.split.SplitLongAudio
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_split.json
  suggested_max_len: 3600
sdp.processors.tts.split.JoinSplitAudioMetadata[source]#

A processor for joining metadata of previously split audio files.

This processor combines the metadata (transcripts and alignments) of audio files that were previously split by the SplitLongAudio processor. It adjusts timestamps and concatenates transcripts to recreate the original audio’s metadata.

Parameters:

None

Returns:

The same data as in the input manifest, but with split audio files joined together.

sdp.processors.tts.merge_alignment_diarization.MergeAlignmentDiarization[source]#

This processor merges alignment and diarization information from a manifest file.

It takes a manifest file containing both alignment and diarization information and merges the alignment information into the diarization segments.

Parameters:

None

Returns:

The same data as in the input manifest, but with alignment information merged into the diarization segments.

Example

- _target_: sdp.processors.tts.merge_alignment_diarization.MergeAlignmentDiarization
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_merged.json
sdp.processors.tts.prepare_tts_segments.PrepareTTSSegmentsProcessor[source]#

This processor merges adjacent segments from the same speaker and splits segments to have a complete utterance.

It processes segments by merging those from the same speaker that are adjacent, then splits segments based on duration limits, punctuation marks, and audio quality metrics like bandwidth.

Parameters:
  • min_duration (float) – Minimum duration in seconds for a segment. Defaults to 5

  • max_duration (float) – Maximum duration in seconds for a segment. Defaults to 20

  • max_pause (float) – Maximum pause duration in seconds between merged segments. Defaults to 2

  • terminal_punct_marks (str) – String containing punctuation marks to split on. Defaults to “.!?。??!。”

  • punctuation_split_only (bool) – Whether to only split on punctuation. Defaults to False

Returns:

The same data as in the input manifest, but with segments merged and split according to the specified parameters.

Example

- _target_: sdp.processors.tts.prepare_tts_segments.PrepareTTSSegmentsProcessor
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_processed.json
  min_duration: 5
  max_duration: 20
sdp.processors.ipl.nemo_run_processor.NemoRunIPLProcessor[source]#

A processor that handles Iterative Pseudo-Labeling (IPL) training workflow.

Parameters:
  • config_path (str) – Path to the YAML configuration file containing IPL settings

  • output_manifest_file (str) – Path where the output manifest file will be written

  • input_manifest_file (str, Optional) – Path to the input manifest file

sdp.processors.ipl.ipl_processors.TrainingCommandGenerator[source]#

A processor that generates training commands for NeMo models with support for both local and cluster configurations. Handles manifest file updates and tarred audio filepath management for training datasets.

Parameters:
  • training_config_local (str) – Path to the local machine configuration file

  • training_config_cluster (str) – Path to the cluster configuration file

  • training_script_path (str) – Path to the training script relative to nemo_directory

  • nemo_directory (str) – Base directory for NeMo framework

  • new_manifest_files (str, Optional) – New manifest files to add to the training configuration

  • new_tarred_audio_filepaths (str, Optional) – New tarred audio filepaths to add to the training configuration

  • **kwargs – Additional arguments passed to the parent BaseProcessor class

sdp.processors.ipl.ipl_processors.InferenceCommandGenerator[source]#

A processor that generates inference commands for pseudo-labeling.

Parameters:
  • nemo_directory (str) – Base directory for NeMo framework

  • inference_local_config (str) – Path to the local configuration file

  • inference_config_paths (str) – Path to the inference configuration files

  • manifests (str) – Path to the manifest files

  • p_cache (float) – Fraction of the pseudo-labels to update

  • num_gpus (int) – Number of GPUs to use

  • is_tarred (bool) – Whether the audio is tarred

  • first_run (bool) – Whether this is the first run of pseudo-labeling

  • **kwargs – Additional arguments passed to the parent BaseProcessor class

sdp.processors.DropSpecifiedFields[source]#

A processor that removes specified fields from each data entry in the manifest.

This processor reads an input manifest line by line, drops the fields listed in fields_to_drop from each JSON entry, and writes the cleaned entries to the output manifest.

Parameters:
  • fields_to_drop (List[str]) – A list of keys to remove from each manifest entry.

  • **kwargs – Additional arguments passed to the BaseProcessor (e.g., input/output manifest paths).

Returns:

A line-delimited JSON manifest, where each entry is the same as the input, but with the specified fields removed.
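
Example (the dropped field names are illustrative):

- _target_: sdp.processors.DropSpecifiedFields
  input_manifest_file: ${workspace_dir}/manifest.json
  output_manifest_file: ${workspace_dir}/manifest_clean.json
  fields_to_drop: ["text_origin", "pred_text"]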

Base classes#

This section lists all the base classes you might need to know about if you want to add new SDP processors.

BaseProcessor#

class sdp.processors.base_processor.BaseProcessor(output_manifest_file: str, input_manifest_file: str | None = None, **kwargs)[source]#

Bases: ABC

Abstract class for SDP processors.

All processor classes inherit from the BaseProcessor class. This is a simple abstract class which has 2 empty methods: process() and test().

These serve to remind us that SDP essentially just runs .test() on all processors (to implement run-time tests), and then .process() on all processors.

Parameters:
  • output_manifest_file (str) – path of where the output manifest file will be located. Cannot have the same value as input_manifest_file.

  • input_manifest_file (str) – path of where the input manifest file is located. This arg is optional – some processors may not take in an input manifest because they need to create an initial manifest from scratch (i.e., from some transcript file that is in a format different to the NeMo manifest format). Cannot have the same value as output_manifest_file.

abstractmethod process()[source]#

Should be overridden by the child classes to implement some data processing.

test()[source]#

This method can be used to perform “runtime” tests.

This can be any kind of self-consistency test, but it is usually in the form of checking that provided input test data entries match provided output test data entries.

There are no tests by default.

BaseParallelProcessor#

class sdp.processors.base_processor.BaseParallelProcessor(input_manifest_file: str | None = None, output_manifest_file: str | None = None, max_workers: int = -1, chunksize: int = 100, in_memory_chunksize: int = 100000, test_cases: List[Dict] | None = None, use_dask: bool = True, dask_client=None, **kwargs)[source]#

Bases: BaseProcessor

A processor that performs per-entry processing in parallel (using Dask or multiprocessing).

Parameters:
  • input_manifest_file (str) – Path to the input manifest file.

  • output_manifest_file (str) – Path where the output manifest file will be written.

  • max_workers (int) – Maximum number of workers.

  • chunksize (int) – Chunk size used for parallel routines.

  • in_memory_chunksize (int) – Maximum number of entries to load at once.

  • test_cases (list[dict]) – Optional list of test cases.

  • use_dask (bool) – If True, use Dask for parallelization; otherwise, use multiprocessing.

  • dask_client – (Optional) An existing Dask client.

prepare()[source]#

Can be used in derived classes to prepare the processing.

process()[source]#

Dispatches to either the Dask-based or the multiprocessing-based implementation, depending on the use_dask setting.

_process_with_dask(metrics)[source]#
_process_with_multiprocessing(metrics)[source]#
_chunk_manifest()[source]#

Splits the input manifest into chunks of in_memory_chunksize size. Only used in non-Dask (multiprocessing) mode.

read_manifest()[source]#

Reads entries from the input manifest.

Behavior depends on the parallelization mode:
  • When use_dask is True:

    If the input_manifest_file exists and is non-empty, returns a Dask bag (reading in 256KB blocks). Otherwise, logs the condition and returns an empty Dask bag.

  • When use_dask is False:

    If the input_manifest_file does not exist or is empty, logs the condition and returns an empty iterator. Otherwise, opens the file in text mode, strips each line, and yields the parsed JSON from non-empty lines.

This unified behavior lets the processor run even in manifest-creation mode.

abstractmethod process_dataset_entry(data_entry) List[Any][source]#

Must be implemented in derived classes. For each data entry, return a list of DataEntry objects.

finalize(metrics: List[Any])[source]#

Outputs metrics about the processed data.

test()[source]#

Applies processing to each test case and raises an error if the output does not match expected output.

Runtime tests#

Before running the specified processors, SDP runs processor.test() on all specified processors. A test method is provided in sdp.processors.base_processor.BaseParallelProcessor.test(), which checks that for a given input data entry, the output data entry/entries produced by the processor will match the expected output data entry/entries. Note that this essentially only checks that the impact on the data manifest will be as expected. If you want to do some other checks, you will need to override this test method.

The input data entry and the expected output data entry/entries for sdp.processors.base_processor.BaseParallelProcessor.test() are specified inside the optional list of test_cases that were provided in the object constructor. This means you can provide test cases in the YAML config file, and the dataset will only be processed if the test cases pass.

This is helpful to (a) make sure that the rules you wrote have the effect you desired, and (b) demonstrate why you wrote those rules. An example of test cases we could include in the YAML config file:

- _target_: sdp.processors.DropIfRegexMatch
regex_patterns:
    - "(\\D ){5,20}" # looks for between 4 and 19 characters surrounded by spaces
test_cases:
    - {input: {text: "some s p a c e d out letters"}, output: null}
    - {input: {text: "normal words only"}, output: {text: "normal words only"}}